I read a very interesting article on Ars Technica about how Selerity reported Twitter’s earnings before Twitter did. What stood out was that it was done using advanced NLP and text mining.
Algorithms and machines are getting more advanced by the day and I agree with Andrew Brook (writer of the article) that this is just the tip of the iceberg of what text analytics can do.
However, I also see some serious challenges ahead:
1) NLP in non-English languages
The English language has a well-understood grammatical structure, and there are a large number of well-annotated corpora available for building great NLP tools. However, other languages like Chinese, Thai and Japanese have very different grammar, and the meaning of a word often depends on the other words in the sentence.
While research is definitely progressing in this area, it will take some time (my guesstimate is 5 to 10 years) to reach the same level as NLP for English.
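To make the gap concrete, here is a minimal sketch of one reason English tooling does not transfer directly. It assumes nothing beyond the Python standard library, and the two sentences are my own illustrative examples, not taken from the article. Chinese is written without spaces between words, so even the first step of a pipeline (splitting text into words) needs language-specific work.

```python
# Illustrative sentences (my own examples, not from the article).
english = "I like natural language processing"
chinese = "我喜欢自然语言处理"  # roughly the same meaning, written without spaces


def whitespace_tokens(text):
    """Naive tokenizer: split on whitespace only."""
    return text.split()


english_tokens = whitespace_tokens(english)
chinese_tokens = whitespace_tokens(chinese)

print(english_tokens)  # the English sentence splits into separate words
print(chinese_tokens)  # the Chinese sentence stays as one unsegmented chunk
```

A real system would swap the naive splitter for a dedicated word segmenter for each language, which is exactly the kind of extra effort that makes non-English NLP harder.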
What do you think? I would love to hear examples of advanced NLP work that is already being done for non-English languages.
2) Domain specific phrasing and terminology
Context was key in being able to interpret the meaning behind the quarterly results and the article gives great examples of the challenges in processing the data. I think that NLP and text mining will continue to require specific development for specific domains / industries.
Web ontologies hold some promise, but building and processing the vast set of relationships needed just to associate and interpret concepts across domains would take a huge effort.
I guess this will result in NLP and text analytics remaining domain specific in the foreseeable future.
This may not be a problem for formal reports such as financial, medical and other professional publications. But to crack the jungle that is user generated content (UGC), such as reviews, comments and Facebook posts, any future NLP system that seeks to mine data from it will need to understand colloquialisms.
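Here is a toy sketch of why I think domain and colloquial context matter so much. Everything in it is hypothetical: the `DOMAIN_LEXICONS` table, its scores and the `score` helper are made up for illustration and do not come from the article or from any real library. The point is only that the same word can carry opposite meaning depending on where the text comes from.

```python
# Hypothetical per-domain word scores (made up for illustration only).
DOMAIN_LEXICONS = {
    "medical": {"sick": -1, "killer": -1},
    "ugc": {"sick": 1, "killer": 1},  # slang: "sick" and "killer" as praise
}


def score(text, domain):
    """Sum the scores of known words using the lexicon for the given domain."""
    lexicon = DOMAIN_LEXICONS[domain]
    words = text.lower().split()
    return sum(lexicon.get(word, 0) for word in words)


review = "that concert was sick"
medical_score = score(review, "medical")  # read as a medical text
ugc_score = score(review, "ugc")          # read as a fan review

print(medical_score, ugc_score)
```

The same sentence scores negative under the medical lexicon and positive under the UGC one, so a system tuned for one domain can misread the other entirely.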
I am sure that there are other challenges, but I believe that once these two are overcome, the remaining obstacles will be relatively easy to deal with.
I have always been fascinated with NLP and text mining and I see great possibilities in this area. Looking forward to reading about the next business question answered through text analytics!