Text Mining and its impact

I read a very interesting article on Ars Technica about how Selerity reported Twitter’s earnings before Twitter did. What stood out was that it was done using advanced NLP and text mining.

http://arstechnica.com/business/2015/05/03/how-selerity-reported-twitters-2q15-earnings-before-twitter-did/

Algorithms and machines are getting more advanced by the day, and I agree with Andrew Brook (the article's author) that this is just the tip of the iceberg of what text analytics can do.

However, I also see some serious challenges ahead:

1) NLP in non-English languages

English has a well-understood grammatical structure, and a large number of well-annotated corpora are available for building great NLP tools. However, other languages such as Chinese, Thai and Japanese have very different grammar, and the meaning of a word often depends on the other words in the sentence.

While research is definitely progressing in this area, it will take some time (my guesstimate is 5 to 10 years) to reach the same level as NLP for English.
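To make the language gap concrete, here is a small sketch of one well-known hurdle: Chinese, Thai and Japanese are normally written without spaces between words, so even splitting a sentence into words is a task of its own. The sentences and the `whitespace_tokens` helper are my own illustration (the Chinese line is a rough translation of the English one), not something taken from the article.

```python
# Splitting on whitespace is a crude but workable tokenizer for English.
# It fails for languages that are written without spaces between words.

def whitespace_tokens(text):
    """Split text on whitespace (hypothetical helper for this illustration)."""
    return text.split()

english = "Twitter reported quarterly earnings today"
# Rough, illustrative translation of the sentence above.
chinese = "推特今天公布了季度财报"

print(whitespace_tokens(english))  # five separate words
print(whitespace_tokens(chinese))  # the whole sentence comes back as one "word"
```

A real system would need a dedicated word segmenter for such languages before any of the usual downstream steps can run.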

What do you think? I would love for anyone to share examples of advanced NLP work already being done for non-English languages.

2) Domain specific phrasing and terminology

Context was key to interpreting the meaning behind the quarterly results, and the article gives great examples of the challenges in processing the data. I think that NLP and text mining will continue to require specific development for specific domains / industries.

Web ontologies hold some promise, but building and processing the vast number of relationships involved would take a huge effort, just to be able to associate and interpret concepts across domains.

I guess this means NLP and text analytics will remain domain specific for the foreseeable future.
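As a toy illustration of why the domain matters, the sketch below looks up the same term in two different domain vocabularies. The `DOMAIN_LEXICONS` table and the `interpret` function are hypothetical names I made up for this post; a real system would need far richer resources than a hand-written dictionary.

```python
# Hypothetical, hand-written lexicons: the same word can carry a different
# meaning (or none at all) depending on the domain of the text.
DOMAIN_LEXICONS = {
    "finance": {"beat": "results above expectations",
                "miss": "results below expectations"},
    "medical": {"positive": "condition detected",
                "negative": "condition not detected"},
}

def interpret(term, domain):
    """Return the domain-specific reading of a term, or 'unknown'."""
    return DOMAIN_LEXICONS.get(domain, {}).get(term.lower(), "unknown")

print(interpret("positive", "medical"))  # a test result, not a sentiment
print(interpret("positive", "finance"))  # not in this toy finance lexicon
print(interpret("Miss", "finance"))      # earnings fell short
```

Every new domain means another vocabulary like this to build and maintain, which is the effort I am referring to above.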

3) Colloquialism

This may not be a problem for formal documents such as financial reports, medical reports and other professional publications. But to crack the jungle that is user generated content (UGC), e.g. reviews, comments and Facebook posts, any future NLP system that seeks to mine it will need to understand colloquialisms.
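A very rough sketch of what a first step could look like: replacing known slang with standard words before the text goes to the rest of the pipeline. The `SLANG` table and `normalize` function are assumptions of mine for illustration only. Real colloquial language changes far too quickly for a fixed lookup table to keep up, which is exactly the challenge.

```python
import re

# Hypothetical slang table; a real one would be much larger and would
# need constant updating.
SLANG = {"gr8": "great", "u": "you", "thx": "thanks", "luv": "love"}

def normalize(text):
    """Replace known slang words, leaving everything else untouched."""
    def swap(match):
        word = match.group(0)
        return SLANG.get(word.lower(), word)
    # \w+ matches runs of letters and digits, so "gr8" is treated as one word.
    return re.sub(r"\w+", swap, text)

print(normalize("Thx, luv this phone!"))
print(normalize("u r gr8"))
```

Note that "r" is left as it is because it is not in the table, which shows how quickly a dictionary approach runs out of steam.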

I am sure there are other challenges, but I believe that once these three are overcome, the remaining obstacles will be relatively easy to tackle.

I have always been fascinated by NLP and text mining, and I see great possibilities in this area. Looking forward to reading about the next business question answered through text analytics!

Enjoy!
