Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. The main goal of NLP is to "understand" natural language in order to perform useful tasks, such as question answering.
Some examples of NLP applications:
- Spell checking, keyword search, finding synonyms
- Extracting information from websites such as time, product price, dates, location, people or company names
- Classifying texts
- Text summarisation
- Finding similar texts
- Sentiment analysis
- Machine translation
- Search
- Spoken dialog systems
- Complex query answering
- Speech recognition
Text can be analyzed at different levels: phonemes, morphemes, words, sub-sentences, sentences, paragraphs, and whole documents.
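As a minimal sketch of working at two of these levels (sentences and words), here is a naive segmentation using only Python's standard library; real tokenizers handle abbreviations, punctuation, and contractions far more carefully.

```python
import re

text = "NLP is hard. Ambiguity is one reason. Vagueness is another."

# Naive sentence segmentation: split after sentence-final punctuation.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Naive word tokenization: runs of letters within each sentence.
words = [re.findall(r"[A-Za-z]+", s) for s in sentences]

print(sentences)  # three sentences
print(words[0])   # ['NLP', 'is', 'hard']
```

Lower levels (phonemes, morphemes) and higher ones (paragraphs, documents) need dedicated models rather than regular expressions.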
From a linguistic point of view, analysis can be done at these levels:
- Syntax (what is grammatical)
- Semantics (what it means)
- Pragmatics (what it does)
Many clever algorithms have been developed for various tasks:
- Hidden Markov Models (for speech recognition)
- Conditional Random Fields (for part of speech tagging)
- Latent Dirichlet Allocation (for topic modeling)
NLP is hard, first of all because of:
- ambiguity - more than one possible (precise) interpretation (e.g. "Foreigners are hunting dogs")
- vagueness - the text does not specify full information
- uncertainty - due to imperfect statistical models
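The ambiguity in the example sentence can be made explicit by writing out its two parse structures; the tree labels below are conventional phrase-structure tags, and the nested-tuple encoding is just an illustration, not any parser's actual output format.

```python
sentence = "Foreigners are hunting dogs"

# Reading 1: "hunting" is a verb - foreigners hunt dogs.
reading_verb = ("S", ("NP", "Foreigners"),
                     ("VP", ("AUX", "are"), ("V", "hunting"), ("NP", "dogs")))

# Reading 2: "hunting dogs" is a noun phrase - foreigners ARE hunting dogs.
reading_noun = ("S", ("NP", "Foreigners"),
                     ("VP", ("V", "are"),
                            ("NP", ("ADJ", "hunting"), ("N", "dogs"))))

# Same surface string, two distinct structures:
print(reading_verb != reading_noun)  # True
```

A parser has to choose between such structures, and nothing in the string itself forces one reading over the other; that choice requires context or world knowledge.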
In the mid-2010s, neural networks became successful in NLP. Why did that happen?
I'll describe the main ideas of deep learning techniques for NLP in the next post :)