Friday, May 13, 2016

Distributed Word Representation

Today we will talk about the main "building block" of deep learning applications for NLP: vectors.
Every unit of text (a phoneme, a word, a sub-sentence, a sentence, even a whole document) can be represented as a vector. I find that really cool.
How do we get this representation?
The most straightforward way is to build a word-document matrix. This matrix will be sparse, so the next step should be dimensionality reduction (e.g., SVD).
The main problem here is the expensive computation: for an n x m matrix with n < m, the cost of SVD is O(m x n x n), i.e. quadratic in the smaller dimension.
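To make the word-document matrix concrete, here is a minimal sketch in plain Python. The three-document corpus and the helper name build_word_document_matrix are made up for illustration. The SVD step itself is not shown; in practice you would hand the matrix to a linear algebra library.

```python
from collections import Counter

# Toy corpus: three tiny "documents", invented for this sketch.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell on the market",
]

def build_word_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word i in document j."""
    counts = [Counter(doc.split()) for doc in documents]
    vocabulary = sorted(set(word for c in counts for word in c))
    # A Counter returns 0 for missing words, so absent words become zero cells.
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix

vocabulary, matrix = build_word_document_matrix(docs)

# Fraction of zero cells in the matrix.
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (len(vocabulary) * len(docs))

for word, row in zip(vocabulary, matrix):
    print(f"{word:>8} {row}")
print(f"sparsity: {sparsity:.2f}")
```

Even in this tiny corpus half of the cells are zero; with a real vocabulary and many documents almost all of them are, which is why the reduction step is needed.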
Another approach is to learn the vector representation directly from the data. This algorithm (named word2vec) was suggested in 2013 by Mikolov. Actually, word2vec is two algorithms: CBOW (continuous bag of words) and Skip-Gram. In CBOW you predict a word based on the words before and after it. In Skip-Gram the task is the opposite: you predict the context based on the word.
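The difference between the two tasks is easiest to see in the training examples each one produces. The sketch below only generates those examples from a sentence; it does not train anything. The function names and the window parameter are my own choices for illustration, not part of the original word2vec code.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word): predict the word from its neighbours."""
    for i, target in enumerate(tokens):
        # Up to `window` words before and after position i.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

def skipgram_examples(tokens, window=2):
    """Yield (input_word, context_word): predict each neighbour from the word."""
    for context, target in cbow_examples(tokens, window):
        for neighbour in context:
            yield target, neighbour

tokens = "the quick brown fox jumps".split()
cbow = list(cbow_examples(tokens, window=1))
skipgram = list(skipgram_examples(tokens, window=1))

print(cbow[2])       # context around "brown", then the word to predict
print(skipgram[:3])  # (input word, one context word) pairs
```

CBOW gets one example per position with the whole context as input, while Skip-Gram gets one example per (word, neighbour) pair.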

With this approach, you can learn word representations very quickly (e.g., representations for all words in the English wiki (~80 GB of unzipped text) can be learned in ~10 hours on an office laptop).
You can directly measure the similarity between the resulting vectors (and so get the similarity between word contexts, e.g. 'stock market' ≈ 'thermometer', with a similarity of 0.72). You can also use the vectors as building blocks for more complex neural nets.
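A common way to measure similarity between two word vectors is the cosine of the angle between them. That choice is an assumption on my side (it is the usual one for word vectors), and the two small vectors below are made up, not real word2vec output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional "word vectors", for illustration only.
vec_a = [1.0, 0.0]
vec_b = [1.0, 1.0]

similarity = cosine_similarity(vec_a, vec_b)
print(f"{similarity:.2f}")
```

Because the measure depends only on direction, two words can be similar even if their vectors have very different lengths.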
Word vectors also unlock really cool new operations, like adding or subtracting word representations, which looks like adding or subtracting the contexts of the words.

Or even cooler:
Iraq - Violence = Jordan
President - Power = Prime Minister
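Mechanically, an expression like "Iraq - Violence" is a vector subtraction followed by a nearest-neighbour search over the vocabulary. Here is a minimal sketch of that idea. The 3-dimensional vectors are hand-made so that the arithmetic works out; they are not real embeddings, and the helper names are hypothetical.

```python
import math

def subtract(u, v):
    """Element-wise u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_word(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine_similarity(query, vectors[word]))

# Hand-made toy vectors (assumption: NOT learned from data).
toy_vectors = {
    "iraq":        [1.0, 1.0, 0.0],
    "violence":    [0.0, 1.0, 0.0],
    "jordan":      [1.0, 0.0, 0.0],
    "thermometer": [0.0, 0.0, 1.0],
}

query = subtract(toy_vectors["iraq"], toy_vectors["violence"])
# The input words are excluded so the search cannot return them.
result = nearest_word(query, toy_vectors, exclude=("iraq", "violence"))
print(result)
```

With real embeddings the result is only the closest word in the vocabulary, so the "equals" sign in these examples should be read as "is nearest to".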
Engineers at Instagram applied this technique to obtain the meanings of emoji.

Interested in this topic? You can read more here:
Mikolov's original paper:
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Instagram Engineering Blog:
http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji
Cool examples (I used them above): http://byterot.blogspot.co.uk/2015/06/five-crazy-abstractions-my-deep-learning-word2doc-model-just-did-NLP-gensim.html

