Data Scientist's Diary: Distributed Word Representation

Friday, May 13, 2016

Distributed Word Representation

Today we will talk about the main "building block" of deep learning applications for NLP: vectors.
Every unit of language (a phoneme, a word, a phrase, a sentence, even a whole document) can be represented as a vector. I find that really cool.
How do we get this representation?
The most straightforward way is to build a word-document matrix. This matrix will be large and sparse, so the next step is dimensionality reduction (e.g. SVD).
The main problem here is the computational cost: a full SVD of an n x m matrix costs O(m x n^2) when n < m, i.e. it scales quadratically with the smaller dimension.
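To make the word-document matrix concrete, here is a minimal sketch in plain Python. The three-document corpus is a made-up toy example, not data from this post; it only shows how the counts are laid out and how quickly zeros start to dominate. A dimensionality reduction such as SVD would then be applied to `matrix`.

```python
from collections import Counter

# Hypothetical toy corpus: each string is one "document".
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as the market closed",
]

# Sorted vocabulary, so that row indices are reproducible.
vocab = sorted({word for doc in docs for word in doc.split()})

# Per-document word counts.
counts = [Counter(doc.split()) for doc in docs]

# matrix[i][j] = number of times vocab[i] occurs in docs[j].
matrix = [[c[word] for c in counts] for word in vocab]

# Share of cells that are zero: even here more than half of the matrix is empty.
zeros = sum(cell == 0 for row in matrix for cell in row)
sparsity = zeros / (len(vocab) * len(docs))

for word, row in zip(vocab, matrix):
    print(f"{word:>8} {row}")
print(f"sparsity: {sparsity:.2f}")
```

With a real vocabulary (hundreds of thousands of words) and millions of documents, almost every cell is zero, which is why the matrix is stored in a sparse format and reduced afterwards.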
Another approach is to learn the vector representation directly from the data. This approach, named word2vec, was proposed in 2013 by Mikolov and colleagues. Actually, word2vec covers two algorithms: CBOW (continuous bag of words) and Skip-Gram. In CBOW you predict a word from the words before and after it. In Skip-Gram the task is the opposite: you predict the context from the word.
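The difference between the two tasks is easiest to see in the training examples they produce. The sketch below only generates (input, target) pairs from a tokenized sentence; it does not train a network. The function names and the `window` parameter are my own illustrative choices, not part of any word2vec implementation.

```python
def context_window(tokens, i, window):
    """Return the words up to `window` positions before and after index i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the input is the context, the target is the centre word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-Gram: the input is the centre word, each context word is a target."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = "the quick brown fox jumps".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # context words as input, centre word as target
print(skipgram[:3])  # centre word as input, one context word per pair
```

Note that CBOW yields one example per position, while Skip-Gram yields one example per (word, context word) pair.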

With this approach, you can learn word representations very quickly: for example, representations for all words in the English wiki (~80 GB of unzipped text) can be learnt in ~10 hours on an office laptop.
You can directly measure the similarity between the resulting vectors, which gives you a similarity between word contexts (e.g. 'stock market' ≈ 'thermometer', with a similarity of 0.72). You can also use the vectors as building blocks for more complex neural nets.
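Similarity between word vectors is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a minimal sketch; the 3-dimensional vectors are invented for illustration (real word2vec vectors typically have hundreds of dimensions and are learnt from data).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, NOT learnt from any corpus.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "market": [0.0, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_market = cosine_similarity(vectors["cat"], vectors["market"])

print(f"cat vs dog:    {sim_cat_dog:.2f}")
print(f"cat vs market: {sim_cat_market:.2f}")
```

Because the measure depends only on direction, not length, words that appear in similar contexts score close to 1 regardless of how frequent they are.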
This approach unlocks really cool new operations, like adding or subtracting word representations, which behaves like adding or subtracting the contexts of the words.

Or even cooler:
Iraq - Violence = Jordan
President - Power = Prime Minister
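Mechanically, such an "equation" means: subtract one vector from another, then look up the word whose vector is closest to the result. The sketch below shows only that mechanism. The 2-dimensional vectors are hand-picked toy values chosen so that the lookup works; they are not learnt embeddings and prove nothing about real data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest(query, vectors, exclude=()):
    """Return the word whose vector is most similar to `query`."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Hand-picked toy vectors (hypothetical axes: "head of government", "power").
vectors = {
    "president": [0.9, 0.8],
    "prime minister": [0.9, 0.2],
    "power": [0.0, 0.6],
    "market": [0.1, 0.9],
}

# president - power, computed element by element.
query = [a - b for a, b in zip(vectors["president"], vectors["power"])]

# The words used in the query are excluded from the lookup.
result = nearest(query, vectors, exclude={"president", "power"})
print(result)
```

Excluding the query words is a common convention, since otherwise the nearest neighbour is often one of the inputs themselves.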
The Instagram engineering team applied this technique to learn the meanings of emoji.

Interested in this topic? You can read more here:
Mikolov's original paper:
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Instagram Engineering Blog:
http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji
Cool examples (I used them above):
http://byterot.blogspot.co.uk/2015/06/five-crazy-abstractions-my-deep-learning-word2doc-model-just-did-NLP-gensim.html
