Thursday, October 29, 2015

Search query understanding

A nice presentation about search queries, and an amazing article about search experience optimization.

Wednesday, September 30, 2015

Memory Networks is on GitHub now

Great news! Facebook has made the Memory Networks project public. Memory Networks is a research project that implements a kind of human-like long-term memory for neural networks.
[Embedded video: a talk about Memory Networks]
Here is the link to the GitHub project: https://github.com/facebook/MemNN
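
For intuition, here is a minimal numpy sketch of the addressing step at the heart of memory networks: a query vector attends over memory slots via softmax and returns a weighted readout. This is my own illustration of the idea, not code from the repository.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_readout(query, memory_keys, memory_values):
    """Score each memory slot against the query, normalize with
    softmax, and return the weighted sum of the slot values."""
    scores = memory_keys @ query          # one score per slot
    weights = softmax(scores)             # soft attention over slots
    return weights @ memory_values        # readout vector

# Toy example: 4 memory slots, 3-dimensional embeddings (all random).
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 3))
values = rng.normal(size=(4, 3))
query = rng.normal(size=3)
print(memory_readout(query, keys, values))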

Thursday, July 16, 2015

ICML 2015 Word Cloud

A nice visualisation from Andrew Collier: a word cloud of the 300 most frequent words from accepted ICML 2015 papers.
http://www.exegetic.biz/blog/wp-content/uploads/2015/07/word-cloud.png
The methodology behind the word cloud: http://www.exegetic.biz/blog/2015/07/constructing-word-cloud-for-icml-2015/
The list of accepted papers can be found here: http://icml.cc/2015/?page_id=825
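
If you want to build something similar yourself, here is a comparable sketch in Python using the wordcloud package (the linked post does it differently; paper_titles.txt is a hypothetical file with one paper title per line):

from wordcloud import WordCloud

# Hypothetical input: one ICML paper title per line.
with open("paper_titles.txt") as f:
    text = f.read()

# Keep the 300 most frequent words, mirroring the visualisation above.
wc = WordCloud(max_words=300, width=800, height=600).generate(text)
wc.to_file("icml2015_word_cloud.png")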

Tuesday, July 14, 2015

Recommendation papers from ArXiv

Sometimes you come across an idea and wonder why you didn't implement it yourself earlier.
The idea: arXiv is a repository of over 1 million preprints in physics, mathematics, and computer science, so it is possible to train a recommender on the papers and streamline the search process.
The full description is here: https://blog.lateral.io/2015/07/harvesting-research-arxiv/
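
As a rough illustration of the idea (my own sketch, not Lateral's actual system), here is a minimal content-based recommender over paper abstracts using scikit-learn's TF-IDF vectors and cosine similarity; the toy corpus is made up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus of abstracts; a real system would index arXiv dumps.
abstracts = [
    "We study convolutional neural networks for image classification.",
    "A new bound for stochastic gradient descent convergence is derived.",
    "Recurrent neural networks are applied to language modelling.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(abstracts)

def recommend(query, top_n=2):
    """Return indices of the abstracts most similar to the query text."""
    q = vectorizer.transform([query])
    scores = cosine_similarity(q, matrix).ravel()
    return scores.argsort()[::-1][:top_n]

print(recommend("neural networks for text"))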

And a motivating image:
[source: http://physicsbuzz.physicscentral.com/2012/08/risks-and-rewards-of-arxiv-reporting.html]

Monday, May 11, 2015

Emoji natural language processing

On the Instagram Engineering blog you can read about NLP techniques for discovering the "context" of emoji. They use word2vec to map each emoji into a metric space and t-SNE as a visualisation tool.
Full texts of the articles:
  • Emojineering Part 1: Machine Learning for Emoji Trends
  • Emojineering Part 2: Implementing Hashtag Emoji
  • Emoji Wiki
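
To get a feel for the pipeline, here is a sketch of the same two steps under my own assumptions (not Instagram's code): train word2vec on tokenised posts in which emoji appear as ordinary tokens, then project the emoji vectors to 2-D with t-SNE.

import numpy as np
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Hypothetical tokenised posts where emoji appear as ordinary tokens.
posts = [
    ["great", "day", "at", "the", "beach", "😎"],
    ["so", "funny", "😂", "😂"],
    ["love", "this", "song", "❤"],
] * 50  # repeat the toy data so word2vec has enough samples to train on

# vector_size is the gensim >= 4 parameter name (older versions used `size`).
model = Word2Vec(posts, vector_size=50, min_count=1, window=3)

emoji = ["😎", "😂", "❤"]
vectors = np.array([model.wv[e] for e in emoji])

# Project emoji embeddings to 2-D (perplexity must be < number of samples).
coords = TSNE(n_components=2, perplexity=2, init="random").fit_transform(vectors)
for e, (x, y) in zip(emoji, coords):
    print(e, x, y)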

Friday, April 17, 2015

Words similarity

Finding words (or sentences, or documents) with the same meaning is a general problem in NLP (Natural Language Processing), and deep learning is helping to advance this area.
For example, the word2vec approach lets you derive from a text corpus analogies such as "man is to king as woman is to ?", where "?" resolves to "queen". It's amazing stuff. In addition, you can model not just word-to-word similarity but also similarity between a word and a sequence of words.
Here are some examples from a model trained on the Google News corpus:
[table of analogy examples from the original post]
The paper with the full description: "Distributed Representations of Words and Phrases and their Compositionality".
Open-source implementation: https://code.google.com/p/word2vec/
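
If you want to reproduce the analogy yourself, here is a minimal sketch using gensim's re-implementation of word2vec (my choice of library; the link above is the original C tool) together with the pre-trained Google News vectors:

from gensim.models import KeyedVectors

# GoogleNews-vectors-negative300.bin is the pre-trained model distributed
# with the original word2vec project (a large download, ~1.5 GB).
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# The "man is to king as woman is to ?" analogy from the post:
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected to return [('queen', <cosine similarity score>)]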

Thursday, April 2, 2015

Spark for Data Science

In June you can learn on edX "how to apply data science techniques using parallel programming in Apache Spark to explore big (and small) data". The course is "Introduction to Big Data with Apache Spark" from Berkeley.
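
For a taste of what this kind of parallel programming looks like, here is a minimal PySpark word count (my own sketch, not course material; input.txt is a hypothetical text file):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

# Split lines into words, emit (word, 1) pairs, and sum counts per word.
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# Print the ten most frequent words.
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

sc.stop()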

Tuesday, January 20, 2015

Process Mining: Data Science in Action by Coursera


This is my short review of the Coursera Process Mining course (by Wil van der Aalst from Eindhoven University of Technology).
The name of the course sounds very intriguing, but the main task is quite simple: building a behaviour model from an event log, while balancing overfitting against underfitting. That means the model should explain the majority of recorded cases and still be general enough to explain new ones (see the toy sketch after the list below).
The main tools recommended in the course are Disco and ProM. They let you build models in different notations (e.g. BPMN) and create visualisations.
The two main aspects of process mining are organisational and social:
Organisational tasks:
  • discover typical workflow actions (for customers, employees, etc.)
  • analyse the time spent on each task
  • mine for "bottlenecks"
Social tasks:
  • discover user groups and their relations within a process
  • analyse the time spent by each worker, customer, etc.
In addition, within both aspects you can recommend next steps or forecast the completion time of future tasks.
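
As a toy illustration of the kind of model discovery involved (a plain-Python sketch, not what Disco or ProM do internally), here is how one can extract a directly-follows graph from an event log:

from collections import Counter, defaultdict

# Hypothetical event log: (case_id, activity) pairs, already sorted by time.
events = [
    (1, "register"), (1, "check"), (1, "approve"),
    (2, "register"), (2, "check"), (2, "reject"),
    (3, "register"), (3, "approve"),
]

# Group activities by case, preserving their order.
traces = defaultdict(list)
for case_id, activity in events:
    traces[case_id].append(activity)

# Count how often activity a is directly followed by activity b.
follows = Counter()
for trace in traces.values():
    for a, b in zip(trace, trace[1:]):
        follows[(a, b)] += 1

for (a, b), n in follows.most_common():
    print(f"{a} -> {b}: {n}")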

The lecturer's SlideShare: http://www.slideshare.net/wvdaalst
Next session: April-May 2015


Friday, January 9, 2015

Dive into Deep Learning

If you are interested in deep learning, try the UFLDL (Unsupervised Feature Learning and Deep Learning) tutorial from Stanford University.
If the topic is completely new to you, it is better to start with https://www.coursera.org/course/ml. After the current session, the course will switch to a self-study format.