This blog post is a short cheat sheet for getting started with data science in Python; it is not a complete list. It covers the main topics worth getting familiar with:
- Basic tooling
- Using existing models
- Build your own models and machine learning
- Deep learning
- Useful learning resources
1. Basic tooling
- Python – Open-Source – Programming language
- Widely used by data scientists and easy to work with when you handle strings, lists, and various file resources. For more details, I like the blog post “Top 7 Reasons Why You Need to Learn Python as a Data Scientist”.
- Jupyter Notebook – Open-Source – Workbench for data science that works with Python.
- A useful blog post when you start working with Jupyter Notebook is “autocomplete in jupyter notebooks”.
- Natural Language Toolkit _(NLTK)_ – Open-Source – A leading platform for building Python programs that work with human language data.
- Stemming – “Where does a word come from?” Stemming reduces a word to its stem, for example “running” to “run”. (For details, please visit Sample usage for stem within NLTK)
- Valence Aware Dictionary and sEntiment Reasoner (VADER) – Open-Source – A lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
- spaCy – Open-Source – Natural Language Processing (NLP) library for Python.
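To make the stemming item more concrete, here is a minimal sketch using NLTK's `PorterStemmer`. The word list is made up for illustration, and the Porter algorithm is only one of several stemmers that NLTK ships. It is rule-based, so it should not need any extra data downloads.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer strips common English suffixes with a fixed rule set.
stemmer = PorterStemmer()

# Hypothetical sample words, chosen only for illustration.
words = ["running", "runs", "cats", "connected", "connection"]

# Map every word to its stem.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Different word forms such as “connected” and “connection” end up with the same stem, which is what makes stemming useful for search and counting. Keep in mind that a stem is not always a dictionary word.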
2. Existing models
Natural Language Processing is a good place to start. You can begin with spaCy, an Open-Source Natural Language Processing library for Python, and check out the following topics:
- Token/Tokenization – Splitting a text into words, characters, or word combinations. A token is an instance of a sequence of characters in some particular document that is grouped as a useful semantic unit for processing (definition from Cambridge University Press). (Details in spaCy)
- Lemmatization is closely related to stemming; it reduces a word to its dictionary form (the lemma) instead of only cutting off suffixes.
- Stop words are the words that are filtered out before or after the processing of natural language data (text). (Details in spaCy)
- Matching – Finding sequences of tokens that match a pattern in a text.
- Part-of-speech (PoS) tagging – Assigning a grammatical category such as noun or verb to each token. (Details in spaCy)
- Visualizing dependencies with displaCy. (Details in spaCy)
- Named entity recognition locates and classifies named entities mentioned in unstructured text. (Details in spaCy)
- Word2vec – A two-layer neural network that learns vector representations of words (word embeddings) from text.
- SciPy contains fundamental algorithms for scientific computing in Python.
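Tokenization and stop word filtering from the list above can be sketched in a few lines of spaCy. This sketch assumes `spacy.blank("en")`, which creates a tokenizer-only English pipeline without downloading a trained model; lemmas, PoS tags, dependencies, and named entities need a trained pipeline, which is not loaded here. The sample sentence is made up for illustration.

```python
import spacy

# A blank English pipeline contains only the tokenizer and language defaults
# (including the stop word list); no trained model download is required.
nlp = spacy.blank("en")

# Hypothetical sample sentence, used only for illustration.
doc = nlp("The cat is sitting on the mat.")

# Every token in the order it appears in the text.
tokens = [token.text for token in doc]

# Keep tokens that are neither stop words nor punctuation.
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]

print(tokens)
print(content_words)
```

Starting with a blank pipeline keeps the example small. When you need lemmatization, tagging, or named entity recognition, you would load a trained pipeline with `spacy.load` instead.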
3. Build models and machine learning
Machine learning algorithms build a model based on sample data, known as training data, to make predictions or decisions without being explicitly programmed to do so.
- scikit-learn – Open-Source – Contains simple and efficient tools for predictive data analysis.
- Supervised learning – Learning from labeled historical data to predict labels for new data (for example, a spam filter trained on known spam predicts which future messages are spam).
- Unsupervised learning is a type of algorithm that learns patterns from untagged data.
- Term frequency–inverse document frequency (tf-idf) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Often used in the context of LDA and NMF.
- Latent Dirichlet allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar.
- Non-negative matrix factorization (NMF) is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements.
- Accuracy and precision – How often does the model deliver the right result? Accuracy is the share of all predictions that are correct; precision is the share of positive predictions that are actually positive.
- A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix).
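Several items from this list fit together in one small example: a supervised spam filter that uses tf-idf features and is evaluated with accuracy, precision, and a confusion matrix. This is only a sketch under assumptions. The messages and labels are invented, and Naive Bayes is just one classifier you could pick in scikit-learn. With so little data the scores mean nothing; the point is how the pieces connect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled training data: 1 = spam, 0 = not spam.
train_texts = [
    "win a free prize now",
    "free money click here",
    "claim your free reward today",
    "meeting moved to monday",
    "please review the attached report",
    "lunch at noon tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Hypothetical held-out messages with their true labels.
test_texts = ["free prize inside", "report for the monday meeting"]
test_labels = [1, 0]

# tf-idf turns each text into a weighted word vector; Naive Bayes classifies it.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)

accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions, zero_division=0)
# Rows are true classes, columns are predicted classes (order: 0, 1).
matrix = confusion_matrix(test_labels, predictions, labels=[0, 1])

print("accuracy:", accuracy)
print("precision:", precision)
print(matrix)
```

The confusion matrix shows where the model goes wrong: the off-diagonal cells count messages that were wrongly flagged as spam or wrongly let through.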
4. Get started with deep learning
- Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised, or unsupervised.
- Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial intelligence and deep learning.
- Rectifier (ReLU) is an activation function that passes positive inputs through unchanged and sets negative inputs to zero.
- Keras is an open-source software library that provides a Python interface for artificial neural networks.
- A recurrent neural network is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes.
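To show what the rectifier and a recurrent connection actually compute, here is a small sketch in plain NumPy rather than Keras. Everything in it is an assumption made for illustration: the layer sizes, the random weights, and the choice of the rectifier as the activation inside the recurrent step (many recurrent networks use tanh instead). Nothing is trained here. In practice a library such as Keras provides ready-made recurrent layers and learns the weights for you.

```python
import numpy as np


def relu(x):
    """Rectifier: keep positive values, replace negative values with zero."""
    return np.maximum(0.0, x)


rng = np.random.default_rng(seed=0)

# Hypothetical sizes: 3 input features, 4 hidden units, 5 time steps.
n_inputs, n_hidden, n_steps = 3, 4, 5
W_in = rng.normal(size=(n_hidden, n_inputs))   # input -> hidden weights
W_rec = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden (the cycle)
inputs = rng.normal(size=(n_steps, n_inputs))  # random stand-in for a sequence

# The hidden state from one step feeds back into the next step.
hidden = np.zeros(n_hidden)
states = []
for x_t in inputs:
    hidden = relu(W_in @ x_t + W_rec @ hidden)
    states.append(hidden)
states = np.array(states)

print(states.shape)
```

The line inside the loop is the cycle mentioned above: the output of the hidden units at one time step becomes part of their input at the next step.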
5. Useful learning resources
- Natural Language Processing with Python, a Udemy course.
- Getting Started with Python for Data Science, a free course by Codecademy.
- How to Learn Python for Data Science in 5 Steps, a blog post from Dataquest.
- Python for Data Science (Ultimate Quickstart Guide), a short free guide at Elite Data Science.
- Free learning on Cognitive Class.ai
I hope this was useful to you. Let’s see what’s next!
Greetings,
Thomas
#ai, #cheatsheet, #datascience