Cheat Sheet: Get started with Data Science and Python

This blog post is a short cheat sheet for getting started with data science and Python. It is not a complete list; it covers some useful main topics to get familiar with:

Basic tooling
Using existing models
Build your own models and machine learning
Deep learning
Useful learning resources

1. Basic tooling

Python – Open-Source – Programming language
- Widely used by data scientists and easy to use when you work with strings and lists or handle various file resources. For more details, I like the blog post Top 7 Reasons Why You Need to Learn Python as a Data Scientist.
Jupyter Notebook – Open-Source workbench for data science that uses Python.
If you are going to work with Jupyter Notebook, here is a useful blog post: autocomplete in jupyter notebooks.
Natural Language Toolkit (NLTK) is an Open-Source platform and a leading choice for building Python programs that work with human language data.
- Stemming – “What is the root form of a word?” Stemming reduces a word to its stem, for example “running” to “run”. (For details, please visit Sample usage for stem within NLTK)
- Valence Aware Dictionary and sEntiment Reasoner (VADER) is an Open-Source lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
spaCy is an Open-Source Natural Language Processing (NLP) library for Python.
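
To make the stemming idea concrete, here is a minimal sketch. It assumes NLTK is installed and uses its Porter stemmer, which is only one of several stemmers NLTK offers; the word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer is rule-based and needs no extra data download.
stemmer = PorterStemmer()

# Illustrative word list (an assumption, not from any dataset).
words = ["running", "runs", "run"]

# Each inflected form is reduced to the same stem.
stems = [stemmer.stem(word) for word in words]
print(stems)
```

All three forms end up with the same stem, which is why stemming is often used to group word variants before counting or searching.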

2. Existing models

Natural Language Processing may be a good place to start. You can begin with spaCy, an Open-Source Natural Language Processing library for Python, and check out the following topics:

Token/Tokenization – How is a text split into words, punctuation, and other units? A token is an instance of a sequence of characters in some particular document that is grouped as a useful semantic unit for processing (definition from Cambridge University Press). (Details in spaCy)
Lemmatization is closely related to stemming: it reduces a word to its dictionary form (lemma), for example “was” to “be”.
Stop words are words that are filtered out before or after the processing of natural language data (text). (Details in spaCy)
Matching finds sequences of tokens that match a pattern in context.
Part-of-speech (PoS) tagging assigns a word class such as noun or verb to each token. (Details in spaCy)
Visualizing dependencies with displaCy. (Details in spaCy)
Named entity recognition locates and classifies named entities mentioned in unstructured text. (Details in spaCy)
Word2vec is a two-layer neural network that learns vector representations of words (word embeddings).
SciPy contains fundamental algorithms for scientific computing in Python.
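
Tokenization and stop-word removal from the list above can be sketched in plain Python. This is only an illustration of the idea, not how spaCy does it: the regular expression, the tiny `STOP_WORDS` set, and the function names are assumptions made for this example, and real libraries ship much more careful tokenizers and longer stop-word lists.

```python
import re

# Hypothetical, very small stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = tokenize("The cat sat on the mat.")
content_tokens = remove_stop_words(tokens)

print(tokens)          # all tokens, including punctuation
print(content_tokens)  # tokens left after filtering stop words
```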

3. Build models and machine learning.

Machine learning algorithms build a model based on sample data, known as training data, to make predictions or decisions without being explicitly programmed to do so.

scikit-learn provides simple and efficient tools for predictive data analysis.
Supervised learning trains on labeled historical data to predict labels for new data (for example spam filters: which future messages are spam?).
Unsupervised learning is a type of algorithm that learns patterns from unlabeled data.
- Term frequency–inverse document frequency (tf-idf) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Used in the context of LDA and NMF.
- Latent Dirichlet allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar.
- Non-negative matrix factorization (NMF) is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements.
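Accuracy and precision: how often does the model deliver the right result? Accuracy is the share of all predictions that are correct; precision is the share of positive predictions that are correct.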
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix).
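
Here is a minimal plain-Python sketch of how the four cells of a binary confusion matrix lead to accuracy and precision. The labels are made up (1 could stand for spam, 0 for not spam) and the function name is an assumption for this example; scikit-learn offers ready-made functions for the same metrics.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive and truth == positive:
            tp += 1
        elif prediction == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up labels for illustration: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Accuracy: correct predictions out of all predictions.
accuracy = (tp + tn) / (tp + fp + fn + tn)
# Precision: correct positive predictions out of all positive predictions.
precision = tp / (tp + fp)

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"accuracy={accuracy} precision={precision}")
```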

4. Get started with deep learning.

Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised, or unsupervised.
- Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial intelligence and deep learning.
- Rectifier (ReLU) is an activation function that outputs its input if it is positive and zero otherwise, f(x) = max(0, x).
- Keras is an open-source software library that provides a Python interface for artificial neural networks.
- A recurrent neural network is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes.
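
Two of the ideas above can be sketched in plain Python without any deep learning library: the rectifier as an activation function, and the cycle in a recurrent network where the previous hidden state feeds into the next step. The weights `w_x`, `w_h`, the bias `b`, and the input sequence are arbitrary values assumed for this illustration; in practice a library such as Keras learns them during training.

```python
import math


def relu(x):
    """Rectifier: pass positive values through, map everything else to zero."""
    return max(0.0, x)


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a single-unit recurrent cell with a tanh activation.

    The new hidden state depends on the current input x and on the
    previous hidden state h_prev, which is the recurrent connection.
    """
    return math.tanh(w_x * x + w_h * h_prev + b)


print(relu(-2.0), relu(3.0))

# Run the cell over a short, made-up input sequence.
h = 0.0
states = []
for x in [1.0, 0.0, -1.0]:
    h = rnn_step(x, h)
    states.append(h)

print(states)
```

The second input is 0.0, yet the second hidden state is not zero, because the state from the first step is carried over.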

5. Useful learning resources

Natural Language Processing with Python, a Udemy course
Getting Started with Python for Data Science, a free course by Codecademy
How to Learn Python for Data Science in 5 Steps, a blog post from Dataquest
Python for Data Science (Ultimate Quickstart Guide), a short free guide at Elite Data Science
Free learning on Cognitive Class.ai

I hope this was useful to you. Let’s see what’s next.

Greetings,

Thomas

#ai, #cheatsheet, #datascience
