Popular Python libraries for Data Science and Machine Learning
Tags: Data Science, Machine Learning, Python

Python is almost a must-have skill for data scientists, as many data scientist positions require Python programming skills. This post introduces some of the most popular Python modules for data science. They are widely used for projects in data mining, machine learning, and general data analysis.

SciPy. SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. It provides a wide range of algorithms and mathematical tools for data scientists.
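As a small taste of those tools, here is a sketch using SciPy's optimization module to minimize a simple quadratic function (the function itself is just an illustrative example):

```python
from scipy import optimize

# Minimize f(x) = (x - 3)^2; the minimum is at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)  # close to 3.0
```

The `scipy.optimize`, `scipy.stats`, `scipy.linalg`, and `scipy.integrate` subpackages follow the same pattern: import the subpackage and call well-documented routines on plain NumPy arrays or Python callables.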

NumPy. NumPy is the fundamental package for scientific computing with Python. It provides a fast N-dimensional array object and advanced mathematical functionality on top of it.
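A minimal sketch of the array object and vectorized operations (the sample values are arbitrary):

```python
import numpy as np

# Create an array and apply vectorized operations to every element at once
a = np.array([1.0, 2.0, 3.0, 4.0])
print(a.mean())        # 2.5
print((a ** 2).sum())  # 30.0

# Broadcasting: the row vector `a` is added to each row of the matrix
m = np.ones((2, 4))
print((m + a).shape)   # (2, 4)
```

Vectorized operations like these run in compiled code, which is why NumPy arrays are preferred over Python lists for numerical work.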

Scikit-learn: Scikit-learn is the most famous machine learning library for Python. It includes a broad range of different classifiers, cross-validation and other model selection methods, dimensionality reduction techniques, modules for regression and clustering analysis, and a useful data-preprocessing module.
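A minimal sketch of the typical scikit-learn workflow (split, fit, predict, score), using the bundled iris dataset as an example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a toy dataset and hold out 30% for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Every scikit-learn estimator exposes the same fit/predict interface
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

Because all estimators share the `fit`/`predict` interface, you can swap `LogisticRegression` for any other classifier without changing the surrounding code.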

pandas: pandas is a library for working with table-like structures. It comes with a powerful DataFrame object, a two-dimensional labeled data structure that supports efficient numerical operations similar to NumPy’s ndarray, with additional functionality such as labeled axes and group-by operations.
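A minimal sketch of a DataFrame and a group-by aggregation (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Paris", "London"],
    "sales": [100, 150, 80],
})

# Group rows by city and sum the sales within each group
totals = df.groupby("city")["sales"].sum()
print(totals["Paris"])  # 250
```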

IPython. IPython is a command-line shell for interactive computing with many useful enhancements over the “default” Python interpreter.
IPython notebooks are a great environment for scientific computing: not only can you execute code, but you can also add informative documentation via Markdown, HTML, LaTeX, embedded images, and inline data plots (e.g., via matplotlib). IPython also provides high-performance tools for parallel computing.

Requests. Requests is an elegant and simple HTTP library for Python, built for human beings. As a data scientist, you might have to collect data from the web, and Requests makes that straightforward.

Scrapy. Scrapy is an open-source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

Statsmodels: Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.

Theano: If you are working on a deep learning project, you may need Theano. It is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

gensim: gensim is one of the most robust, efficient, and hassle-free pieces of software for unsupervised semantic modelling from plain text. It can easily be used to train topic models. If you want to apply topic models to your text data, gensim is the one you should try.

SymPy: SymPy is a Python library for symbolic mathematical computation. It has a broad range of features, covering calculus, algebra, geometry, discrete mathematics, and even quantum physics. It also includes basic plotting functionality and printing functions with LaTeX support.
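A minimal sketch of symbolic simplification, differentiation, and integration:

```python
import sympy as sp

x = sp.symbols("x")

# Simplify a trigonometric identity symbolically
expr = sp.sin(x) ** 2 + sp.cos(x) ** 2
print(sp.simplify(expr))                         # 1

# Exact derivative and improper integral, no numerics involved
print(sp.diff(x ** 3, x))                        # 3*x**2
print(sp.integrate(sp.exp(-x), (x, 0, sp.oo)))   # 1
```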

PyMC: The focus of PyMC is Bayesian statistics, and it comes with a broad range of algorithms (including Markov chain Monte Carlo, MCMC) for model fitting.

matplotlib. matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. It is a must-have for any data scientist or data analyst.
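A minimal sketch of the standard figure/axes workflow, writing a plot to a file (the filename `sine.png` is just an example):

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.legend()
fig.savefig("sine.png")  # hardcopy output, e.g. PNG, PDF, or SVG
print(os.path.exists("sine.png"))
```

In an IPython notebook you would skip the `Agg` backend and `savefig`, and the figure would render inline instead.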

BeautifulSoup. When you have an HTML page and are just trying to get some data out of it, Beautiful Soup is the one you need. You can use it to extract content from HTML pages.
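A minimal sketch of parsing an HTML string and pulling out elements (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigate by tag name, or search by tag and attribute
print(soup.h1.text)                          # Title
print(soup.find("p", class_="intro").text)   # Hello
```

In practice you would pair this with Requests: fetch a page with `requests.get(...)` and pass its text to `BeautifulSoup`.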

nltk. The Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data. If you work on NLP-related projects, NLTK is a must-know tool.

sqlite3: This module helps you store data easily. It provides a Python interface to SQLite, an open-source SQL database engine that is ideal for smaller workgroups, because the whole database is a single locally stored file (up to 140 TB in size) and, in contrast to server-based SQL databases, requires no server infrastructure.
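A minimal sketch using the standard-library `sqlite3` module with an in-memory database (pass a filename instead of `":memory:"` to persist to disk; the table and values are invented for illustration):

```python
import sqlite3

# In-memory database; no server or file setup required
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, value REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 1.5), ("b", 2.5)])

total, = conn.execute("SELECT SUM(value) FROM scores").fetchone()
print(total)  # 4.0
conn.close()
```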
These are the libraries I often use when conducting data mining projects. If you have any thoughts or suggestions, please leave a comment.