Machine Learning | Learn for Master - Part 3
  • Best articles to learn deep learning

    A Step by Step Backpropagation Example



    Backpropagation is a common method for training a neural network. There is no shortage of papers online that attempt to explain how backpropagation works, but few that include an example with actual numbers. This post is my attempt to explain how it works with a concrete example that folks can compare their own calculations to in order to ensure they understand backpropagation correctly.
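
    As a minimal illustration of the kind of numeric check the post describes, the sketch below runs backpropagation through a single sigmoid neuron and compares the analytic gradient with a finite-difference estimate. The one-neuron network, the input values, and all function names are assumptions chosen for illustration; they are not taken from the original post.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, target):
    """Squared-error loss of one sigmoid neuron on a single example."""
    out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 0.5 * (out - target) ** 2


def backprop(w, b, x, target):
    """Analytic gradients (dL/dw, dL/db) obtained with the chain rule."""
    out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # dL/dout = out - target, dout/dz = out * (1 - out), dz/dw_i = x_i
    delta = (out - target) * out * (1.0 - out)
    return [delta * xi for xi in x], delta


def numeric_grad(w, b, x, target, i, eps=1e-6):
    """Central finite-difference estimate of dL/dw[i]."""
    w_hi = w[:i] + [w[i] + eps] + w[i + 1:]
    w_lo = w[:i] + [w[i] - eps] + w[i + 1:]
    return (loss(w_hi, b, x, target) - loss(w_lo, b, x, target)) / (2 * eps)


# Illustrative values only (assumed, not from the post).
weights, bias = [0.15, 0.20], 0.35
inputs, target = [0.05, 0.10], 0.01

grad_w, grad_b = backprop(weights, bias, inputs, target)
for i, g in enumerate(grad_w):
    print(i, g, numeric_grad(weights, bias, inputs, target, i))
```

    If the two printed columns agree to several decimal places, the hand-derived gradient is very likely correct; the same check works for larger networks.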


    [Read More...]



    Using categorical data in multiple regression models is a powerful way to include non-numeric data types in a regression model. Categorical data refers to data values that represent categories: values drawn from a fixed, unordered set, for instance gender (male/female) or season (summer/winter/spring/fall). In a regression model, these values can be represented by dummy variables, which are variables containing values such as 1 or 0 to represent the presence or absence of the categorical value.

    When including dummy variables in a regression model, however, one should be careful of the Dummy Variable Trap.
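
    The trap is usually understood as perfect multicollinearity: if every category gets its own dummy column, the columns always sum to 1 and so duplicate the model's intercept. The sketch below shows this and the common fix of dropping one category as a baseline. The helper name dummy_encode and the sample data are assumptions made for illustration.

```python
def dummy_encode(values, drop_first=False):
    """One-hot encode a list of category labels.

    Returns (column_names, rows). With drop_first=True the first category
    (in sorted order) becomes the baseline and gets no column.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


seasons = ["summer", "winter", "spring", "fall", "summer"]

full_cols, full_rows = dummy_encode(seasons)
safe_cols, safe_rows = dummy_encode(seasons, drop_first=True)

# With all four dummies, every row sums to 1, which duplicates the
# intercept column of a regression model.
print(full_cols, [sum(r) for r in full_rows])
# Dropping one category removes that redundancy; the dropped category
# ("fall" here) is represented by a row of zeros.
print(safe_cols, safe_rows)
```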

    [Read More...]
  • Four tutorial-style articles on text analysis that deserve a careful read



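    The first article gives a detailed introduction to parameter estimation for discrete data, rather than using the Gaussian distribution as the running example as most textbooks do. In my view, the part most worth reading is its use of Gibbs sampling for inference in LDA. The derivations of the relevant formulas are very detailed, which makes it required reading for many people learning LDA and other related topic models.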
    author = {Heinrich, Gregor},
    title = {Parameter Estimation for Text Analysis},
    institution = {vsonix GmbH and University of Leipzig},
    year = {2009},
    type = {Technical Report Version 2.9},
    abstract = {Presents parameter estimation methods common with discrete probability
    distributions, which is of particular interest in text modeling.
    Starting with maximum likelihood, a posteriori and Bayesian estimation,
    central concepts like conjugate distributions and Bayesian networks
    are reviewed. As an application, the model of latent Dirichlet allocation
    (LDA) is explained in detail with a full derivation of an approximate
    inference algorithm based on Gibbs sampling,

    [Read More...]
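
    To make the Gibbs sampling idea concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA using only the Python standard library. It is not code from the report: the function name lda_gibbs, the symmetric hyperparameter values, the iteration count, and the toy corpus are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional p(z = t | everything else), unnormalized.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Toy corpus over a 4-word vocabulary (word ids 0..3).
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
print(doc_topic)
print(topic_word)
```

    The returned count tables can be turned into estimates of the document-topic and topic-word distributions by adding the hyperparameters and normalizing each row.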
  • Resources for article extraction from HTML pages

    Here are some good resources for learning how to extract articles from HTML pages.

    Research papers and Articles for article extraction from HTML pages

    [Read More...]
  • Good blogs about LDA topic model


    I have read some great articles about LDA. In particular, I like the posts that work through LDA examples with gensim. Gensim is a popular library for text mining. It is written in Python and is easy to use. Here are some good posts that are helpful for learning LDA.

    If you are the author of a post and don’t want me to include it here, please let me know and I will delete it.

    Introduction to Latent Dirichlet Allocation

    This post is from


    Suppose you have the following set of sentences:

    • I like to eat broccoli and bananas.
    [Read More...]
  • Chi Square test for feature selection

    Feature selection is an important problem in machine learning. Many feature selection methods are available, such as mutual information, information gain, and the chi-square test. In this post, I will use simple examples to describe how to conduct feature selection using the chi-square test. I will also show that it is easy to use Spark or MapReduce to run chi-square-based feature selection on large-scale data sets.

    Problem Statement

    Suppose there are N instances and two classes: positive and negative. Given a feature X, we can use the chi-square test to evaluate how important it is for distinguishing the classes.
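
    For a binary feature, the setup above reduces to a 2x2 contingency table of feature presence against class. The sketch below computes the chi-square statistic from the four cell counts by comparing observed with expected counts. The function name and the example counts are assumptions for illustration, and the code assumes no row or column total is zero.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 feature/class table.

    a: positive instances that contain the feature
    b: negative instances that contain the feature
    c: positive instances without the feature
    d: negative instances without the feature
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if feature and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# A feature that appears mostly in positive instances scores high;
# one spread evenly across both classes scores zero.
print(chi_square_2x2(30, 10, 20, 40))
print(chi_square_2x2(20, 20, 30, 30))
```

    Only the four counts per feature are needed, which is why the computation maps naturally onto a counting job in Spark or MapReduce; features are then ranked by their statistic.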

    [Read More...]
  • Good blogs to learn machine learning and data science

    • Occam’s Razor by Avinash Kaushik, examining web analytics and Digital Marketing.
    • OpenGardens, Data Science for Internet of Things (IoT), by Ajit Jaokar.
    • O’Reilly Radar, a wide range of research topics and books.
    • Observational Epidemiology, where a college professor and a statistical consultant offer their comments, observations and thoughts on applied statistics, higher education and epidemiology.
    • Overcoming Bias, by Robin Hanson and Eliezer Yudkowsky. Presents statistical analysis in reflections on honesty, signaling, disagreement, forecasting and the far future.
    • Probability &
    [Read More...]
  • Parse libsvm data for spark MLlib

    The LibSVM data format is widely used in machine learning. Spark MLlib is a powerful tool for training large-scale machine learning models. If your data is well formatted in LibSVM, it is straightforward to use the loadLibSVMFile method to load your data into an RDD.

    import org.apache.spark.mllib.util.MLUtils
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    However, in certain cases your data is not well formatted in LibSVM. For example, you may have different models, each with its own labeled data. Suppose your data is stored in HDFS, and each line looks like this: (model_key, training_instance_in_libsvm_format).
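
    The per-line parsing for such keyed data can be sketched as below, in plain Python for brevity rather than as Spark code. The exact delimiter is not shown in the excerpt, so the sketch assumes the key is separated from the instance by the first comma and that the line may be wrapped in parentheses. The function name is hypothetical.

```python
def parse_keyed_libsvm_line(line):
    """Split '(model_key, label idx:val ...)' into (key, label, features).

    Assumes the key contains no comma. Feature indices are converted from
    LibSVM's one-based convention to zero-based.
    """
    text = line.strip()
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    key, instance = text.split(",", 1)
    tokens = instance.split()
    label = float(tokens[0])
    features = []
    for token in tokens[1:]:
        index, value = token.split(":")
        features.append((int(index) - 1, float(value)))
    return key.strip(), label, features


example = "(model_a, 1 3:0.5 10:2.0)"
print(parse_keyed_libsvm_line(example))
```

    In a Spark job, a function like this would typically be applied to every line of the input, after which the records can be grouped by model key.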

    In this case, 

    [Read More...]
  • Parameter Server: a roundup of resources


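    Introduction to the parameter server
    See Mu Li's paper "Parameter Server for Distributed Machine Learning", which includes an introduction to his framework.
    I later came across the paper on Microsoft Research's Project Adam. Its overall approach is fairly similar, but the paper is richer in detail, so I will also describe some complementary information from it.


    1. The parameters are very large and exceed the capacity of a single machine (for example, large-scale logistic regression and neural networks).
    2. The training data is huge and needs distributed, parallel processing to speed things up (big data).

    You therefore need to implement a distributed, parallel program yourself. In fact, before Hadoop appeared, processing large-scale data always meant writing your own distributed programs (MPI). Later, Google engineers summarized and abstracted this workflow into the MapReduce framework, which unified the field.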


    Parameter Server(Mli)



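    The parallel computation happens mainly on the compute nodes. As in MapReduce, when tasks are assigned, the data is split across the worker nodes.


    Distribute training data -> node 1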
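
    The data-parallel pattern described above can be sketched in a single process as follows: the training data is split into shards, each worker computes a gradient on its own shard, and a server object holds the shared parameter and applies the averaged update. This is a synchronous toy simulation under assumed names (ParameterServer, worker_gradient, split_data) and an assumed one-parameter model; it is not the API of Mu Li's framework or of Project Adam.

```python
class ParameterServer:
    """Holds the shared weight and applies gradients pushed by workers."""

    def __init__(self, weight=0.0, learning_rate=0.01):
        self.weight = weight
        self.learning_rate = learning_rate

    def pull(self):
        return self.weight

    def push(self, gradients):
        # Average the workers' gradients, then take one descent step.
        self.weight -= self.learning_rate * sum(gradients) / len(gradients)


def worker_gradient(shard, weight):
    """Mean squared-error gradient for y ~ weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def split_data(data, num_workers):
    """Round-robin split of the training data across workers."""
    return [data[i::num_workers] for i in range(num_workers)]


# Toy data generated with a true weight of 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = split_data(data, num_workers=2)
server = ParameterServer()

for _ in range(100):
    current = server.pull()
    server.push([worker_gradient(shard, current) for shard in shards])

print(server.weight)
```

    In a real system the workers run on separate machines and the parameters themselves are partitioned across several server nodes; the sketch only shows the pull, compute, push cycle.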




    [Read More...]
  • Popular Python libraries for Data Science and Machine Learning

    Python is almost a must-have skill for data scientists; many data scientist positions require Python programming skills. This post introduces some of the most popular Python modules for data science. They are widely used to conduct projects related to data mining and machine learning, as well as general data analysis.

    1. SciPy. SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. It provides a wide range of algorithms and mathematical tools for data scientists.

    2. NumPy. NumPy is the fundamental package for scientific computing with Python. 

    [Read More...]