  • Chi Square test for feature selection

    Feature selection is an important problem in machine learning. Many feature selection methods are available, such as mutual information, information gain, and the chi-square test. In this post, I will use simple examples to describe how to conduct feature selection using the chi-square test. I will also show that it is easy to use Spark or MapReduce to run chi-square-based feature selection on large-scale data sets.

    Problem Statement

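    Suppose there are N instances and two classes: positive and negative. Given a feature X, we can use the chi-square test to evaluate how well it distinguishes between the two classes. The test checks whether the value of X is statistically independent of the class label: the larger the chi-square statistic, the stronger the dependence, and the more useful the feature is likely to be.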
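
    To make the setup concrete, here is a minimal sketch in Python. It assumes X is a binary feature (present or absent in each instance), which the problem statement does not specify, so the N instances fall into a 2x2 table of counts. The function name chi_square_2x2 and its parameter names are hypothetical, chosen only for illustration.

```python
def chi_square_2x2(pos_with, neg_with, pos_without, neg_without):
    """Pearson chi-square statistic for a binary feature vs. a binary class.

    Assumption (not from the original post): the feature X is binary, so
    every instance lands in one of four cells:
      pos_with    - positive instances that have X
      neg_with    - negative instances that have X
      pos_without - positive instances that lack X
      neg_without - negative instances that lack X
    """
    observed = [[pos_with, neg_with], [pos_without, neg_without]]
    row_totals = [sum(row) for row in observed]        # with X, without X
    col_totals = [sum(col) for col in zip(*observed)]  # positive, negative
    n = sum(row_totals)                                # N instances in total

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if X and the class were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("a row or column total is zero; "
                                 "the statistic is undefined")
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    # X appears equally often in both classes: no dependence.
    print(chi_square_2x2(25, 25, 25, 25))
    # X appears only in positive instances: strong dependence.
    print(chi_square_2x2(50, 0, 0, 50))
```

    With N = 100, a feature spread evenly across both classes scores 0, while a feature that occurs only in positive instances scores 100. Ranking features by this score and keeping the highest ones is the basic idea behind chi-square feature selection.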

