Feature selection is an important problem in machine learning. Many feature selection methods are available, such as mutual information, information gain, and the chi-square test. In this post, I will use simple examples to describe how to conduct feature selection using the chi-square test. I will also show that it is easy to use Spark or MapReduce to conduct chi-square-based feature selection on large-scale data sets.
Suppose there are N instances and two classes: positive and negative. Given a feature X, we can use the chi-square test to evaluate how important X is for distinguishing between the classes.
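To make the setup concrete, here is a minimal sketch of the chi-square statistic for a single feature. It assumes X is a binary feature (present or absent in each instance), so the N instances fall into a 2x2 contingency table. The function name `chi_square_2x2` and the count names `a`, `b`, `c`, `d` are my own labels for illustration, not something defined earlier in the post.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    Assumed (hypothetical) layout of the counts:
        a = positive instances that contain feature X
        b = negative instances that contain feature X
        c = positive instances that do not contain X
        d = negative instances that do not contain X
    """
    n = a + b + c + d                # N, the total number of instances
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]      # with X, without X
    col_totals = [a + c, b + d]      # positive, negative

    # The expected counts are undefined if any marginal total is zero.
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column total must be non-zero")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect if X were independent of the class.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Made-up counts for N = 100 instances, purely for illustration.
score = chi_square_2x2(30, 10, 20, 40)
print(score)
```

A larger value means the observed counts are further from what independence would predict, so the feature carries more information about the class. A value of 0 means X is distributed identically across both classes.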