The chi-square test is a popular method for feature selection. In this post, I describe how to use Spark to implement the chi-square test algorithm for feature selection.
Implement Chi Square test for feature selection using Spark
We have described how to use the chi-square test for feature selection. Here, I use an example to show how to implement the chi-square test algorithm for feature selection in Spark, which makes the algorithm scalable to very large datasets.
Given the following data in VW format:
label instanceWeight | feature1 feature2 feature3 feature4 …
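Before any counting can happen, each line has to be split into its label, weight, and features. Here is a minimal sketch in plain Python, assuming the simple layout shown above: one label, one instance weight, a `|` separator, and whitespace-separated feature names with no namespaces or feature values. The function name `parse_vw_line` and the default weight of 1.0 are my own assumptions for illustration, not part of VW or Spark.

```python
def parse_vw_line(line):
    """Split 'label weight | f1 f2 ...' into (label, weight, features).

    Hypothetical helper: assumes the simple layout shown in this post,
    with no namespaces and no per-feature values.
    """
    # Everything before the first '|' is the header, everything after is features.
    header, _, feature_part = line.partition("|")
    header_tokens = header.split()
    label = header_tokens[0]
    # Assumption: if the weight is missing, treat the instance weight as 1.0.
    weight = float(header_tokens[1]) if len(header_tokens) > 1 else 1.0
    features = feature_part.split()
    return label, weight, features


example = parse_vw_line("1 2.0 | feature1 feature3")
```

A function like this could be applied to every line of the input, for example inside a `map` over the lines of a text file, to produce the per-instance records that the counting step needs.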
Feature selection is an important problem in machine learning. Many feature selection methods are available, such as mutual information, information gain, and the chi-square test. In this post, I will use simple examples to describe how to conduct feature selection using the chi-square test. I will show that it is easy to use Spark or MapReduce to conduct chi-square-based feature selection on large-scale data sets.
Suppose there are N instances and two classes, positive and negative. Given a feature X, we can use the chi-square test to evaluate how well it distinguishes the two classes.
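To make this concrete, here is a minimal sketch of the chi-square statistic for one binary feature and two classes, using the standard closed form for a 2x2 contingency table. The count names `a`, `b`, `c`, `d` are my own notation (an assumption, not taken from the text above): `a` and `b` are the positive and negative instances that contain X, and `c` and `d` are the positive and negative instances that do not, so that a + b + c + d = N. Returning 0.0 when a row or column total is zero is also an assumption made for convenience.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table of feature presence vs. class.

    a: positive instances containing the feature
    b: negative instances containing the feature
    c: positive instances without the feature
    d: negative instances without the feature
    (These names are illustrative assumptions.)
    """
    n = a + b + c + d
    # Product of the two row totals and the two column totals.
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # Assumption: a feature or class that never varies carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# A feature that appears equally often in both classes scores zero.
uninformative = chi_square_2x2(10, 10, 10, 10)
# A feature that appears only in positive instances scores high.
informative = chi_square_2x2(20, 0, 0, 20)
```

A larger value means the feature's presence depends more strongly on the class, so features can be ranked by this score. Only the four counts per feature are needed, which is why the computation fits a distributed counting job well.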