Parse LibSVM data for Spark MLlib


The LibSVM data format is widely used in machine learning, and Spark MLlib is a powerful tool for training large-scale machine learning models. If your data is already well formatted as LibSVM, you can use the loadLibSVMFile method to load it directly into an RDD[LabeledPoint].

import org.apache.spark.mllib.util.MLUtils
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

However, in some cases your data is not in pure LibSVM format. For example, you may have several models, each with its own labeled data. Suppose your data is stored in HDFS, and each line looks like this: (model_key, training_instance_in_libsvm_format).

In this case, you can store the data by model_key, so each model_key has its own data folder. Another method is to parse the data yourself. 

The following code shows how to parse libsvm data so that it can be used to train a model using Spark MLlib. 

Suppose we load the data using sc.textFile(), then split each line into two parts: (model_key: String, libsvm_data_line: String).
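The splitting step might look like the sketch below. The original post does not say how model_key and the LibSVM text are delimited, so a single tab is assumed here; KeyedLine and its split method are hypothetical names introduced for illustration.

```scala
// Hypothetical helper: splits one raw input line into
// (model_key, libsvm_data_line).
// ASSUMPTION: the two parts are separated by a single tab; adjust `sep`
// to whatever delimiter your files actually use.
object KeyedLine {
  def split(line: String, sep: Char = '\t'): (String, String) = {
    val i = line.indexOf(sep)
    require(i >= 0, s"missing separator in line: $line")
    // Everything before the first separator is the key; the rest is the
    // LibSVM-formatted training instance.
    (line.substring(0, i).trim, line.substring(i + 1).trim)
  }
}
```

With a SparkContext sc, this could be applied as sc.textFile(path).map(line => KeyedLine.split(line)), which yields an RDD[(String, String)].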

Now we can select the training data for a given model key and parse the LibSVM lines into an RDD[LabeledPoint].
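A minimal sketch of the per-line parsing is shown below, assuming each line has the usual LibSVM shape "label index:value index:value ..." with 1-based, ascending indices. LibSVMLine and its parse method are hypothetical names, not part of MLlib; the parsing is kept free of Spark types so it can be tried on its own.

```scala
// Hypothetical helper: parses one LibSVM-formatted line into
// (label, zero-based indices, values).
object LibSVMLine {
  def parse(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    // The first token is the label; the rest are index:value pairs.
    val label = tokens.head.toDouble
    val (indices, values) = tokens.tail.map { tok =>
      val sep = tok.indexOf(':')
      // LibSVM indices are 1-based; Spark sparse vectors are 0-based.
      (tok.substring(0, sep).toInt - 1, tok.substring(sep + 1).toDouble)
    }.unzip
    (label, indices, values)
  }
}
```

Given keyed, an RDD[(String, String)] of (model_key, libsvm_data_line) pairs, the training set for one model could then be built as keyed.filter(_._1 == modelKey).map { case (_, l) => val (label, idx, vals) = LibSVMLine.parse(l); LabeledPoint(label, Vectors.sparse(numFeatures, idx, vals)) }. Here LabeledPoint comes from org.apache.spark.mllib.regression and Vectors from org.apache.spark.mllib.linalg, while modelKey and numFeatures (the size of the feature space) are values you supply. Malformed lines will throw an exception in this sketch, so real data may need validation first.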