Feature engineering in PySpark

We often need to apply feature transformations to build a training dataset before training a model.

Here are some good examples showing how to transform your data with a Spark DataFrame, especially if you need to derive new features from other columns.

Encode and assemble multiple features in PySpark


First of all, StringIndexer:

Next OneHotEncoder:

VectorIndexer and VectorAssembler:

Finally you can wrap all of that using pipelines:

Arguably, this is a much more robust and clean approach than writing everything from scratch. There are some caveats, especially when you need consistent encoding between different datasets. You can read more in the official documentation for StringIndexer and VectorIndexer.

Regarding your questions:

make a UDF with similar functionality that I can use in a Spark SQL query (or some other way, I suppose)

It is just a UDF like any other. Make sure you use supported types, and beyond that everything should work just fine.

take the RDD resulting from the map described above and add it as a new column to the user_data dataframe?

Note:

For Spark 1.x replace pyspark.ml.linalg with pyspark.mllib.linalg.


Adding column to PySpark DataFrame depending on whether column value is in another column

I have a PySpark DataFrame with structure given by

I need to add a further column with 1 or 0 depending on whether ‘item’ is in ‘fav_items’ or not.

So I would want

How would I look up the value of the second column in the third column to decide the value, and how would I then add that column?

The following code does the requested task. A user-defined function receives two columns of the DataFrame as parameters; for each row, it searches for the item in the item list and returns 1 if the item is found, otherwise 0.

Here are the results:


Pandas: create two new columns in a dataframe with values calculated from a pre-existing column

I’d just use zip:
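A sketch of the zip pattern, with a hypothetical `calculate` function and sample data standing in for the question's own column and logic:

```python
import pandas as pd

df = pd.DataFrame({"text": ["small", "large"]})

# Hypothetical helper returning two values per input row.
def calculate(value):
    return len(value), value.upper()

# zip(*...) transposes the sequence of 2-tuples produced by map into
# two sequences, which are assigned to the two new columns at once.
df["length"], df["upper"] = zip(*df["text"].map(calculate))
```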