Adding Multiple Columns to Spark DataFrames

from: https://p058.github.io/spark/2017/01/08/spark-dataframes.html

I have been using Spark’s DataFrame API for quite some time, and I often want to add many columns to a DataFrame (for example, creating more features from existing features for a machine learning model). Writing many withColumn statements gets tedious, so I monkey patched the Spark DataFrame to make it easy to add multiple columns at once.

First, let’s create a udf_wrapper decorator to keep the code concise.
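The decorator itself is not preserved in this copy of the post; a minimal sketch, assuming it simply wraps pyspark.sql.functions.udf with a fixed return type, could look like this:

```python
from pyspark.sql.functions import udf

def udf_wrapper(return_type):
    """Decorator that turns a plain Python function into a Spark UDF
    with the given return type."""
    def wrapper(func):
        return udf(func, return_type)
    return wrapper
```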

Let’s create a Spark DataFrame with the columns user_id, app_usage (a map of each app to its number of sessions), and hours_active.
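A small example DataFrame along those lines; the sample data and the explicit schema are my own, for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, MapType, ArrayType)

spark = SparkSession.builder.appName("add-multiple-columns").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("app_usage", MapType(StringType(), IntegerType())),   # app -> number of sessions
    StructField("hours_active", ArrayType(IntegerType())),            # hours of the day the user was active
])

data = [
    ("u1", {"chrome": 10, "maps": 5}, [9, 13, 19]),
    ("u2", {"mail": 3, "music": 7},   [8, 12, 14]),
]

df = spark.createDataFrame(data, schema)
df.show(truncate=False)
```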

Now let’s add a column, total_app_usage, indicating the total number of sessions.
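Using the udf_wrapper decorator from above, one way to write the UDF and the withColumn call is:

```python
from pyspark.sql.types import IntegerType

@udf_wrapper(IntegerType())
def total_app_usage(app_usage):
    # sum the session counts over all apps
    return sum(app_usage.values())

df = df.withColumn("total_app_usage", total_app_usage(df["app_usage"]))
```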

Now let’s add another column, evening_user, indicating whether or not the user is active between hours 18 and 21.
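A sketch of that UDF, assuming hours_active holds the hours of the day the user was active:

```python
from pyspark.sql.types import BooleanType

@udf_wrapper(BooleanType())
def evening_user(hours_active):
    # True if the user was active during any hour from 18 to 21
    return any(18 <= h <= 21 for h in hours_active)

df = df.withColumn("evening_user", evening_user(df["hours_active"]))
```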

Instead of writing multiple withColumn statements, let’s create a simple utility function that applies multiple functions to multiple columns.
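The original helper is not reproduced in this copy; one way to write it, assuming it takes a mapping from each new column name to a (function, input columns) pair and is monkey patched onto DataFrame, is:

```python
from pyspark.sql import DataFrame

def add_columns(self, new_columns):
    """Add several columns in one call.

    `new_columns` maps each new column name to a (function, input_column_names)
    pair; the function can be a UDF or a Spark built-in column function.
    """
    df = self
    for col_name, (func, input_cols) in new_columns.items():
        df = df.withColumn(col_name, func(*[df[c] for c in input_cols]))
    return df

# monkey patch the helper onto DataFrame
DataFrame.add_columns = add_columns
```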

Now let’s use the add_columns method to add multiple columns at once.
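With the helper sketched above, both columns can be added in a single call:

```python
df = df.add_columns({
    "total_app_usage": (total_app_usage, ["app_usage"]),
    "evening_user":    (evening_user,    ["hours_active"]),
})
df.show(truncate=False)
```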

You can also use Spark’s built-in functions alongside your own UDFs. As shown above, you can apply a UDF to multiple columns by passing the existing columns as a list.
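For example, mixing the built-in size function with a hypothetical UDF that reads two input columns (both column names here are my own):

```python
from pyspark.sql.functions import size
from pyspark.sql.types import DoubleType

@udf_wrapper(DoubleType())
def sessions_per_active_hour(app_usage, hours_active):
    # a UDF over two input columns
    return sum(app_usage.values()) / float(len(hours_active))

df = df.add_columns({
    "n_active_hours":    (size, ["hours_active"]),                                   # Spark built-in
    "sessions_per_hour": (sessions_per_active_hour, ["app_usage", "hours_active"]),  # UDF on multiple columns
})
```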

You can always convert a DataFrame to an RDD, add columns there, and switch it back to a DataFrame, but I don’t find that to be a very neat way to do it.

Thanks for reading.