Spark: Solve Task not serializable Exception

One of the most common exceptions you will encounter when using Spark is the Task not serializable exception:

org.apache.spark.SparkException: Task not serializable

This exception happens when you create a non-serializable object on the driver and then try to use it inside a function that runs on the executors.

Here is an example to produce such an exception:

Suppose we have a non-serializable class named MyTokenlizer:
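A minimal sketch of what such a class might look like (the `tokenize` method and its behavior are assumptions for illustration; the original only gives the class name):

```scala
// A plain class that does NOT extend Serializable.
class MyTokenlizer {
  // Hypothetical method: split a line into lowercase tokens.
  def tokenize(line: String): Array[String] =
    line.toLowerCase.split("\\s+")
}
```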

You submit a spark job like this:
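For example, a driver program along these lines would trigger the problem (the input/output paths and job name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TokenizeJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tokenize"))

    // The object is created on the driver...
    val tokenizer = new MyTokenlizer

    // ...but the closure passed to flatMap() captures `tokenizer`
    // and must be shipped to the executors. Since MyTokenlizer is
    // not serializable, Spark fails when serializing the task.
    val tokens = sc.textFile("input.txt")
                   .flatMap(line => tokenizer.tokenize(line))

    tokens.saveAsTextFile("output")
    sc.stop()
  }
}
```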

Now you will get the org.apache.spark.SparkException: Task not serializable exception.

To solve this exception, you can make the class serializable by implementing the Serializable interface:
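In Scala this is a one-line change (the method body is the same illustrative sketch as above):

```scala
// Extending Serializable lets Spark serialize instances
// captured in a closure and ship them to the executors.
class MyTokenlizer extends Serializable {
  def tokenize(line: String): Array[String] =
    line.toLowerCase.split("\\s+")
}
```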

Now the program runs perfectly.

What if the class comes from a third-party library?

If the class comes from a third-party library, you have no control over its definition. In this case you can create the object inside the map function instead:
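A sketch of this approach, reusing the hypothetical tokenizer from above:

```scala
val tokens = sc.textFile("input.txt").flatMap { line =>
  // A new instance is created on the executor for every record,
  // so nothing non-serializable is captured from the driver.
  val tokenizer = new MyTokenlizer
  tokenizer.tokenize(line)
}
```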

This works, but it is not efficient: a new object is created for every line of data in the RDD. Instead, we can use the mapPartitions method:
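With mapPartitions, the function receives an iterator over a whole partition, so the object can be created once and reused for every record in that partition (same hypothetical tokenizer as above):

```scala
val tokens = sc.textFile("input.txt").mapPartitions { lines =>
  // One instance per partition, created on the executor.
  val tokenizer = new MyTokenlizer
  lines.flatMap(line => tokenizer.tokenize(line))
}
```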

In this case, the object is created only once per partition of the dataset, rather than once per record.