pyspark unit test based on python unittest library



Pyspark is a powerful framework for large-scale data analysis. Because of its easy-to-use API, you can develop pyspark programs easily if you are familiar with Python programming.

One problem is that it is a little hard to do unit testing for pyspark. After some Google searching for “pyspark unit test”, I only found articles about using py.test or other, more complicated, libraries for pyspark unit testing. However, I don’t want to install any additional third-party libraries. What I want is to set up the pyspark unit test environment based only on the unittest library, which is what the project currently uses.

Fortunately, I found a file in the Spark GitHub repository. Based on that code, I made a simple example here to describe the process of setting up a pyspark unit test environment. The advantage of this method is that the setup is extremely easy compared with pyspark unit testing based on other third-party libraries.

The pyspark unit test base class

There are two base classes defined for pyspark unit testing. Both of them extend the unittest.TestCase class. The first class is ReusedPySparkTestCase, which can reuse the sparkContext across all unit test methods, as the sparkContext sc is initialized in the setUpClass() method and stopped in the tearDownClass() method.

According to the Python documentation, setUpClass() and tearDownClass() are run once for the whole class, so we can share the initialized sparkContext across the test methods.

For PySparkTestCase, the sparkContext is initialized in the setUp() method and stopped in the tearDown() method, so each test method has its own sparkContext. This is because setUp() and tearDown() are run before and after each test method.

The advantage of using the ReusedPySparkTestCase class is that all the unit test methods in the same test class can share, or reuse, the same sparkContext. If you have many test methods, reusing the sparkContext can save time, as the initialization of a sparkContext is time consuming.
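Below is a minimal sketch of the two base classes, closely following the definitions in the Spark repository’s tests.py (the exact code there may differ slightly). I put them in a module named testbase, which is the module name used later in this post.

```python
# testbase.py -- a sketch based on pyspark/tests.py in the Spark repository
import unittest

from pyspark import SparkContext


class PySparkTestCase(unittest.TestCase):
    """Each test method gets its own SparkContext."""

    def setUp(self):
        class_name = self.__class__.__name__
        self.sc = SparkContext('local[4]', class_name)

    def tearDown(self):
        self.sc.stop()


class ReusedPySparkTestCase(unittest.TestCase):
    """All test methods in the class share one SparkContext."""

    @classmethod
    def setUpClass(cls):
        cls.sc = SparkContext('local[4]', cls.__name__)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()
```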

In the following code, I use simple examples to show that all the test methods share the same sparkContext when we extend the ReusedPySparkTestCase class. 
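Here is a sketch of what these test classes might look like, assuming the base classes above are saved in testbase.py. The class names TestResusedScA and TestResusedScB come from the post; printing id(self.sc) is simply my way of making the sharing visible in the output.

```python
# test_reused_sc.py (hypothetical file name)
import unittest

from testbase import ReusedPySparkTestCase


class TestResusedScA(ReusedPySparkTestCase):
    def test_a1(self):
        # Print the id of the shared SparkContext so it can be compared across methods.
        print("TestResusedScA.test_a1 sc id: %d" % id(self.sc))
        self.assertIsNotNone(self.sc)

    def test_a2(self):
        print("TestResusedScA.test_a2 sc id: %d" % id(self.sc))
        self.assertIsNotNone(self.sc)


class TestResusedScB(ReusedPySparkTestCase):
    def test_b1(self):
        print("TestResusedScB.test_b1 sc id: %d" % id(self.sc))
        self.assertIsNotNone(self.sc)

    def test_b2(self):
        print("TestResusedScB.test_b2 sc id: %d" % id(self.sc))
        self.assertIsNotNone(self.sc)


if __name__ == '__main__':
    unittest.main()
```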

The output of the pyspark unit test

From the following output, we can see that all the test methods of the TestResusedScA class share the same sparkContext, and all the test methods of the TestResusedScB class share the same sparkContext.

The output when tearDownClass() is called for TestResusedScA:

The output when tearDownClass() is called for TestResusedScB:

 

Run the pyspark unit test

To run the above unit tests for pyspark, we need to export the SPARK_HOME environment variable. Just run the following commands to start the pyspark unit test program.
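A minimal sketch of the commands, assuming Spark is installed under /path/to/spark and the test classes above are saved in test_reused_sc.py; both paths and the py4j version in the zip file name are assumptions that depend on your installation.

```bash
# Adjust the path to your Spark installation.
export SPARK_HOME=/path/to/spark
# Make the pyspark and py4j packages importable (the py4j version varies by Spark release).
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH

# Run the test module with the standard unittest runner.
python -m unittest -v test_reused_sc
```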

 

In each of the test methods, since we can get the sparkContext reference by calling self.sc, we can conduct more complicated tests using Spark RDDs and call the self.assert* methods to test our pyspark program.
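For example, a quick sanity check inside a test method might look like the following sketch (not from the original post):

```python
def test_rdd_sum(self):
    # self.sc is the SparkContext provided by the ReusedPySparkTestCase base class.
    rdd = self.sc.parallelize([1, 2, 3, 4])
    self.assertEqual(rdd.sum(), 10)
```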

A simple pyspark unit test example

In the following example, we develop a pyspark program to count the frequency of words in a set of sentences. Then we build a test class to test the program.

testbase is the python module that contains the definition of the ReusedPySparkTestCase class. 
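Here is a sketch of what the word count program and its test class might look like; the module name wordcount and the function name count_words are my own choices for illustration, not from the original post.

```python
# wordcount.py -- the program under test (hypothetical module name)
def count_words(rdd):
    """Return an RDD of (word, frequency) pairs for an RDD of sentences."""
    return (rdd.flatMap(lambda sentence: sentence.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
```

```python
# test_wordcount.py -- the test class, reusing the shared sparkContext
import unittest

from testbase import ReusedPySparkTestCase
from wordcount import count_words


class WordCountTest(ReusedPySparkTestCase):
    def test_count_words(self):
        sentences = ["hello spark", "hello pyspark unit test"]
        rdd = self.sc.parallelize(sentences)
        counts = dict(count_words(rdd).collect())
        self.assertEqual(counts["hello"], 2)
        self.assertEqual(counts["spark"], 1)
        self.assertEqual(counts["test"], 1)


if __name__ == '__main__':
    unittest.main()
```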

Reference:

The ReusedPySparkTestCase class is defined in the following file of the Spark GitHub repository. You can refer to this file for more examples of how to do pyspark unit testing.

https://github.com/apache/spark/blob/master/python/pyspark/tests.py
