A Spark Program Using Scopt to Parse Arguments


To develop a Spark program, we often need to read arguments from the command line. Scopt is a popular and easy-to-use argument parser. In this post, I provide a working example that shows how to use the scopt parser to read arguments for a Spark program in Scala. Then I describe how to run the Spark job in yarn-cluster mode.

The main contents of this post include:

  1. Use the scopt option parser to parse arguments in a Scala program.
  2. Use sbt to package the Scala program.
  3. Run Spark in yarn-cluster mode with third-party libraries.

Use Scopt to parse arguments in a Scala program

In the following program, we use Scopt to parse command-line arguments. One of the purposes of this program is to accept two arguments: input and output. The program is intended to be run like this:
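For example, assuming the application object is named SparkScoptExample and the packaged jar is sparkscoptexample_2.10-1.0.jar (both placeholder names), an invocation might look like this:

```bash
spark-submit --class SparkScoptExample sparkscoptexample_2.10-1.0.jar \
  --input /path/to/input --output /path/to/output
```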

To use Scopt, the first step is to define our own case class.

Here I define my own Config class, which is a case class with two parameters: input and output.
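A minimal sketch of such a case class, using empty strings as placeholder defaults:

```scala
// Holds the parsed command-line arguments; the defaults are placeholders.
case class Config(input: String = "", output: String = "")
```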

Then in the Main class, we build a scopt.OptionParser object with our Config class, like this:
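A sketch of the parser, assuming the scopt 3.x API and the Config class defined above; the program name, short option letters, and help texts are illustrative:

```scala
val parser = new scopt.OptionParser[Config]("SparkScoptExample") {
  head("SparkScoptExample", "1.0")
  // --input <path> is required and stored in Config.input
  opt[String]('i', "input").required()
    .action((x, c) => c.copy(input = x))
    .text("input path")
  // --output <path> is required and stored in Config.output
  opt[String]('o', "output").required()
    .action((x, c) => c.copy(output = x))
    .text("output path")
}
```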

Then we call the parser's parse method with the command-line arguments and a default Config; it returns an Option[Config] that is Some(config) on success and None on failure.
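A sketch of that call, continuing from the parser above:

```scala
parser.parse(args, Config()) match {
  case Some(config) =>
    // Arguments parsed successfully; use config.input and config.output here.
    println(s"input=${config.input}, output=${config.output}")
  case None =>
    // Bad arguments; scopt has already printed a usage message.
    sys.exit(1)
}
```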

Please refer to the following program for details. You can also refer to https://github.com/scopt/scopt for more details on how to use Scopt. 

Develop a Spark program using sbt

We will develop the program using sbt, since sbt makes it easy to package the Spark program into a jar file.
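A minimal build.sbt sketch for such a project; the project name and the Scala, Spark, and scopt versions below are assumptions chosen to match Spark 1.5.1, so adjust them to your cluster:

```scala
// build.sbt -- versions are assumptions; adjust to your environment.
name := "SparkScoptExample"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Spark itself is provided by the cluster at runtime.
  "org.apache.spark" %% "spark-core" % "1.5.1" % "provided",
  // scopt is a third-party dependency that we have to ship ourselves.
  "com.github.scopt" %% "scopt" % "3.3.0"
)
```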

The following is a simple Spark program showing the process of using Scopt for argument parsing.

We first parse the arguments to get the input and output paths. Then we build a SparkContext, load the data from the input path, and save the data to the output path.
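The following is a self-contained sketch that puts the pieces from the previous section together; the object name SparkScoptExample and the use of textFile/saveAsTextFile are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkScoptExample {

  // Holds the parsed command-line arguments.
  case class Config(input: String = "", output: String = "")

  def main(args: Array[String]): Unit = {
    val parser = new scopt.OptionParser[Config]("SparkScoptExample") {
      head("SparkScoptExample", "1.0")
      opt[String]('i', "input").required()
        .action((x, c) => c.copy(input = x))
        .text("input path")
      opt[String]('o', "output").required()
        .action((x, c) => c.copy(output = x))
        .text("output path")
    }

    parser.parse(args, Config()) match {
      case Some(config) =>
        val conf = new SparkConf().setAppName("SparkScoptExample")
        val sc = new SparkContext(conf)
        // Load the data from the input path and save it to the output path.
        val data = sc.textFile(config.input)
        data.saveAsTextFile(config.output)
        sc.stop()
      case None =>
        // scopt has already printed the usage message.
        sys.exit(1)
    }
  }
}
```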

Package the Spark jar

We can use the following command to package the Spark program with sbt:
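From the project root directory, for example:

```bash
sbt clean package
```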

You will get the jar file under the target folder.

Run Spark with the classpath

Now we can run the program using the spark-submit command. However, because the program depends on a third-party library, the submission will fail with a ClassNotFoundException for scopt.OptionParser if we do not put that library on the classpath. Fortunately, we can use the --jars option to specify third-party libraries when submitting a Spark job.

The following example shows how to submit a Spark job with dependent libraries.

  1. Configure the Spark environment.

  2. Use --jars to specify the third-party libraries.
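A sketch of both steps for yarn-cluster mode; the environment variable values, jar names, and paths are placeholders for your cluster:

```bash
# 1. Configure the Spark environment (placeholder paths).
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/opt/spark

# 2. Use --jars to ship the scopt jar along with the application jar.
$SPARK_HOME/bin/spark-submit \
  --class SparkScoptExample \
  --master yarn-cluster \
  --jars /path/to/scopt_2.10-3.3.0.jar \
  target/scala-2.10/sparkscoptexample_2.10-1.0.jar \
  --input /user/me/input --output /user/me/output
```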

Some other examples:
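For instance, a quick local test run might look like the following sketch; the master setting and paths are placeholders:

```bash
# Run locally for a quick test; --jars is still needed for the scopt dependency.
$SPARK_HOME/bin/spark-submit \
  --class SparkScoptExample \
  --master local[2] \
  --jars /path/to/scopt_2.10-3.3.0.jar \
  target/scala-2.10/sparkscoptexample_2.10-1.0.jar \
  --input /tmp/input.txt --output /tmp/output
```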

Reference:

https://spark.apache.org/docs/1.5.1/running-on-yarn.html