A Spark Program Using Scopt to Parse Arguments
When developing a Spark program, we often need to read arguments from the command line. Scopt is a popular and easy-to-use argument parser for Scala. In this post, I provide a working example that shows how to use Scopt to read arguments in a Spark program written in Scala, and then describe how to run the Spark job in yarn-cluster mode.
The main contents of this post include:
- Use the Scopt option parser to parse arguments in a Scala program.
- Use sbt to package the Scala program.
- Run Spark in yarn-cluster mode with third-party libraries.
Use Scopt to parse arguments in a Scala program
In the following program, we use Scopt to parse command-line arguments. The program accepts two arguments, `input` and `output`, and is intended to be run like this:
```sh
spark-submit --class the_scala_program_Main_class_Name the_program.jar --input /tmp/xx/spark_test_input --output /tmp/xx/spark_test_out
```
To use Scopt, the first step is to define our own case class to hold the parsed options. Here I define a `Config` case class with two fields: `input` and `output`.
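It is a plain case class with default values; this is the same definition used in the full program below:

```scala
// Holds the parsed command-line arguments
case class Config(input: String = null, output: String = null)
```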
Then, in the Main object, we build a `scopt.OptionParser` parameterized with our `Config` class, like this:
```scala
val parser = new scopt.OptionParser[Config]("scopt") {
  opt[String]('f', "input") required() action { (x, c) =>
    c.copy(input = x)
  } text("input is the input path")
  // ... more options ...
}
```
Then we call:

```scala
parser.parse(args, Config()) map { config =>
  // everything is good; we can call config.input to get the input value
} getOrElse {
  // no arguments provided
}
```
Please refer to the complete program below for details, and see https://github.com/scopt/scopt for more on how to use Scopt.
Develop a Spark program using sbt
We will develop the program with sbt, since sbt makes it easy to package the Spark program into a jar file.
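For reference, here is a minimal build.sbt sketch. The Scala, Spark, and scopt versions below are assumptions inferred from the jar names used later in this post (Scala 2.10, Spark 1.5.x, scopt 3.5.0); adjust them to match your environment:

```scala
// build.sbt -- minimal sketch; the versions here are assumptions, adjust as needed
name := "test_spark"

version := "1.0"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  // "provided": the Spark runtime on the cluster already supplies these classes
  "org.apache.spark" %% "spark-core" % "1.5.1" % "provided",
  // scopt is not on the cluster, so it has to be shipped with the job (see --jars below)
  "com.github.scopt" %% "scopt" % "3.5.0"
)
```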
The following simple Spark program shows the whole flow of using Scopt for argument parsing. We first parse the arguments to get the input and output paths, then build a SparkContext, load the data from the input path, and save it to the output path.
```scala
package my.spark

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

import scopt.OptionParser

case class Config(input: String = null, output: String = null)

object Main extends App {
  println("hello Scopt")

  val parser = new scopt.OptionParser[Config]("scopt") {
    head("scopt", "3.x")
    opt[String]('f', "input") required() action { (x, c) =>
      c.copy(input = x)
    } text("input is the input path")
    opt[String]('o', "output") required() action { (x, c) =>
      c.copy(output = x)
    } text("output is the output path")
  }

  // parser.parse returns Option[C]
  parser.parse(args, Config()) map { config =>
    // do stuff
    val input = config.input
    val output = config.output
    println("input==" + input)
    println("output=" + output)

    val sc = new SparkContext()
    val rdd = sc.textFile(input)
    rdd.saveAsTextFile(output)
  } getOrElse {
    // arguments are bad, the usage message will have been displayed
  }
}
```
Package the Spark jar
We can use the following command to compile and package the program with sbt:

```sh
sbt compile package
```
You will find the jar file under the `target` folder.
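With the build definition sketched earlier, the jar would end up at a path like the one below (the exact name depends on your name, version, and scalaVersion settings):

```sh
target/scala-2.10/test_spark_2.10-1.0.jar
```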
Run Spark with the classpath
Now we can run the Spark program using the spark-submit command. However, since the program depends on a third-party library, the submission will fail with a class-not-found exception for `scopt.OptionParser` if we don't ship that library with the job. Fortunately, we can use the `--jars` option to specify third-party libraries when submitting a Spark job. The following example shows how to submit a Spark job with dependent libraries.
1. Configure the Spark environment:
```sh
-bash-4.1$ export SPARK_HOME=/home/xx/spark/1.5
-bash-4.1$ export HADOOP_HOME=/home/xx/hadoop/current
-bash-4.1$ export PATH=$SPARK_HOME/bin:$PATH
-bash-4.1$ SPARK_CONF_HOME=/homes/xx/start_spark
-bash-4.1$ export SPARK_CONF_DIR=$SPARK_CONF_HOME/sparkconf1.5_scala
```
2. Use the `--jars` option to specify the third-party libraries:
```sh
spark-submit --jars scopt_2.10-3.5.0.jar,other_jars.jar \
  --name "Spark-APP" \
  --master yarn --deploy-mode cluster \
  --class my.spark.Main \
  test_spark_2.10-1.0.jar \
  --input /tmp/xx/spark_test_input --output /tmp/xx/spark_test_out5
```
Another example:
```sh
./bin/spark-submit --class my.main.Class \
  --master yarn-cluster \
  --jars my-other-jar.jar,my-other-other-jar.jar \
  my-main-jar.jar \
  app_arg1 app_arg2
```
Reference:
https://spark.apache.org/docs/1.5.1/running-on-yarn.html