• spark submit multiple jars

    It is straightforward to include a single dependency jar file when submitting Spark jobs. See the following example:

    How about including multiple jars? Suppose I want to include all the jars under a directory, like this: ./lib/*.jar.

    According to spark-submit's --help, the --jars option expects a comma-separated list of local jars to include on the driver and executor classpaths.

    However, the shell expands ./lib/*.jar into a space-separated list of jars.

    According to this answer on StackOverflow, there are several ways to generate a comma-separated list of jars.
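
    As an illustration of the same idea in Scala (a sketch only; the lib directory name comes from the example above), you can collect the jar paths and join them with commas before passing the result to --jars:

    // List every jar under ./lib and join the paths with commas,
    // producing the value expected by spark-submit --jars.
    // Assumes ./lib exists and contains the jars.
    import java.io.File

    object JarList {
      def main(args: Array[String]): Unit = {
        val jars = new File("lib")
          .listFiles()
          .filter(_.getName.endsWith(".jar"))
          .map(_.getPath)
          .mkString(",")        // e.g. lib/a.jar,lib/b.jar,lib/c.jar
        println(jars)           // pass this string to --jars
      }
    }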

    [Read More...]
  • Count word frequency

    Counting word frequency is a popular task in text analysis. In this post, I describe how to count word frequencies using a Java HashMap, a Python dictionary, and Spark.

    Use Java HashMap to Count Word Frequency

    {a=5, b=2, c=6, d=3}

    Use Python Dict to Count Word Frequency

    The output:

    {'a': 5, 'c': 6, 'b': 2, 'd': 3}

    Use Spark to Count Word Frequency

    The above methods work well for small datasets. However, if you have a huge dataset, a single-machine, hash-table-based method will not work, and you will need a distributed program to accomplish the task.
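
    As a rough sketch of the distributed approach (assuming an existing SparkContext named sc and an input file words.txt, neither of which is from the post itself), the Spark version boils down to a map and a reduceByKey:

    // Minimal Spark word-count sketch in Scala.
    val counts = sc.textFile("words.txt")
      .flatMap(line => line.split("\\s+"))   // break each line into words
      .map(word => (word, 1))                // emit (word, 1) pairs
      .reduceByKey(_ + _)                    // sum the counts per word
    counts.collect().foreach(println)        // e.g. (a,5), (b,2), (c,6), (d,3)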

    [Read More...]
  • Run spark on oozie with command line arguments

    We have described how to use oozie to run a pyspark program. This post will use a simple example to show how to use oozie to run a spark program in scala.

    You might be interested in: 1. developing a Spark program using SBT; 2. parsing arguments for a Spark program using Scopt.

    Here are the key points of this post:

    1. A workable example showing how to use the oozie spark action to run a spark program
    2. How to specify third-party libraries in oozie
    3. How to specify command line arguments to the spark program in oozie (a minimal driver sketch follows this list)
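
    For the third point, the arguments listed in the workflow arrive in the program's args array in the same order. Here is a minimal driver sketch (the class name and paths are hypothetical, not the post's code):

    // Each <arg> element of the oozie spark action becomes one element of args, in order.
    import org.apache.spark.{SparkConf, SparkContext}

    object OozieSparkExample {
      def main(args: Array[String]): Unit = {
        val Array(inputPath, outputPath) = args            // expects exactly two <arg> values
        val sc = new SparkContext(new SparkConf().setAppName("OozieSparkExample"))
        sc.textFile(inputPath).saveAsTextFile(outputPath)  // trivial copy job
        sc.stop()
      }
    }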

    The following code shows the content of the workflow.xml file,

    [Read More...]
  • A Spark program using Scopt to Parse Arguments

    To develop a Spark program, we often need to read arguments from the command line. Scopt is a popular and easy-to-use argument parser. In this post, I provide a workable example to show how to use the scopt parser to read arguments for a spark program in scala. Then I describe how to run the spark job in yarn-cluster mode.

    The main contents of this post include:

    1. Use the scopt option parser to parse arguments for a Scala program (a minimal sketch follows this list).
    2. Use sbt to package the Scala program.
    3. Run Spark in yarn-cluster mode with third-party libraries.
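
    For the first point, here is a minimal sketch of the scopt (3.x-style) pattern. The Config fields and option names are hypothetical, not the post's exact program.

    // A hedged sketch of a scopt option parser; adjust options as needed.
    case class Config(input: String = "", output: String = "", iterations: Int = 10)

    object ScoptExample {
      def main(args: Array[String]): Unit = {
        val parser = new scopt.OptionParser[Config]("ScoptExample") {
          opt[String]("input").required()
            .action((x, c) => c.copy(input = x)).text("input path")
          opt[String]("output").required()
            .action((x, c) => c.copy(output = x)).text("output path")
          opt[Int]("iterations")
            .action((x, c) => c.copy(iterations = x)).text("number of iterations")
        }
        parser.parse(args, Config()) match {
          case Some(config) => println(s"parsed arguments: $config")
          case None         => sys.exit(1) // scopt has already printed the usage text
        }
      }
    }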

    Use Scopt to parse arguments in a scala program

    In the following program,

    [Read More...]
  • run pyspark on oozie

    In this post, I first give a workable example of running pyspark on oozie. Then I show how to run pyspark on oozie using your own Python installation (e.g., Anaconda). In this way, you can use numpy, pandas, and other Python libraries in your pyspark program.

    The syntax for creating a spark action in an oozie workflow

    As described in the documentation, here are the meanings of these elements.

    The prepare element, if present, indicates a list of paths to delete or create before starting the job. Specified paths must start with hdfs://HOST:PORT.

    [Read More...]
  • Learn spark by examples (2)

    In the previous post, we introduced Spark, the RDD, and how to use RDDs to do basic data analysis. In this post, I will show more examples of how to use RDD methods.

    Spark RDD reduceByKey Method

    We have used reduceByKey to solve the word frequency calculation problem. Here I will use a more complicated example to show how to use reduceByKey.

    Suppose we have a set of tweets, each shared by a number of different users. We also give each user a weight denoting his or her importance.
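
    For concreteness, here is a hedged sketch with made-up records (not the post's data): reduceByKey sums the weights of the users who shared each tweet.

    // Each record is (tweetId, weightOfTheSharingUser); values are invented.
    val shares = sc.parallelize(Seq(
      ("tweet1", 0.5), ("tweet1", 1.0),
      ("tweet2", 0.2), ("tweet2", 0.7), ("tweet2", 0.1)
    ))
    val tweetScores = shares.reduceByKey(_ + _)  // total user weight per tweet
    tweetScores.collect().foreach(println)       // e.g. (tweet1,1.5), (tweet2,1.0)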

    Here is an example of the data set.

    [Read More...]
  • Parse libsvm data for spark MLlib

    The LibSVM data format is widely used in machine learning. Spark MLlib is a powerful tool for training large-scale machine learning models. If your data is already in LibSVM format, it is straightforward to use the loadLibSVMFile method to load your data into an RDD.

    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    However, in certain cases your data is not formatted in plain LibSVM. For example, you may have different models, and each model has its own labeled data. Suppose your data is stored in HDFS, and each line looks like this: (model_key, training_instance_in_libsvm_format).
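
    One possible direction (a sketch under assumptions, not the post's solution) is to split off the model key and parse the remaining libsvm-formatted text by hand into a LabeledPoint:

    // Parse "label index1:value1 index2:value2 ..." into a LabeledPoint,
    // assuming the total number of features is known in advance.
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    def parseLibSVMLine(line: String, numFeatures: Int): LabeledPoint = {
      val tokens = line.trim.split("\\s+")
      val label = tokens.head.toDouble
      val features = tokens.tail.map { token =>
        val Array(index, value) = token.split(":")
        (index.toInt - 1, value.toDouble)      // libsvm indices are 1-based
      }
      LabeledPoint(label, Vectors.sparse(numFeatures, features))
    }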

    In this case, 

    [Read More...]
  • Spark: Solve Task not serializable Exception

    One of the most frequently encountered exceptions when you use Spark is the Task not serializable exception:

    org.apache.spark.SparkException: Task not serializable

    This exception happens when you create a non-serializable object on the driver and then try to use it inside a closure that Spark ships to the executors (for example, in a map or reduce function).

    Here is an example to produce such an exception:

    Suppose we have a non-serializable class named MyTokenlizer:

    You submit a spark job like this:

    Now you will get the org.apache.spark.SparkException: Task not serializable exception.
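
    A minimal sketch of this pattern (the body of MyTokenlizer and the job are simplified guesses, and sc is an existing SparkContext):

    // The tokenizer is created on the driver, but the closure passed to map
    // captures it, so Spark must serialize it to ship the task to the executors.
    class MyTokenlizer {                       // does NOT extend Serializable
      def tokenize(text: String): Array[String] = text.split("\\s+")
    }

    val tokenlizer = new MyTokenlizer          // lives on the driver
    val tokenCounts = sc.textFile("input.txt")
      .map(line => tokenlizer.tokenize(line).length)  // closure captures tokenlizer
    tokenCounts.count()                        // throws: Task not serializable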

    To solve this Exception,

    [Read More...]
  • How to package a Scala project to a Jar file with SBT

    When you develop a Spark project using the Scala language, you have to package your project into a jar file. This tutorial describes how to use SBT
    to compile and run a Scala project, and how to package the project as a jar file. This will be helpful when you create a Spark project and need to package it into a jar file.

    The directory structure of a typical SBT project

    Here is an example of a typical SBT project, which has the following directory structure.

    .
    |-- build.sbt
    |-- lib
    |-- project
    |-- src
    |   |-- main
    |   |   |-- java (main Java source files)
    |   |   |-- resources (files to include in the main jar)
    |   |   |-- scala (main Scala source files)
    |   |-- test
    |       |-- java (test Java source files)
    |       |-- resources (files to include in the test jar)
    |       |-- scala (test Scala source files)
    |-- target

    You can use the following commands to create this directory structure:

    #!/bin/bash
    # brace expansion below requires bash rather than plain sh
    mkdir -p ~/hello_world
    cd ~/hello_world
    mkdir -p src/{main,test}/{java,resources,scala}
    mkdir -p lib project target

    # create an initial build.sbt file
    echo 'name := "MyProject"
    version := "1.0"
    scalaVersion := "2.10.0"' > build.sbt
    [Read More...]
  • How to set up an ipython notebook server to run spark in local or yarn mode

    Ipython notebook is a powerful tool for learning Python programming. In this post, I demonstrate how to set up an ipython notebook server to run Spark programs in Python.

    1. Install spark
      Suppose spark is installed in the directory ~/spark; then execute:
    2. Install anaconda at ~/anaconda

      This will compress all the anaconda files into a zip file.

      Run ipython notebook for pyspark in local mode

    3. Now you can start an ipython notebook server in local mode. WORKSPACE_DIR is the directory where you want to save your code;
      CONFIG_FILE is the location of the jupyter_notebook_config file.
    [Read More...]
Page 1 of 2