Spark | Learn for Master - Part 2
  • run pyspark on oozie

    In this post, I first give a working example of running pySpark on Oozie. Then I show how to run pyspark on Oozie using your own Python installation (e.g., Anaconda). In this way, you can use numpy, pandas, and other Python libraries in your pyspark program.

    The syntax for creating a Spark action in an Oozie workflow

    As described in the documentation, here are the meanings of these elements.

    The prepare element, if present, indicates a list of paths to delete or create before starting the job. Specified paths must start with hdfs://HOST:PORT.

    [Read More...]
  • Learn spark by examples (2)

    In the previous post, we introduced Spark, the RDD, and how to use RDDs to do basic data analysis. In this post, I will show more examples of how to use RDD methods.

    Spark RDD reduceByKey Method

    We have used reduceByKey to solve the word frequency calculation problem. Here I will use a more complicated example to show how to use reduceByKey.

    Suppose we have a set of tweets, each of which was shared by different users. We also give each user a weight denoting his or her importance.

    Here is an example of the data set.
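    The actual data set appears in the full post; as a rough, hypothetical sketch of the idea, a weighted reduceByKey in PySpark could look like the following (the (tweet, user) pairs and the user weights below are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "TweetWeightsSketch")

    # Hypothetical toy data: (tweet, user) share records, plus a weight per user.
    shares = sc.parallelize([
        ("tweet1", "alice"), ("tweet1", "bob"),
        ("tweet2", "alice"), ("tweet2", "carol"), ("tweet2", "dave"),
    ])
    user_weights = {"alice": 0.9, "bob": 0.3, "carol": 0.5, "dave": 0.1}

    # Map each share to (tweet, weight of the sharing user),
    # then sum the weights per tweet with reduceByKey.
    tweet_scores = (shares
                    .map(lambda tu: (tu[0], user_weights.get(tu[1], 0.0)))
                    .reduceByKey(lambda a, b: a + b))

    print(tweet_scores.collect())
    # e.g. [('tweet1', 1.2), ('tweet2', 1.5)] (order may vary)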

    [Read More...]
  • Learn Spark by Examples

    In this post, I briefly introduce Spark and use examples to show how to use the popular RDD methods to analyze your data. You can refer to this post to set up the pySpark environment using IPython Notebook.

    SparkContext

    SparkContext, or Spark context, is the entry point for developing a Spark application using the Spark infrastructure.

    Once a SparkContext object is created, it sets up the internal services and builds a connection to the cluster manager, which manages the actual executors that carry out the computations.

    The following diagram from the Spark documentation visualizes the Spark architecture:

    The SparkContext object is usually referenced as the variable sc,
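    As a minimal illustration (a sketch, not code from the post), creating and using a SparkContext as sc in a standalone PySpark script could look like this:

    from pyspark import SparkConf, SparkContext

    # "local[2]" and the application name are placeholder values for a local test run.
    conf = SparkConf().setMaster("local[2]").setAppName("MyApp")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(10))
    print(rdd.sum())  # 45

    sc.stop()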

    [Read More...]
  • Pyspark broadcast variable Example

    Pyspark broadcast variable

    Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The concept of broadcast variables is similar to Hadoop’s distributed cache.

    When to use broadcast variable

    The best case for using a broadcast variable is when you want to join two tables and one of them is small. By using a broadcast variable, we can implement a map-side join, which is much faster than a reduce-side join because it avoids the expensive shuffle.

    Suppose we have the following RDD, and we want to join it with another RDD.
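    The post's actual RDDs appear in the full article; the following is a hypothetical PySpark sketch of a broadcast-based map-side join, with a made-up small lookup table and a made-up RDD of events:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "BroadcastJoinSketch")

    # Small table (user_id -> country), broadcast once to every executor.
    small_table = {"u1": "US", "u2": "DE", "u3": "CN"}
    bc_small = sc.broadcast(small_table)

    # Large RDD of (user_id, event) records.
    events = sc.parallelize([("u1", "click"), ("u2", "view"), ("u3", "click"), ("u1", "view")])

    # Map-side join: each task looks up the key in the broadcast dict, so no shuffle is needed.
    joined = events.map(lambda kv: (kv[0], (kv[1], bc_small.value.get(kv[0]))))
    print(joined.collect())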

    [Read More...]
  • Parse libsvm data for spark MLlib

    The LibSVM data format is widely used in machine learning. Spark MLlib is a powerful tool for training large-scale machine learning models. If your data is well formatted in LibSVM, it is straightforward to use the loadLibSVMFile method to load your data into an RDD.

    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    However, in certain cases, your data is not well formatted in LibSVM. For example, you may have different models, and each model has its own labeled data. Suppose your data is stored in HDFS, and each line looks like this: (model_key, training_instance_in_libsvm_format).

    In this case, 
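    One possible way to handle such keyed lines is sketched below. This is not the post's code; the exact line format, the feature count, and the parse_line helper are assumptions for illustration. Each line is split into a model key and a LibSVM record, and the record is turned into a LabeledPoint with a sparse feature vector:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.linalg import SparseVector

    sc = SparkContext("local[2]", "KeyedLibSVMSketch")

    NUM_FEATURES = 10  # assumed dimensionality of the feature space

    def parse_line(line):
        # Assumed format: "model_key,label idx1:val1 idx2:val2 ..."
        model_key, libsvm_part = line.split(",", 1)
        tokens = libsvm_part.strip().split(" ")
        label = float(tokens[0])
        indices, values = [], []
        for tok in tokens[1:]:
            idx, val = tok.split(":")
            indices.append(int(idx) - 1)   # LibSVM feature indices are 1-based
            values.append(float(val))
        return (model_key, LabeledPoint(label, SparseVector(NUM_FEATURES, indices, values)))

    lines = sc.parallelize(["m1,1 1:0.5 3:1.2", "m2,0 2:0.7"])
    keyed_points = lines.map(parse_line)
    print(keyed_points.collect())

    With the data in this shape, the records for each model can be filtered or grouped by model_key before training a per-model MLlib model.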

    [Read More...]
  • Spark: Solve Task not serializable Exception

    One of the most frequently encountered exceptions when you use Spark is the Task not serializable exception:

    org.apache.spark.SparkException: Task not serializable

    This exception happens when you create a non-serializable object on the driver and then try to use it on the executors (for example, inside a reduce operation).

    Here is an example to produce such an exception:

    Suppose we have a non-serializable class named MyTokenlizer:

    You submit a spark job like this:

    Now you will get the org.apache.spark.SparkException: Task not serializable exception.

    To solve this Exception,

    [Read More...]
  • How to package a Scala project to a Jar file with SBT

    When you develop a Spark project using the Scala language, you have to package your project into a jar file. This tutorial describes how to use SBT
    to compile and run a Scala project, and to package the project as a jar file. This will be helpful when you create a Spark project and need to package it into a jar file.

    The directory structure of a typical SBT project

    Here is an example of a typical SBT project, which has the following directory structure.

    .
    |-- build.sbt
    |-- lib
    |-- project
    |-- src
    |   |-- main
    |   |   |-- java (main Java source files)
    |   |   |-- resources (files to include in the main jar)
    |   |   |-- scala (main Scala source files)
    |   |-- test
    |       |-- java (test Java source files)
    |       |-- resources (files to include in the test jar)
    |       |-- scala (test Scala source files)
    |-- target

    You can use the following commands to create this directory structure:

    #!/bin/sh
    cd ~/hello_world
    mkdir -p src/{main,test}/{java,resources,scala}
    mkdir lib project target
    
    # create an initial build.sbt file
    echo 'name := "MyProject"
    version := "1.0"
    scalaVersion := "2.10.0"' > build.sbt
    [Read More...]
  • How to setup ipython notebook server to run spark in local or yarn mode

    IPython notebook is a powerful tool for learning Python programming. In this post, I demonstrate how to set up an IPython notebook to run Spark programs in Python.

    1. Install spark
      Suppose Spark is installed at the directory ~/spark, then execute:
    2. Install anaconda at ~/anaconda

      This will compress all the anaconda files into a zip file.

      Run ipython notebook for pyspark using local mode

    3. Now you can start an IPython notebook server in local mode: WORKSPACE_DIR is the directory where you want to save your code,
      and CONFIG_FILE is the location of the jupyter_notebook_config file (a sketch of such a config file is shown below).
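    As a rough sketch (the option names reflect the classic notebook's NotebookApp settings, and the values are placeholders), a minimal jupyter_notebook_config.py might contain:

    # jupyter_notebook_config.py -- a minimal, hypothetical example
    c = get_config()

    c.NotebookApp.ip = '0.0.0.0'          # listen on all interfaces
    c.NotebookApp.port = 8888             # port of the notebook server
    c.NotebookApp.open_browser = False    # do not open a browser on the server machine
    c.NotebookApp.notebook_dir = '/path/to/WORKSPACE_DIR'  # where notebooks are saved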
    [Read More...]
  • Spark MLlib Example

    In this post, I will use an example to describe how to use pyspark, show how to train a Support Vector Machine, and use the model to make predictions using Spark MLlib.

    The following program was developed using IPython Notebook. Please refer to this article on how to set up an IPython Notebook server for PySpark, if you want to set up an ipython notebook server. You can also run the program using other Python IDEs such as Spyder or PyCharm.

    A simple example to demonstrate how to use sc,
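    As a minimal sketch of such a workflow (not the post's exact program; the data path, split ratio, and iteration count are placeholders), training and evaluating an SVM with Spark MLlib in PySpark could look like this:

    from pyspark import SparkContext
    from pyspark.mllib.classification import SVMWithSGD
    from pyspark.mllib.util import MLUtils

    sc = SparkContext("local[2]", "SVMExampleSketch")

    # Load the sample LibSVM data that ships with Spark (the path is a placeholder).
    data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    training, test = data.randomSplit([0.8, 0.2], seed=42)

    # Train a linear SVM with stochastic gradient descent.
    model = SVMWithSGD.train(training, iterations=100)

    # Evaluate: fraction of test points whose predicted label differs from the true label.
    preds_and_labels = test.map(lambda p: (model.predict(p.features), p.label))
    err = preds_and_labels.filter(lambda pl: pl[0] != pl[1]).count() / float(test.count())
    print("Test error: %f" % err)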

    [Read More...]