.
  • pySpark check if file exists

    When using Spark, we often need to check whether an HDFS path exists before loading the data, because if the path is not valid, we will get the following exception:

    org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://…xxx matches 0 files

    In this post, I describe two methods to check whether an HDFS path exists in pyspark.

    The first solution is to put the loading code into a try block: we try to read the first element from the RDD. If the file does not exist, a Py4JJavaError is raised; we catch the error and return False.
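
    Here is a minimal sketch of that first method (the helper name path_exists is mine, not from the post):

        from py4j.protocol import Py4JJavaError

        def path_exists(sc, path):
            """Return True if the HDFS path can be read, False otherwise."""
            try:
                # Reading the first element forces Spark to actually touch the path.
                sc.textFile(path).first()
                return True
            except Py4JJavaError:
                # Raised (wrapping InvalidInputException) when the path matches no files.
                return False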

    [Read More...]
  • run pyspark on oozie

    In this post, I first give a workable example of running pySpark on oozie. Then I show how to run pyspark on oozie using your own python installation (e.g., anaconda). In this way, you can use numpy, pandas, and other python libraries in your pyspark program.

    The syntax for creating a spark action in an oozie workflow

    As described in the documentation, here are the meanings of these elements.

    The prepare element, if present, indicates a list of paths to delete or create before starting the job. Specified paths must start with hdfs://HOST:PORT.
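
    As a hedged illustration of the action syntax (the workflow, script, and path names below are made up, not the post's):

        <action name="pyspark-job">
            <spark xmlns="uri:oozie:spark-action:0.1">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <prepare>
                    <!-- paths must start with hdfs://HOST:PORT -->
                    <delete path="hdfs://HOST:PORT/user/me/output"/>
                </prepare>
                <master>yarn-cluster</master>
                <name>pyspark-example</name>
                <jar>my_script.py</jar>
            </spark>
            <ok to="end"/>
            <error to="fail"/>
        </action>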

    [Read More...]
  • pyspark unit test based on python unittest library

    pyspark unit test

    Pyspark is a powerful framework for large-scale data analysis. Because of its easy-to-use API, you can easily develop pyspark programs if you are familiar with Python programming.

    One problem is that it is a little hard to do unit testing for pyspark. After some Google searching for “pyspark unit test”, I only found articles about using py.test or other complicated libraries for pyspark unit testing. However, I don’t want to install any other third-party libraries. What I want is to set up the pyspark unit test environment based only on the standard unittest library.
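
    A minimal sketch of such a setup (class and test names are illustrative):

        import unittest

        from pyspark import SparkConf, SparkContext

        class PySparkTestCase(unittest.TestCase):
            @classmethod
            def setUpClass(cls):
                # One local SparkContext shared by every test in the class.
                conf = SparkConf().setMaster("local[2]").setAppName("pyspark-unit-test")
                cls.sc = SparkContext(conf=conf)

            @classmethod
            def tearDownClass(cls):
                cls.sc.stop()

            def test_word_count(self):
                counts = dict(
                    self.sc.parallelize(["a b", "b c"])
                        .flatMap(lambda line: line.split())
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda x, y: x + y)
                        .collect()
                )
                self.assertEqual(counts["b"], 2)

        if __name__ == "__main__":
            unittest.main()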

    [Read More...]
  • Learn spark by examples (2)

    In the previous post, we introduced Spark and the RDD, and showed how to use RDDs to do basic data analysis. In this post, I will show more examples of how to use RDD methods.

    Spark RDD reduceByKey Method

    We have used reduceByKey to solve the word frequency calculation problem. Here I will use a more complicated example to show how to use reduceByKey.

    Suppose we have a set of tweets, each shared by different users. We also give each user a weight denoting their importance.

    Here is an example of the data set.
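
    The post's actual data set is behind the link; a hedged sketch with made-up (tweet, user weight) pairs shows the idea:

        from pyspark import SparkContext

        sc = SparkContext("local[2]", "reduceByKey-example")

        # (tweet_id, weight of the user who shared it); values are made up.
        shares = sc.parallelize([
            ("tweet1", 0.5), ("tweet1", 1.0),
            ("tweet2", 0.2), ("tweet2", 0.8), ("tweet2", 1.5),
        ])

        # Sum the weights of all users who shared each tweet.
        total_weight = shares.reduceByKey(lambda a, b: a + b)
        print(total_weight.collect())  # e.g. [('tweet1', 1.5), ('tweet2', 2.5)]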

    [Read More...]
  • Learn Spark by Examples

    In this post, I briefly introduce Spark and use examples to show how to use the popular RDD methods to analyze your data. You can refer to this post to set up the pySpark environment using IPython Notebook.

    SparkContext

    SparkContext, or the Spark context, is the entry point for developing a spark application using the spark infrastructure.

    Once a SparkContext object is created, it sets up the internal services and builds a connection to the cluster manager, which manages the actual executors that carry out the specific computations.

    A diagram in the Spark documentation visualizes the spark architecture.

    The SparkContext object is usually referenced by the variable sc.
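
    For example (the master URL and app name below are assumptions):

        from pyspark import SparkContext

        # Bind the entry point to the conventional name `sc`.
        sc = SparkContext(master="local[2]", appName="learn-spark")

        rdd = sc.parallelize(range(10))
        print(rdd.sum())  # 45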

    [Read More...]
  • Pyspark broadcast variable Example

    Pyspark broadcast variable

    Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The concept of broadcast variables is similar to Hadoop’s distributed cache.

    When to use a broadcast variable

    The best case for using a broadcast variable is when you want to join two tables and one of them is small. With a broadcast variable, we can implement a map-side join, which is much faster than a reduce-side join because it avoids the expensive shuffle.

    Suppose we have the following RDD, and we want to join it with another RDD.
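
    A hedged sketch of such a map-side join (the RDD contents and names below are made up):

        from pyspark import SparkContext

        sc = SparkContext("local[2]", "broadcast-join")

        # Large RDD: (user_id, action) pairs; the data is made up.
        events = sc.parallelize([(1, "click"), (2, "view"), (1, "view")])

        # Small lookup table, broadcast once to every executor instead of shuffled.
        names = sc.broadcast({1: "alice", 2: "bob"})

        # Map-side join: each task looks keys up in the locally cached dict.
        joined = events.map(lambda kv: (kv[0], names.value.get(kv[0]), kv[1]))
        print(joined.collect())
        # [(1, 'alice', 'click'), (2, 'bob', 'view'), (1, 'alice', 'view')]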

    [Read More...]
  • How to set up an ipython notebook server to run spark in local or yarn mode

    IPython notebook is a powerful tool for learning python programming. In this post, I demonstrate how to set up an ipython notebook to run spark programs in python.

    1. Install spark
      Suppose spark is installed at the directory ~/spark, then execute:
    2. Install anaconda at ~/anaconda

      This compresses all the anaconda files into a zip file.

      Run ipython notebook for pyspark in local mode

    3. Now you can start an ipython notebook server in local mode (see the sketch after this list). WORKSPACE_DIR is the directory where you want to save your code;
      CONFIG_FILE is the location of the jupyter_notebook_config file.
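
    The post's exact commands are behind the link; a hedged sketch of a typical setup (paths and port are assumptions):

        export SPARK_HOME=~/spark
        export PYSPARK_DRIVER_PYTHON=~/anaconda/bin/jupyter
        export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8888 --notebook-dir=$WORKSPACE_DIR --config=$CONFIG_FILE"
        # Launch pyspark; the driver then starts the notebook server in local mode.
        $SPARK_HOME/bin/pyspark --master "local[2]"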
    [Read More...]