Big Data | Learn for Master - Part 3
  • A Spark program using Scopt to Parse Arguments

    To develop a Spark program, we often need to read arguments from the command line. Scopt is a popular and easy-to-use argument parser. In this post, I provide a workable example to show how to use the Scopt parser to read arguments for a Spark program in Scala. Then I describe how to run the Spark job in yarn-cluster mode.

    The main contents of this post include:

    1. Use the Scopt option parser to parse arguments for a Scala program.
    2. Use sbt to package the Scala program.
    3. Run Spark in yarn-cluster mode with third-party libraries.

    Use Scopt to parse arguments in a Scala program

    In the following program,

    [Read More...]
  • Run pyspark on Oozie

    In this post, I first give a workable example of running pySpark on Oozie. Then I show how to run pySpark on Oozie using your own Python installation (e.g., Anaconda). In this way, you can use numpy, pandas, and other Python libraries in your pySpark program.

    The syntax for creating a Spark action in an Oozie workflow

    As described in the documentation, here are the meanings of these elements.

    The prepare element, if present, indicates a list of paths to delete or create before starting the job. Specified paths must start with hdfs://HOST:PORT.
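    As a sketch, a Spark action with a prepare block might look like the following. All paths, names, and property values are placeholders, and the element order follows the spark-action 0.1 schema; adjust everything to your own cluster and workflow.

```xml
<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <!-- Deleted before the job starts; must be a full hdfs:// path -->
            <delete path="${nameNode}/user/${wf:user()}/output"/>
        </prepare>
        <master>yarn-cluster</master>
        <name>MySparkJob</name>
        <jar>${nameNode}/user/${wf:user()}/apps/my_pyspark_job.py</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```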

    [Read More...]
  • Pyspark unit test based on the Python unittest library

    pyspark unit test

    Pyspark is a powerful framework for large-scale data analysis. Because of its easy-to-use API, you can easily develop pyspark programs if you are familiar with Python programming.

    One problem is that it is a little hard to unit test pyspark code. After some Google searches for “pyspark unit test”, I only found articles about using py.test or other complicated libraries for pyspark unit testing. However, I don’t want to install any third-party libraries. What I want is to set up the pyspark unit test environment based only on the unittest library,
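    A minimal sketch of that setup, based only on unittest. The class name, data, and local master setting are illustrative, and the import is guarded so the module still loads on machines where pyspark is absent:

```python
import unittest

# Guard the import so this module still loads where pyspark is absent.
try:
    from pyspark import SparkConf, SparkContext
except ImportError:
    SparkContext = None

@unittest.skipIf(SparkContext is None, "pyspark is not installed")
class WordCountTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One local SparkContext shared by all tests in the class,
        # so we pay the startup cost only once.
        conf = SparkConf().setAppName("unittest-example").setMaster("local[2]")
        cls.sc = SparkContext(conf=conf)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_word_count(self):
        rdd = self.sc.parallelize(["a b", "a c"])
        counts = (rdd.flatMap(lambda line: line.split())
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda x, y: x + y)
                     .collectAsMap())
        self.assertEqual(counts["a"], 2)
```

    Run it with python -m unittest; setUpClass builds one shared local SparkContext instead of recreating it per test.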

    [Read More...]
  • Learn Spark by Examples (2)

    In the previous post, we already introduced Spark, RDD, and how to use RDDs to do basic data analysis. In this post, I will show more examples of how to use RDD methods.

    Spark RDD reduceByKey Method

    We have used reduceByKey to solve the word frequency calculation problem. Here I will use a more complicated example to show how to use reduceByKey.

    Suppose we have a set of tweets, each of which was shared by different users. We also give each user a weight denoting their importance.

    Here is an example of the data set.
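    As a preview of the computation, here is a plain-Python mirror of what reduceByKey does for this kind of data. The tweet ids and weights below are invented for illustration; in Spark the same lambda would be passed to rdd.reduceByKey.

```python
def reduce_by_key(pairs, func):
    """Plain-Python mirror of RDD.reduceByKey: merge the values of each key with func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

# Hypothetical data set: (tweet_id, user_weight) pairs, one per share.
shares = [("t1", 0.5), ("t2", 1.0), ("t1", 2.0), ("t2", 0.3), ("t1", 1.0)]

# Total importance of each tweet = sum of the weights of the users who shared it.
totals = reduce_by_key(shares, lambda x, y: x + y)
print(totals)  # {'t1': 3.5, 't2': 1.3}
```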

    [Read More...]
  • Learn Spark by Examples

    In this post, I briefly introduce Spark and use examples to show how to use the popular RDD methods to analyze your data. You can refer to this post to set up the pySpark environment using IPython Notebook.


    SparkContext, or Spark context, is the entry point for developing a Spark application using the Spark infrastructure.

    Once a SparkContext object is created, it sets up the internal services and builds a connection to the cluster manager, which manages the actual executors that conduct the specific computations.

    The following diagram from the Spark documentation visualizes the Spark architecture:

    The SparkContext object is usually referenced as the variable sc,
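    A minimal sketch of creating that sc variable, assuming a local pyspark installation; the app name and master setting are placeholders, and the import is guarded so the snippet loads even where pyspark is missing.

```python
# Assumes pyspark is installed; the guard lets this snippet load without it.
try:
    from pyspark import SparkConf, SparkContext
    HAVE_PYSPARK = True
except ImportError:
    HAVE_PYSPARK = False

def make_spark_context(app_name="example", master="local[2]"):
    """Build the SparkContext (the conventional sc) for a local application."""
    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)

# sc = make_spark_context()  # then sc.parallelize(...), sc.textFile(...), etc.
```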

    [Read More...]
  • Run Hadoop commands in Python

    Hadoop is the most widely used big data platform for big data analysis. It is easy to run Hadoop commands in a shell or a shell script. However, there is often a need to manipulate HDFS files directly from Python. We use examples to describe how to run Hadoop commands in Python to list and save HDFS files.

    We already know how to call an external shell command from Python. We can simply call Hadoop commands using the run_cmd method.
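    The run_cmd helper itself is not shown in this excerpt; a minimal sketch using the standard subprocess module might look like this (the hadoop fs path in the example is illustrative):

```python
import subprocess

def run_cmd(args_list):
    """Run a shell command and return (return_code, stdout, stderr) as strings."""
    proc = subprocess.Popen(args_list,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out.decode(), err.decode()

# Example: list files under an (illustrative) HDFS directory.
# ret, out, err = run_cmd(["hadoop", "fs", "-ls", "/user/data"])
```

    The same helper works for any command, e.g. run_cmd(["echo", "hello"]) returns a zero exit code and the echoed text.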

    Run Hadoop ls command in Python


    Run Hadoop get command in Python

    Run Hadoop put command in Python


    [Read More...]
  • Pyspark broadcast variable Example

    Pyspark broadcast variable

    Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The concept of broadcast variables is similar to Hadoop’s distributed cache.

    When to use broadcast variable

    The best case for using a broadcast variable is when you want to join two tables and one of them is small. With a broadcast variable, we can implement a map-side join, which is much faster than a reduce-side join because it avoids the expensive shuffle.

    Suppose we have the following RDD, and we want to join it with another RDD.
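    To show the idea, here is a plain-Python sketch of the map-side join logic; the table contents are invented for illustration. In pyspark you would wrap the small table with sc.broadcast(small_table) and read it via .value inside the map function, so each record is joined locally with no shuffle.

```python
# Small table: user_id -> user_name (would be sc.broadcast(small_table) in pyspark).
small_table = {1: "alice", 2: "bob"}

# Large "RDD" of (user_id, action) records.
records = [(1, "click"), (2, "view"), (1, "share"), (3, "click")]

def map_side_join(record, lookup):
    """Join one record against the broadcast lookup table; no shuffle needed."""
    user_id, action = record
    return (lookup.get(user_id, "unknown"), action)

# Equivalent of records_rdd.map(lambda r: map_side_join(r, broadcast.value))
joined = [map_side_join(r, small_table) for r in records]
print(joined)  # [('alice', 'click'), ('bob', 'view'), ('alice', 'share'), ('unknown', 'click')]
```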

    [Read More...]
  • Parse libsvm data for spark MLlib

    The LibSVM data format is widely used in machine learning. Spark MLlib is a powerful tool to train large-scale machine learning models. If your data is well formatted in LibSVM, it is straightforward to use the loadLibSVMFile method to load your data into an RDD.

    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    However, in certain cases, your data is not well formatted in LibSVM. For example, you may have different models, and each model has its own labeled data. Suppose your data is stored in HDFS, and each line looks like this: (model_key, training_instance_in_libsvm_format).
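    One way to handle such lines is to split off the key and parse the LibSVM instance yourself. The sketch below assumes a tab-separated layout and invented field values; adjust the split to match how your lines are actually stored.

```python
def parse_keyed_libsvm(line):
    """Split a 'model_key<TAB>label idx:val idx:val ...' line into
    (model_key, label, features). The tab separator is an assumption."""
    model_key, instance = line.split("\t", 1)
    parts = instance.split()
    label = float(parts[0])
    # LibSVM indices are 1-based; keep them as-is in a sparse dict.
    features = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
    return model_key, label, features

line = "model_a\t1.0 3:0.5 7:1.2"   # hypothetical input line
print(parse_keyed_libsvm(line))      # ('model_a', 1.0, {3: 0.5, 7: 1.2})
```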

    In this case, 

    [Read More...]
  • Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications

    patterns & practices Developer Center

    Containing twenty-four design patterns and ten related guidance topics, this guide articulates the benefit of applying patterns by showing how each piece can fit into the big picture of cloud application architectures. It also discusses the benefits and considerations for each pattern. Most of the patterns have code samples or snippets that show how to implement the patterns using the features of Microsoft Azure. However, the majority of topics described in this guide are equally relevant to all kinds of distributed systems,

    [Read More...]
  • Resources for Learning Big Data

    Big Data is a really hot topic! There are millions of job openings. This post aims to introduce some of the most important techniques related to Big Data.  If you want to learn Big Data and find a job related to Big Data, please read the articles and books recommended here. 

    The first key concept of Big Data is MapReduce, which is the core of Hadoop, an open-source framework that has become the foundation of the Big Data ecosystem.

    • MapReduce was invented by Google. It’s a paradigm for writing distributed systems inspired by some elements of functional programming.
    [Read More...]
Page 3 of 5