Learn Spark by Examples
In this post, I briefly introduce Spark and use examples to show how to use the most common RDD methods to analyze your data. You can refer to this post to set up the pySpark environment using IPython Notebook.
SparkContext
SparkContext, or the Spark context, is the entry point for developing a Spark application on top of the Spark infrastructure.
Once a SparkContext object is created, it sets up the internal services and builds a connection to the cluster manager, which manages the actual executors that carry out the specific computations.
The following diagram from the Spark documentation visualizes the Spark architecture:
The SparkContext object is usually referenced as the variable sc. It can be used to create RDDs, accumulators, and broadcast variables, and to access Spark services.
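In the IPython Notebook setup mentioned above, sc is normally created for you. If you are writing a standalone script instead, a minimal sketch for creating your own SparkContext looks like the following (the application name "LearnSparkByExamples" and the local[*] master are arbitrary choices for illustration):

from pyspark import SparkConf, SparkContext

# "LearnSparkByExamples" is an arbitrary app name; local[*] runs Spark locally on all cores
conf = SparkConf().setAppName("LearnSparkByExamples").setMaster("local[*]")
sc = SparkContext(conf=conf)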
What is an RDD – Resilient Distributed Dataset
The core data structure in Spark is the RDD: the resilient distributed dataset. An RDD represents a dataset that is distributed across the cluster. An RDD object can be thought of as a collection of data items, which can be tuples, strings, dictionaries, and so on.
The data in an RDD can be stored in RAM, spilling to disk when there is not enough memory. Because the data can be kept in RAM, Spark is especially efficient for iterative computations such as machine learning model training and PageRank calculation.
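For example, if an RDD will be reused across several iterations, you can ask Spark to keep it in memory and spill to disk only when needed. A minimal sketch, assuming the sc variable from above:

from pyspark import StorageLevel

# keep the data in RAM, falling back to disk if it does not fit
cached = sc.parallelize(range(1000)).persist(StorageLevel.MEMORY_AND_DISK)
cached.count()  # the first action materializes and caches the RDD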
Similar to a DataFrame in pandas, a dataset can be loaded into an RDD, and then you can run any of the methods associated with it.
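For instance, one simple way to hand the rows of a small pandas DataFrame to Spark is to parallelize them as a list of dictionaries. This is just an illustrative sketch, not the only way to do it:

import pandas as pd

# a tiny DataFrame, purely for illustration
pdf = pd.DataFrame({"word": ["w1", "w2"], "freq": [3, 5]})

# turn each row into a dictionary and distribute the list as an RDD
rowsRdd = sc.parallelize(pdf.to_dict("records"))
rowsRdd.take(2)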
Learning Spark through Examples
Load your dataset into an RDD
Suppose your data is stored in a file, which can be a local file or an HDFS file. You can use sc.textFile(file_path) to create an RDD from the data.
rdd = sc.textFile("file:///homes/user/test.txt")
rdd.take(2)
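take(2) returns the first two lines. If you also want to know how many lines were loaded, count and first are two other simple actions (this assumes the test.txt file from the snippet above exists):

rdd.count()   # total number of lines in the file
rdd.first()   # the first line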
Use the parallelize method to turn a local collection into an RDD
ints = sc.parallelize(range(10))
ints.take(11)
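By default, parallelize splits the data across the available cores. You can also pass an explicit number of partitions; the 4 below is just an arbitrary choice for illustration:

partitioned = sc.parallelize(range(10), 4)
partitioned.getNumPartitions()  # returns 4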
Use the RDD map method to generate a new RDD from an existing one
squares = ints.map(lambda num: num * num)
squares.take(11)
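The function passed to map does not have to be a lambda; any ordinary Python function works as well. A small sketch (the to_label function is made up for illustration):

def to_label(num):
    # format each number as a labeled string
    return "value=%d" % num

labels = ints.map(to_label)
labels.take(3)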
Use the RDD filter method
even = ints.filter(lambda num: num % 2 == 0)
even.take(11)
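Note that map and filter are transformations, so they are lazy: Spark only runs the computation when an action such as take or collect is called. For example, the two transformations below are chained and nothing executes until collect:

# nothing is computed yet: both calls only build up the lineage
evenSquares = ints.filter(lambda num: num % 2 == 0).map(lambda num: num * num)
evenSquares.collect()  # the action triggers the actual computation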
Use the reduce method
# reduce is an action, so the result is a plain Python int, not an RDD
total = ints.reduce(lambda a, b: a + b)
total
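For common aggregations over numeric RDDs, Spark also provides convenience actions such as sum, count, and mean, so you do not always need to write the reduce yourself:

ints.sum()    # 45
ints.count()  # 10
ints.mean()   # 4.5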
Use the flatMap method
The flatMap method is useful when each element of an RDD is itself a collection and you want to flatten those collections into individual elements. In the following example, each element in the initial RDD is a tuple of words; after we call flatMap, we have a new RDD of individual words. This is useful for counting the frequency of words:
docsRdd = sc.parallelize([('w1', 'w2', 'w3'), ('w2', 'w3'), ('w4', 'w5'), ('w2', 'w4')])
docsRdd.take(4)
wordsRdd = docsRdd.flatMap(lambda line: line)
wordsRdd.take(10)
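If your documents were plain strings instead of tuples, the same idea applies: split each line inside flatMap. A small sketch with made-up lines:

linesRdd = sc.parallelize(["w1 w2 w3", "w2 w3", "w4 w5", "w2 w4"])
splitWords = linesRdd.flatMap(lambda line: line.split())
splitWords.take(10)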
Use the RDD reduceByKey method
To calculate the frequency of each word, we can use the reduceByKey method:
freqRdd = wordsRdd.map(lambda word: (word, 1)).reduceByKey(lambda wfa, wfb: (wfa + wfb))
freqRdd.take(10)
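As an aside, for a quick word count on a small RDD you can also use the countByValue action, which returns the counts to the driver as a Python dictionary rather than as an RDD:

# returns a dict-like object such as {'w1': 1, 'w2': 3, 'w3': 2, 'w4': 2, 'w5': 1}
dict(wordsRdd.countByValue())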
Sort the word list by frequency
sortedRdd = freqRdd.sortBy(lambda kv: kv[1], ascending=False)
sortedRdd.take(10)
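If you only need the top few words rather than a fully sorted RDD, the takeOrdered action does the same job in one step; negating the count makes it sort from most to least frequent:

# top 10 (word, count) pairs by descending frequency
freqRdd.takeOrdered(10, key=lambda kv: -kv[1])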
Sort the word list alphabetically
sortAlpha = freqRdd.sortByKey()
sortAlpha.take(10)
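sortByKey sorts in ascending order by default; pass ascending=False for reverse alphabetical order:

freqRdd.sortByKey(ascending=False).take(10)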
In the next post, I will describe other Spark RDD methods.