In the previous post, we introduced Spark, RDDs, and how to use RDDs for basic data analysis. In this post, I will show more examples of how to use RDD methods.
Spark RDD reduceByKey Method
We used reduceByKey to solve the word-frequency problem. Here I will use a more complicated example to show how to use reduceByKey.
Suppose we have a set of tweets, each of which was shared by different users. We also assign each user a weight denoting his or her importance.
Here is an example of the data set.
This post briefly introduces Spark and uses examples to show how to use the popular RDD methods to analyze your data. You can refer to this post to set up the pySpark environment using IPython Notebook.
SparkContext, or Spark context, is the entry point for developing a Spark application using the Spark infrastructure.
Once a SparkContext object is created, it sets up the internal services and builds a connection to the cluster manager, which manages the actual executors that carry out the specific computations.
The following diagram from the Spark documentation visualizes the Spark architecture:
The SparkContext object is usually referenced as the variable sc,