Spark Map-side Join | Learn for Master
  • PySpark broadcast variable example

    PySpark broadcast variable

    Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The concept is similar to Hadoop’s distributed cache.

    When to use a broadcast variable

    The best use case for a broadcast variable is joining two tables when one of them is small enough to fit in memory on each executor. Broadcasting the small table lets us implement a map-side join, which is much faster than a reduce-side join because it avoids the expensive shuffle.
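    As a rough illustration of the idea, here is a minimal sketch of a map-side join. The names join_with_lookup, broadcast_join, large_rdd and small_table are made up for this example and are not from the original post. It assumes the large dataset is an RDD of (key, value) pairs and the small table is a plain Python dict that fits in memory.

```python
def join_with_lookup(records, lookup):
    """Inner-join (key, value) records against a small in-memory dict.

    Yields (key, (value, lookup_value)), the same shape RDD.join produces.
    Records whose key is missing from the lookup are dropped.
    """
    for key, value in records:
        if key in lookup:
            yield key, (value, lookup[key])


def broadcast_join(sc, large_rdd, small_table):
    """Map-side join sketch (hypothetical helper).

    sc          -- an existing SparkContext
    large_rdd   -- RDD of (key, value) pairs
    small_table -- plain dict, assumed small enough to fit in memory
    """
    # Ship the small table to every executor once, instead of with every task.
    small_bc = sc.broadcast(small_table)
    # Each partition is joined locally against the broadcast dict,
    # so no shuffle of large_rdd is needed.
    return large_rdd.mapPartitions(
        lambda part: join_with_lookup(part, small_bc.value)
    )
```

    Given a running SparkContext sc, something like broadcast_join(sc, sc.parallelize(orders), users).collect() would perform the join. The per-partition logic lives in a plain generator, so it can be checked without a cluster.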

    Suppose we have the following RDD, and we want to join it with another RDD.

    [Read More...]