PySpark broadcast variable
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The concept is similar to Hadoop’s distributed cache.
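As a minimal sketch of the idea, the snippet below broadcasts a small lookup table and reads it inside a map. The data, the helper label_country and the driver function run_broadcast_demo are hypothetical names made up for illustration; the sketch also assumes a working local Spark installation.

```python
def label_country(record, country_names):
    """Replace a country code with its full name using a plain dict lookup."""
    name, code = record
    return (name, country_names.get(code, "unknown"))


def run_broadcast_demo():
    # Imported here so label_country stays usable without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Ship the lookup table to each executor once, instead of with every task.
    names_bc = sc.broadcast({"US": "United States", "FR": "France"})

    # Hypothetical sample data: (user, country code) pairs.
    users = sc.parallelize([("alice", "US"), ("bob", "FR"), ("carol", "JP")])

    # Tasks read the cached copy through .value and should treat it as read-only.
    labelled = users.map(lambda rec: label_country(rec, names_bc.value)).collect()

    # Release the cached copies on the executors once they are no longer needed.
    names_bc.unpersist()
    return labelled
```

Calling run_broadcast_demo() on a machine with Spark available runs the lookup on the executors; the broadcast value is only read there, never modified.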
When to use a broadcast variable
The best use case for a broadcast variable is joining two tables when one of them is small enough to fit in memory on every executor. By broadcasting the small table, we can implement a map-side join, which is much faster than a reduce-side join because it avoids the shuffle, and the shuffle is the expensive part.
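A map-side join of this kind might look like the sketch below. It is not the only way to do it, and it rests on a few assumptions: the small table has unique keys (collectAsMap keeps one value per key), inner-join semantics are wanted, and the sample RDDs and the names join_record and run_map_side_join are hypothetical.

```python
def join_record(record, small_table):
    """Join one (key, value) record of the large side against the small table.

    Returns a list so it can be used with flatMap: one joined pair when the
    key exists in the small table, nothing otherwise (inner-join semantics).
    """
    key, left_value = record
    if key in small_table:
        return [(key, (left_value, small_table[key]))]
    return []


def run_map_side_join():
    # Imported here so join_record stays usable without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Hypothetical data: a larger RDD of orders and a small RDD of customers.
    orders = sc.parallelize([(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")])
    customers = sc.parallelize([(1, "alice"), (2, "bob")])

    # Collect the small side to the driver as a dict and broadcast it.
    # Assumption: customer keys are unique and the dict fits in memory.
    customers_bc = sc.broadcast(customers.collectAsMap())

    # Each task looks keys up locally, so the large RDD is never shuffled.
    joined = orders.flatMap(lambda rec: join_record(rec, customers_bc.value))
    return joined.collect()
```

The result has the same (key, (left, right)) shape that orders.join(customers) would produce for unique customer keys, but the regular join shuffles both RDDs by key while this version only ships the small dict to the executors.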
Suppose we have the following RDD, and we want to join it with another RDD.