In this post, I first give a workable example of running PySpark on Oozie. Then I show how to run PySpark on Oozie using your own Python installation (e.g., Anaconda), so that you can use numpy, pandas, and other Python libraries in your PySpark program.
The syntax for creating a Spark action in an Oozie workflow
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.3">
    ...
    <action name="[NODE-NAME]">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <job-xml>[SPARK SETTINGS FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <master>[SPARK MASTER URL]</master>
            <mode>[SPARK MODE]</mode>
            <name>[SPARK JOB NAME]</name>
            <class>[SPARK MAIN CLASS]</class>
            <jar>[SPARK DEPENDENCIES JAR / PYTHON FILE]</jar>
            <spark-opts>[SPARK-OPTIONS]</spark-opts>
            <arg>[ARG-VALUE]</arg>
            ...
            <arg>[ARG-VALUE]</arg>
            ...
        </spark>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
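To make the placeholders concrete, here is a minimal sketch of what a filled-in workflow might look like for a PySpark job. The host names, paths, script name, and Spark options below are illustrative assumptions, not values from a real cluster; note that for a Python script the `<class>` element is omitted and `<jar>` points at the `.py` file.

```xml
<!-- Sketch only: ${jobTracker}/${nameNode} come from job.properties;
     my_script.py and the Spark options are assumed example values. -->
<workflow-app name="pyspark-example-wf" xmlns="uri:oozie:workflow:0.3">
    <start to="pyspark-node"/>
    <action name="pyspark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>pyspark-example</name>
            <!-- No <class> for Python; <jar> holds the script itself -->
            <jar>my_script.py</jar>
            <spark-opts>--num-executors 4 --executor-memory 2G</spark-opts>
            <arg>${nameNode}/user/me/input</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```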
As described in the Oozie documentation, the elements have the following meanings.
The prepare element, if present, specifies a list of paths to delete or create before starting the job. Specified paths must start with hdfs://HOST:PORT.
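For example, a prepare block that clears a previous output directory and creates a fresh staging directory might look like the fragment below (the host, port, and paths are illustrative assumptions):

```xml
<prepare>
    <!-- Delete stale output from a previous run so Spark can recreate it -->
    <delete path="hdfs://namenode.example.com:8020/user/me/output"/>
    <!-- Create an empty staging directory before the job starts -->
    <mkdir path="hdfs://namenode.example.com:8020/user/me/staging"/>
</prepare>
```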