  • Start Pig using a shell script

    Here is an example of starting a Pig script from a shell script:

    progname=$(basename "$0")

    # Various config params

    UDF_JARS_LOCATION=~/udf_jars

    DEFAULT_JOIN_NUM_REDUCERS=2048

    QUEUE=curveball_med

    # Prepare Pig command

    PIG_OPTS="-Dmapred.job.queue.name=${QUEUE} -Dmapred.child.java.opts=-Xmx2840m -Dmapred.job.map.memory.mb=4548 -Dmapred.job.reduce.memory.mb=4548 -Dmapreduce.jobtracker.split.metainfo.maxsize=20000000 -Dmapreduce.job.acl-view-job=* -Dmapreduce.job.acl-modify-job=* -Dmapreduce.job.classloader=false -Dio.sort.mb=1000 -Dio.sort.factor=100 -Dpig.stats.noTaskReport=true -Dexectype=mapreduce -Dmapred.cache.archives=hdfs…/xxx.tar.gz#achievePrefix -Dmapred.child.env=LD_LIBRARY_PATH=./achievePrefix/xxx/lib64"

    # Pig params

    PIG_PARAMS=" -param SOME_PATH=${SOME_PATH} -param DEFAULT_JOIN_NUM_REDUCERS=${DEFAULT_JOIN_NUM_REDUCERS} -param QUEUE=${QUEUE}"

    # Pig Jars location

    PIG_JARS_FALCON=" -Dpig.additional.jars=local…./apps/lib/xxxx.jar:local…/lib/commons-math-2.2.jar:$UDF_JARS_LOCATION/datafu-1.2.0.jar:$UDF_JARS_LOCATION/xxxxxx.jar:$UDF_JARS_LOCATION/zzzzz.jar"

    pig -useHCatalog -conf /…./hbase/hbase-site.xml -useversion 0.11 ${PIG_OPTS} ${PIG_PARAMS} ${PIG_JARS_FALCON} somePigScript.pig
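
    The script above depends on the shell splitting the unquoted ${PIG_OPTS} string into separate words, which also leaves values such as * open to filename expansion. One possible alternative, shown here only as a sketch, keeps each option as its own word by using positional parameters. The names run_pig and DRY_RUN are hypothetical, and only a few of the options from the script above are repeated.

    ```shell
    QUEUE=curveball_med
    DEFAULT_JOIN_NUM_REDUCERS=2048

    # run_pig and DRY_RUN are hypothetical names used only for this sketch.
    run_pig() {
        # "set --" replaces this function's positional parameters, so "$@"
        # later expands to exactly one word per option, with no globbing.
        set -- \
            -useHCatalog \
            "-Dmapred.job.queue.name=${QUEUE}" \
            "-Dmapreduce.job.acl-view-job=*" \
            -param "DEFAULT_JOIN_NUM_REDUCERS=${DEFAULT_JOIN_NUM_REDUCERS}" \
            -param "QUEUE=${QUEUE}" \
            somePigScript.pig

        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: print the command instead of submitting a job.
            printf '%s\n' "pig $*"
        else
            pig "$@"
        fi
    }

    run_pig
    ```

    With the default DRY_RUN=1 the function only prints the command line, which makes it easy to check the final options before a job is submitted.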

  • Pig: Container is running beyond physical memory

    Pig: Container is running beyond physical memory limits when using Oozie

    Here is a good answer from Stack Overflow: http://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits

    Looks like the default for YARN container size on that cluster (1 GB of RAM) is way too low for your job.

    But it’s not clear whether the YARN error shown relates to the Shell action (running grunt) or to a child MapReduce execution.

    Plan A – Assuming that it’s the child MapReduce execution that requires more RAM, raise the memory settings at the top of the Grunt script.

    Reference: Pig documentation.
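
    The exact statements from the quoted answer are not reproduced in this excerpt. As a rough sketch of what Plan A could look like, the shell function below prints Pig SET statements that raise the map and reduce container sizes. The sizes, the 80% heap ratio, and the names heap_mb and memory_settings are assumptions for illustration; the four property names are standard Hadoop 2 MapReduce settings.

    ```shell
    # Hypothetical sizes in MB; choose values within your cluster's YARN limits.
    MAP_MB=4096
    REDUCE_MB=4096

    # Heap is set to 80% of the container so non-heap memory has headroom
    # (a common rule of thumb, assumed here, not taken from the quoted answer).
    heap_mb() {
        echo $(( $1 * 8 / 10 ))
    }

    # Print SET statements intended for the top of a Pig script.
    memory_settings() {
        printf '%s\n' \
            "SET mapreduce.map.memory.mb ${MAP_MB};" \
            "SET mapreduce.reduce.memory.mb ${REDUCE_MB};" \
            "SET mapreduce.map.java.opts '-Xmx$(heap_mb "$MAP_MB")m';" \
            "SET mapreduce.reduce.java.opts '-Xmx$(heap_mb "$REDUCE_MB")m';"
    }

    memory_settings
    ```

    The printed lines can be pasted at the top of the Pig script, or redirected into a file that is concatenated with it before the job is launched.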

  • Save data to Hive table Using Apache Pig

    We have already described how to load data from a Hive table using Apache Pig. In this post, I will use an example to show how to save data to a Hive table using Pig.

    Before saving data to Hive, you first need to create a Hive table. Please refer to this post on how to create a Hive table.

    Suppose we use Apache Pig to load some data from a text file. We can then save the data to the Hive table using the following script.

    The store_student.pig script is like this:
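
    The original script is not included in this excerpt. A minimal sketch of what such a script might contain, wrapped in a shell function in keeping with the launch script earlier on this page, is shown below. The input file student.txt, its schema, the table default.student, and the function name store_student_script are all assumptions for illustration.

    ```shell
    # Print a hypothetical store_student.pig; the file, schema and table
    # names are illustrative and not taken from the original post.
    store_student_script() {
        printf '%s\n' \
            "-- Load comma-separated rows from a text file (assumed layout)." \
            "data = LOAD 'student.txt' USING PigStorage(',')" \
            "       AS (name:chararray, age:int, gpa:double);" \
            "" \
            "-- The Hive table must already exist; its name goes in single quotes." \
            "-- Older HCatalog releases use the package org.apache.hcatalog.pig instead." \
            "STORE data INTO 'default.student'" \
            "      USING org.apache.hive.hcatalog.pig.HCatStorer();"
    }

    store_student_script
    # To run it for real (not executed here):
    #   store_student_script > store_student.pig
    #   pig -useHCatalog store_student.pig
    ```

    The columns produced by the LOAD statement need to match the column names and types of the target Hive table for the store to succeed.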

    Note: You must specify the table name in single quotes: STORE data INTO 'tablename'.

  • Apache Pig Load ORC data from Hive Table

    In some cases your data is stored in a Hive table, and you may want to process it using Apache Pig. In this post, I use an example to describe how to read Hive ORC data using Apache Pig.

    1. First, we create a Hive table stored as ORC and load some data into it.
    2. Then, we develop an Apache Pig script to load the data from the Hive ORC table.
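
    The two steps above can be sketched as follows. This is only an outline under stated assumptions: the table student_orc, its columns, and the function names are hypothetical, and loading rows into the table is left out. Each shell function prints the statements for one step.

    ```shell
    # Step 1 (Hive): DDL for an ORC-backed table; table and columns are hypothetical.
    orc_table_ddl() {
        printf '%s\n' \
            "CREATE TABLE student_orc (name STRING, age INT, gpa DOUBLE)" \
            "STORED AS ORC;"
    }

    # Step 2 (Pig): read the table through HCatalog, which hides the ORC format.
    # Older HCatalog releases use the package org.apache.hcatalog.pig instead.
    load_orc_script() {
        printf '%s\n' \
            "data = LOAD 'default.student_orc'" \
            "       USING org.apache.hive.hcatalog.pig.HCatLoader();" \
            "DUMP data;"
    }

    orc_table_ddl
    load_orc_script
    # To run for real (not executed here):
    #   orc_table_ddl > create_student_orc.hql && hive -f create_student_orc.hql
    #   load_orc_script > load_student_orc.pig && pig -useHCatalog load_student_orc.pig
    ```

    Because the Pig script goes through HCatalog, it names only the table; the schema and the ORC storage format are taken from the Hive metastore.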

    Optimized Row Columnar (ORC) file format

    The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data.
