Apache Pig: Load ORC Data from a Hive Table


There are some cases where your data is stored in a Hive table and you may want to process it using Apache Pig. In this post, I use an example to describe how to read Hive ORC data using Apache Pig.

  1. We first create a Hive table stored as ORC, and load some data into the table.
  2. Then, we develop an Apache Pig script to load the data from the Hive ORC table.

Optimized Row Columnar (ORC) file format

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

Compared with the RCFile format, the ORC file format has many advantages, such as:

  • a single file as the output of each task, which reduces the NameNode’s load
  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
  • light-weight indexes stored within the file
    • skip row groups that don’t pass predicate filtering
    • seek to a given row
  • block-mode compression based on data type
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • concurrent reads of the same file using separate RecordReaders
  • ability to split files without scanning for markers
  • bound the amount of memory needed for reading or writing
  • metadata stored using Protocol Buffers, which allows addition and removal of fields

Create a Hive table using ORC as storage format

File formats are specified at the table (or partition) level. You can specify the ORC file format with HiveQL statements such as these:

  • CREATE TABLE ... STORED AS ORC
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc

The parameters are all placed in the TBLPROPERTIES clause. They are:

Key                        Default      Notes
orc.compress               ZLIB         high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size          262,144      number of bytes in each compression chunk
orc.stripe.size            67,108,864   number of bytes in each stripe
orc.row.index.stride       10,000       number of rows between index entries (must be >= 1000)
orc.create.index           true         whether to create row indexes
orc.bloom.filter.columns   ""           comma-separated list of column names for which a bloom filter should be created
orc.bloom.filter.fpp       0.05         false positive probability for bloom filter (must be > 0.0 and < 1.0)

The following example shows how to create a Hive ORC table using ZLIB compression.
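A minimal sketch: the database students_db and table student_orc reappear later in this post, but the column layout (id, name, age) is an assumption for illustration.

    -- The columns below are assumed for illustration; adjust to your data.
    CREATE DATABASE IF NOT EXISTS students_db;
    USE students_db;

    CREATE TABLE student_orc (
      id   INT,
      name STRING,
      age  INT
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "ZLIB");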

Then we load some data into it.

The method for loading a text file into an ORC Hive table is described in this post: How to load data from a text file to Hive table

There are two steps:

  1. Create a temporary Hive table stored as a text file, then load the text file into this table.
  2. Create a Hive table stored as ORC, then copy all the data from the text table to the ORC table.

Create a tmp table stored as a text file:
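A sketch, assuming the same three columns as the ORC table above and a comma-delimited input file:

    -- Temporary text-format table with the same assumed columns.
    CREATE TABLE student_txt (
      id   INT,
      name STRING,
      age  INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;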

Load data from the file into the text Hive table:
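For example (the local file path is a placeholder for your own input file):

    LOAD DATA LOCAL INPATH '/path/to/students.txt'
    OVERWRITE INTO TABLE student_txt;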

Copy the data from the text Hive table to the ORC Hive table:
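A simple INSERT ... SELECT does the copy; Hive rewrites the rows in ORC format on the way in:

    INSERT OVERWRITE TABLE student_orc
    SELECT * FROM student_txt;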

Using Apache Pig to Load Data from the Hive ORC Table

Running Pig with HCatalog

Pig does not automatically pick up HCatalog jars. To bring in the necessary jars, you can either use a flag in the pig command or set the environment variables PIG_CLASSPATH and PIG_OPTS as described below.

The -useHCatalog Flag

To bring in the appropriate jars for working with HCatalog, simply include the following flag:
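For example, when starting Pig:

    pig -useHCatalog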

For Pig commands that omit -useHCatalog, you need to tell Pig where to find your HCatalog jars and the Hive jars used by the HCatalog client. To do this, you must define the environment variable PIG_CLASSPATH with the appropriate jars.

HCatalog can tell you the jars it needs; to do so, it needs to know where Hadoop and Hive are installed. You also need to tell Pig the URI of your metastore, in the PIG_OPTS variable.

In the case where you have installed Hadoop and HCatalog via tar, you can do this:
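A sketch following the HCatalog documentation; the install paths, jar versions, and metastore host are placeholders for your environment:

    export HADOOP_HOME=<path_to_hadoop_install>
    export HIVE_HOME=<path_to_hive_install>
    # In recent tarball installs, HCatalog ships under the Hive directory.
    export HCAT_HOME=$HIVE_HOME/hcatalog

    export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-core-*.jar:\
    $HCAT_HOME/share/hcatalog/hcatalog-pig-adapter-*.jar:\
    $HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
    $HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
    $HIVE_HOME/lib/jdo-api-*.jar:$HIVE_HOME/conf

    export PIG_OPTS="-Dhive.metastore.uris=thrift://<metastore_host>:9083"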

Or you can pass the jars in your command line:
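Something like the following (again, jar paths vary by installation, and your_script.pig is a placeholder):

    pig -Dpig.additional.jars=\
    $HCAT_HOME/share/hcatalog/hcatalog-core-*.jar:\
    $HCAT_HOME/share/hcatalog/hcatalog-pig-adapter-*.jar:\
    $HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
    $HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar \
    your_script.pig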

In the Pig script, we can load the Hive table using org.apache.hive.hcatalog.pig.HCatLoader:
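A sketch of the script; the alias student_data and the DUMP statement are illustrative choices:

    -- student_orc.pig: read the Hive table through HCatalog
    student_data = LOAD 'students_db.student_orc'
                   USING org.apache.hive.hcatalog.pig.HCatLoader();

    -- print the tuples to stdout
    DUMP student_data;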

Here students_db is the database name, and student_orc is the Hive table.

Run the Pig program with the -useHCatalog flag:
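Assuming the script above is saved as student_orc.pig (a name chosen for this example):

    pig -useHCatalog student_orc.pig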

The output will be something like:
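With the hypothetical (id, name, age) rows used throughout this example, DUMP prints one tuple per row, in a form like:

    (1,John,20)
    (2,Mary,22)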

Our next post will describe how to save data into a Hive table using Apache Pig.