Apache Pig: Load ORC Data from a Hive Table
Sometimes your data is stored in a Hive table and you want to process it with Apache Pig. In this post, I use an example to show how to read Hive ORC data with Apache Pig:
- We first create a Hive table stored as ORC and load some data into it.
- Then we write an Apache Pig script that loads the data from the Hive ORC table.
Optimized Row Columnar (ORC) file format
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
Compared with the RCFile format, the ORC file format has many advantages, such as:
- a single file as the output of each task, which reduces the NameNode’s load
- Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
- light-weight indexes stored within the file
- skip row groups that don’t pass predicate filtering
- seek to a given row
- block-mode compression based on data type
- run-length encoding for integer columns
- dictionary encoding for string columns
- concurrent reads of the same file using separate RecordReaders
- ability to split files without scanning for markers
- bound the amount of memory needed for reading or writing
- metadata stored using Protocol Buffers, which allows addition and removal of fields
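To see these structures (stripes, indexes, encodings) for a concrete file, Hive ships an ORC file dump utility. A minimal sketch, assuming an ORC file already exists at the hypothetical warehouse path below:

```bash
# Print the ORC file's metadata: stripe layout, column statistics,
# compression, and encodings (the path is a hypothetical example).
hive --orcfiledump /user/hive/warehouse/student_orc/000000_0
```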
Create a Hive table using ORC as storage format
File formats are specified at the table (or partition) level. You can specify the ORC file format with HiveQL statements such as these:
```sql
CREATE TABLE ... STORED AS ORC
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
SET hive.default.fileformat=Orc
```
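For example, to migrate a single partition of an existing table to ORC without touching the others (the table `events` and its `dt` partition are hypothetical):

```sql
-- Only this partition switches to ORC; other partitions keep their format.
ALTER TABLE events PARTITION (dt='2016-01-01') SET FILEFORMAT ORC;
```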
The ORC-specific parameters are all placed in the TBLPROPERTIES clause. They are:
| Key | Default | Notes |
|---|---|---|
| orc.compress | ZLIB | high-level compression (one of NONE, ZLIB, SNAPPY) |
| orc.compress.size | 262,144 | number of bytes in each compression chunk |
| orc.stripe.size | 67,108,864 | number of bytes in each stripe |
| orc.row.index.stride | 10,000 | number of rows between index entries (must be >= 1000) |
| orc.create.index | true | whether to create row indexes |
| orc.bloom.filter.columns | "" | comma-separated list of column names for which a bloom filter should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filter (must be > 0.0 and < 1.0) |
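As an illustration, here is a hedged sketch that combines several of these properties at table-creation time (the table name `logs_orc` and its columns are hypothetical):

```sql
CREATE TABLE logs_orc(
  `id` int,
  `msg` string)
STORED AS orc
tblproperties (
  "orc.compress"="SNAPPY",           -- lighter compression than the ZLIB default
  "orc.bloom.filter.columns"="id",   -- build a bloom filter over the id column
  "orc.bloom.filter.fpp"="0.05");    -- bloom filter false positive probability
```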
The following example shows how to create a Hive ORC table using ZLIB compression.
```sql
CREATE TABLE student_orc(
  `id` int,
  `name` string,
  `age` int,
  `score` float)
COMMENT 'a student orc table using ZLIB compression method'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS orc
tblproperties ("orc.compress"="ZLIB");
```
Then we load some data into it.
```bash
-bash-4.1$ cat student.txt
1	san	19	89
2	LILI	23	90
3	Jim	20	99
```
The method for loading a text file into an ORC Hive table is described in this post: How to load data from a text file to Hive table
There are two steps:
- Create a temporary Hive table stored as a text file, and load the text file into it.
- Create a Hive table stored as ORC, then copy all the data from the text table into the ORC table.
Create the temporary table stored as a text file:
```sql
CREATE TABLE student_txt(
  `id` int,
  `name` string,
  `age` int,
  `score` float)
COMMENT 'a student table stored as text file'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS textfile;
```
Load the data file into the text Hive table:
```sql
hive> load data local inpath 'file:/homes/user/hive/student.txt' into table student_txt;
```
```sql
hive> select * from student_txt;
OK
1	san	19	89.0
2	LILI	23	90.0
3	Jim	20	99.0
Time taken: 0.252 seconds, Fetched: 3 row(s)
```
Copy the data from the text Hive table to the ORC Hive table.
```sql
hive> insert into student_orc select * from student_txt;
```
```sql
hive> select * from student_orc;
OK
1	san	19	89.0
2	LILI	23	90.0
3	Jim	20	99.0
Time taken: 0.069 seconds, Fetched: 3 row(s)
```
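To double-check that the data really landed in ORC format, you can ask Hive for the table's storage details. A minimal sketch, with the output abbreviated to the relevant lines:

```sql
hive> describe formatted student_orc;
-- Among the storage information you should see:
-- InputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-- OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
```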
Using Apache Pig to Load Data from the Hive ORC Table
Running Pig with HCatalog
Pig does not automatically pick up HCatalog jars. To bring in the necessary jars, you can either use a flag in the pig command or set the environment variables PIG_CLASSPATH and PIG_OPTS as described below.
The -useHCatalog Flag
To bring in the appropriate jars for working with HCatalog, simply include the following flag:
```bash
pig --useversion 0.11 -useHCatalog <file.pig>
```
For Pig commands that omit `-useHCatalog`, you need to tell Pig where to find your HCatalog jars and the Hive jars used by the HCatalog client. To do this, define the environment variable `PIG_CLASSPATH` with the appropriate jars. HCatalog can tell you the jars it needs; for that, it needs to know where Hadoop and Hive are installed. You also need to tell Pig the URI of your metastore, in the `PIG_OPTS` variable.
In the case where you have installed Hadoop and HCatalog via tar, you can do this:
```bash
export HADOOP_HOME=<path_to_hadoop_install>
export HCAT_HOME=<path_to_hcat_install>
export HIVE_HOME=<path_to_hive_install>

export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:\
$HIVE_HOME/lib/slf4j-api-*.jar

# The metastore is reached over its Thrift service.
export PIG_OPTS=-Dhive.metastore.uris=thrift://<hostname>:<port>
```
Or you can pass the jars in your command line:
```bash
<path_to_pig_install>/bin/pig -Dpig.additional.jars=\
$HCAT_HOME/share/hcatalog/hcatalog-*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/lib/slf4j-api-*.jar <script.pig>
```
In the Pig script, we can load the Hive table using `org.apache.hive.hcatalog.pig.HCatLoader`:
```pig
-- content of student.pig
student = LOAD 'students_db.student_orc' USING org.apache.hive.hcatalog.pig.HCatLoader();
dump student;
```
Here `students_db` is the database name, and `student_orc` is the Hive table.
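Because `HCatLoader` passes the Hive schema through to Pig, you can refer to columns by name right away. Here is a small hedged sketch extending student.pig (the score threshold is an arbitrary illustration):

```pig
-- Load via HCatalog, then filter and project using the Hive column names.
student = LOAD 'students_db.student_orc' USING org.apache.hive.hcatalog.pig.HCatLoader();
good    = FILTER student BY score >= 90.0F;   -- Hive float maps to Pig float
names   = FOREACH good GENERATE name, score;  -- keep just two columns
DUMP names;
```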
Run the Pig program with `-useHCatalog`:
```bash
pig --useversion 0.11 -useHCatalog student.pig
```
The output will be something like:
```text
(1,san,19,89.0)
(2,LILI,23,90.0)
(3,Jim,20,99.0)
```
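As a side note, `HCatLoader` also hands the Hive schema to Pig, so the loaded relation carries named, typed fields (Hive `string` becomes Pig `chararray`). A hedged sketch of what `DESCRIBE` would show in the grunt shell; the exact formatting may vary by version:

```pig
grunt> DESCRIBE student;
student: {id: int,name: chararray,age: int,score: float}
```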
Our next post will describe how to save data into a Hive table using Apache Pig.