Save Data to a Hive Table Using Apache Pig
In a previous post we described how to load data from a Hive table using Apache Pig. In this post, I will use an example to show how to save data to a Hive table using Pig.
Before saving data to Hive, you need to create the Hive table first. Please refer to this post on how to create a Hive table.
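For reference, a table matching the Pig schema used below could be created with something like this (a minimal sketch; the students_db database name and the ORC format are taken from the store statement further down, while the column types are assumptions mapped from the Pig schema):

CREATE DATABASE IF NOT EXISTS students_db;
-- column names/types mirror the Pig schema in store_student.pig below
CREATE TABLE students_db.student_orc (
  id INT,
  name STRING,
  age INT,
  score FLOAT
)
STORED AS ORC;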
Suppose we use Apache Pig to load some data from a text file; we can then save the data to the Hive table using the following script.
-bash-4.1$ cat student.txt
1	san	19	89
2	LILI	23	90
3	Jim	20	99
The store_student.pig script is like this:
data = LOAD 'student.txt' USING PigStorage('\t') AS (id:int, name:chararray, age:int, score:float);
STORE data INTO 'students_db.student_orc' USING org.apache.hive.hcatalog.pig.HCatStorer();
Note: You must specify the table name in single quotes: STORE data INTO 'tablename'.
Both the database and the table must be created prior to running your Pig script. If you are using a non-default database, you must specify your input as 'dbname.tablename'.
The Hive metastore lets you create tables without specifying a database; if you created tables this way, then the database name is ‘default’ and you do not need to specify the database name in the store statement.
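For example, if student_orc had been created in the default database, the store statement could simply omit the database prefix (the same store as above, just without 'students_db.'):

STORE data INTO 'student_orc' USING org.apache.hive.hcatalog.pig.HCatStorer();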
Exception when appending data to a non-partitioned table using HCatalog
Run the program using the following command:
pig -useHCatalog store_student.pig
You may get this error if your table is a non-partitioned table (or an external table):
Output Location Validation Failed for: 'students_db.student_orc More info to follow:
org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data
Immutable tables in HCatalog
HCatalog treats all tables as "immutable" – i.e. all tables and partitions can be written to only once, and not appended. This means:
- A non-partitioned table can be written to once, and the data in it is never updated from then on unless you drop and recreate it.
- A partitioned table may support "appending" of a sort by adding new partitions to the table, but once written, the partitions themselves cannot have any new data added to them.
So, to append data to a partitioned table, you write it into a new partition. The partition value can be passed directly to HCatStorer:
STORE data INTO 'hive_dbs.hive_table' USING org.apache.hive.hcatalog.pig.HCatStorer('datestamp=20160324');
Alternatively, you can leave the constructor empty and include the partition column as a field of the relation being stored:
STORE data INTO 'hive_dbs.hive_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
-- datestamp must be a field in the relation being stored (data)
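For that second form, the partition column has to be present in the schema of the relation you store, for example (a sketch; the input file name and the other fields are made up for illustration):

-- datestamp is loaded as an ordinary field; HCatStorer uses it as the partition value
data = LOAD 'events.txt' USING PigStorage('\t') AS (id:int, value:chararray, datestamp:chararray);
STORE data INTO 'hive_dbs.hive_table' USING org.apache.hive.hcatalog.pig.HCatStorer();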
Use HCatalog to save data into a Hive table from Apache Pig
Suppose we have a partitioned Hive table named imps_part (see how to create a partitioned Hive table and populate it), and we have some data like this:
id | user_id | user_lang | user_device | time_stamp | url | date | country |
9 | u3 | en | iphone | 201503250011 | http://xxx/xxx/9 | 20150325 | CA |
10 | u3 | en | ipad | 201503250111 | http://xxx/xxx/10 | 20150325 | CA |
11 | u4 | en | desktop | 201503270051 | http://xxx/xxx/11 | 20150327 | HK |
12 | u5 | en | iphone | 201503280021 | http://xxx/xxx/12 | 20150328 | US |
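For reference, the partitioned table could look roughly like this (a minimal sketch; the imps_db database name comes from the store statement below, while the ORC format and column types are assumptions based on the sample data above):

CREATE TABLE imps_db.imps_part (
  id INT,
  user_id STRING,
  user_lang STRING,
  user_device STRING,
  time_stamp STRING,
  url STRING
)
PARTITIONED BY (date STRING, country STRING)
STORED AS ORC;

Note that date and country are partition columns, so they appear only in the PARTITIONED BY clause, not in the regular column list.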
Using the following Pig script (store.pig), we can insert the data into the Hive table.
data = LOAD '$DATA' USING PigStorage(',') AS (id:int, user_id:chararray, user_lang:chararray, user_device:chararray, time_stamp:chararray, url:chararray, date:chararray, country:chararray);
STORE data INTO 'imps_db.imps_part' USING org.apache.hive.hcatalog.pig.HCatStorer();
We can run the program using the following command:
pig -useHCatalog -param DATA=/tmp/user/imps.txt store.pig
Unsupported Type Exception:
If you use BIGINT in the Hive table, you may get the following exception:
Unsupported type: 10 in Pig's schema
This occurs when the field is an int in Pig's schema and you try to store it into a Hive table where the corresponding column is a BIGINT. The solution is to change the table by replacing BIGINT with INT, after which the store works fine.
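In Hive, that change could look something like this (a sketch; mydb.mytable and the id column are placeholders for whichever column was declared as BIGINT):

-- placeholder names: replace with the actual table and BIGINT column
ALTER TABLE mydb.mytable CHANGE COLUMN id id INT;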
Append data to a partition that already contains data
Sometimes we want to append data to a partition that is not empty; in this case we will get the following exception:
Caused by: org.apache.hive.hcatalog.common.HCatException : 2002 : Partition already present with given partition key values : Data already exists in hdfs://xxxx:xxx/tmp/user/hivedb/imps_part/date=20150325/country=CA, duplicate publish not possible |
One workaround is to add a partition key called run_number, pass its value as a parameter via the command line, and set it in the store statement, like so:
CREATE TABLE stage.partition_pk (value STRING) PARTITIONED BY (date STRING, run_number STRING) STORED AS ORC;
LoadFile = LOAD 'xxx' USING PigStorage(',');
STORE LoadFile INTO 'stage.partition_pk' USING org.apache.hive.hcatalog.pig.HCatStorer('run_number=${run_number}');
If you set run_number to a value that is always new, such as a timestamp, you can always insert data into the table.
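The value can then be supplied on the command line when running the script, for example as a timestamp (a sketch; store_partition_pk.pig is just a placeholder script name):

pig -useHCatalog -param run_number=$(date +%Y%m%d%H%M%S) store_partition_pk.pig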