Save Data to a Hive Table Using Apache Pig


We have described how to load data from a Hive table using Apache Pig. In this post, I will use an example to show how to save data to a Hive table using Pig.

Before saving data to Hive, you first need to create the Hive table. Please refer to this post on how to create a Hive table.

Suppose we use Apache Pig to load some data from a text file; we can then save that data to the Hive table using the following script.

The store_student.pig script is like this:
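The original script is not reproduced here, so below is a minimal sketch; the input path, the schema, and the table name `student` are assumptions:

```pig
-- store_student.pig: a minimal sketch. The input path, the schema, and
-- the table name 'student' are assumptions; adjust them to your setup.
A = LOAD '/tmp/user/student.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int);

-- The table must already exist in Hive; run pig with -useHCatalog.
STORE A INTO 'student' USING org.apache.hive.hcatalog.pig.HCatStorer();
```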

Note: You must specify the table name in single quotes: STORE data INTO 'tablename'. Both the database and the table must be created before running your Pig script. If you are using a non-default database, you must specify the table as 'dbname.tablename'.

The Hive metastore lets you create tables without specifying a database; if you created your tables this way, the database name is 'default' and you do not need to specify the database name in the store statement.

Exception when appending data to a non-partitioned table using HCatalog

When you run the program, you may get the following error if your table is a non-partitioned table (or an external table):

Immutable tables in HCatalog

HCatalog currently treats all tables as “immutable” – i.e. all tables and partitions can be written to only once, and not appended. This means:
  • A non-partitioned table can be written to, and data in it is never updated from then on unless you drop and recreate.
  • A partitioned table supports a form of appending: new partitions can be added to the table, but once written, a partition cannot have any new data added to it.
Hive, on the other hand, does allow us to "INSERT INTO" a table, giving us append semantics.
To add one new partition to a partitioned table, specify the partition value in the store function. Note that the whole partition spec must be single-quoted, with the key and value separated by an equals sign:
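A sketch, assuming a relation `data` and a table `mytable` partitioned by a `date` column (both names are hypothetical):

```pig
-- Write all rows into the single partition date=20150325.
-- 'data' and 'mytable' are hypothetical names.
STORE data INTO 'mytable'
    USING org.apache.hive.hcatalog.pig.HCatStorer('date=20150325');
```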
To write into multiple partitions at once, make sure the partition column is present in your data, then call HCatStorer with no arguments:
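A sketch under the same hypothetical names; here `date` is a column of the data itself, and HCatStorer routes each row to its partition:

```pig
-- No partition spec: HCatStorer reads the partition column(s)
-- from the data and writes each row to the matching partition.
STORE data INTO 'mytable' USING org.apache.hive.hcatalog.pig.HCatStorer();
```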

Use HCatalog to save data into a Hive table from Apache Pig

Suppose we have a partitioned Hive table named imps_part (see how to create a partitioned Hive table and populate it), and we have some data like this:

id  user_id  user_lang  user_device  time_stamp    url                date      country
9   u3       en         iphone       201503250011  http://xxx/xxx/9   20150325  CA
10  u3       en         ipad         201503250111  http://xxx/xxx/10  20150325  CA
11  u4       en         desktop      201503270051  http://xxx/xxx/11  20150327  HK
12  u5       en         iphone       201503280021  http://xxx/xxx/12  20150328  US

Using the following Pig script (store.pig), we can insert the data into the Hive table.
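The original store.pig is not shown, so here is a sketch based on the sample data above; the field types, and the assumption that date and country are the partition columns, are mine:

```pig
-- store.pig: a sketch. $DATA is passed on the command line;
-- the schema mirrors the sample rows above.
imps = LOAD '$DATA' USING PigStorage('\t')
    AS (id:int, user_id:chararray, user_lang:chararray,
        user_device:chararray, time_stamp:chararray, url:chararray,
        date:chararray, country:chararray);

-- Assuming imps_part is partitioned by (date, country): the partition
-- columns stay at the end of the data, so HCatStorer takes no arguments.
STORE imps INTO 'imps_part' USING org.apache.hive.hcatalog.pig.HCatStorer();
```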

We can run the program using the following command:

pig -useHCatalog -param DATA=/tmp/user/imps.txt store.pig


Unsupported Type Exception:

If you use BIGINT in the Hive table, you may get an unsupported-type exception.

This happens when the Pig script produces an INT and tries to store it in a Hive column declared as BIGINT.

The solution is to change the table, replacing BIGINT with INT; after that, the store works.

Appending data to a partition that already contains data

Sometimes we want to append data to a partition that is not empty; because HCatalog treats written partitions as immutable, this fails with an exception.

One workaround is to add a partition key called run_number, pass its value via the command line, and set it in the store statement, like so:

CREATE TABLE stage.partition_pk (value STRING) PARTITIONED BY (date STRING, run_number STRING) STORED AS ORC;

LoadFile = LOAD 'xxx' USING PigStorage(',');

STORE LoadFile INTO 'stage.partition_pk' USING org.apache.hive.hcatalog.pig.HCatStorer('run_number=${run_number}');

If you set run_number to a timestamp, each run writes to a new partition, so you can always insert data into the table.