With Hive, appending data to a table is easy. But when we use Pig (i.e., HCatalog) to insert data into a Hive table, we are not allowed to append data to a partition that already contains data.
In this post, I describe a method for appending data to an existing partition by adding a dummy partition column named run. Its value is the run number, i.e., which append run wrote that data into the partition.
For example, we create the following partitioned Hive table:
CREATE TABLE table_part (
  id INT,
  user_id String,
  url String
)
PARTITIONED BY (date STRING, run STRING)
row format delimited fields terminated by ','
stored as textfile;
The Pig script then looks like the following:
-bash-4.1$ cat save_simple.pig
data = load 'hdfs://hostname:port/tmp/user/simple.txt' Using PigStorage(',') as (id:int, user_id:chararray, url:chararray);
--dump data;
store data into 'cb_mappi_db.table_part' using org.apache.hive.hcatalog.pig.HCatStorer('date=${DATE}, run=${RUN}');
Now we can run the Pig script using the following command:
-bash-4.1$ cat run_save_simple.sh
pig -useHCatalog -param DATE=20160605 \
-param RUN=1 \
save_simple.pig
Then we have the following content in the table:
hive> select * from table_part;
select * from table_part;
OK
1 u1 url1 20160605 1
2 u2 url2 20160605 1
1 u1 url2 20160605 1
2 u3 url4 20160605 1
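Each time you want to append data to the partition DATE=20160605, run the script again with a new RUN value (for example, RUN=2), so that the new data is written to a new run sub-partition.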
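Picking the next run value by hand gets tedious, so here is a minimal sketch of how it could be automated. It is not part of the original scripts: the function names next_run and append_to_date are hypothetical, and it assumes that SHOW PARTITIONS prints one partition per line in the form date=20160605/run=1 and that every run value is a plain integer.

```shell
#!/usr/bin/env bash

# next_run (hypothetical helper): reads partition names on stdin, one per
# line, e.g. "date=20160605/run=1", and prints the next unused run number
# for the given date. Prints 1 if the date has no partitions yet.
next_run() {
  local date="$1"
  awk -F'/' -v want="date=${date}" '
    $1 == want {
      split($2, kv, "=")                 # kv[2] holds the run value
      if (kv[2] + 0 > max) max = kv[2] + 0
    }
    END { print max + 1 }
  '
}

# append_to_date (hypothetical helper): looks up the existing partitions in
# Hive, then runs the Pig script from this post with the next run number.
# Assumes the table lives in the cb_mappi_db database, as in save_simple.pig.
append_to_date() {
  local date="$1" run
  run=$(hive -S -e "USE cb_mappi_db; SHOW PARTITIONS table_part;" | next_run "$date")
  pig -useHCatalog -param DATE="$date" -param RUN="$run" save_simple.pig
}
```

Calling append_to_date 20160605 after the first load above would run Pig with RUN=2. The sketch does not guard against two jobs picking the same run number at the same time, and it does not check whether the hive command itself failed, so treat it as a starting point only.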