An Example to Create a Partitioned Hive Table
Partitioning is a very useful feature of Hive. Without partitions, it is hard to reuse a Hive table when you use HCatalog to store data to it from Apache Pig, as you will get an exception when you insert data into a non-partitioned Hive table that is not empty.
In this post, I use an example to show how to create a partitioned table and how to populate it with data.
Let’s suppose you have a dataset of user impressions. A sample of the data set might look like this:
id | user_id | user_lang | user_device | time_stamp | url | date | country
1 | u1 | en | iphone | 201503210011 | http://xxx/xxx/1 | 20150321 | US
2 | u1 | en | ipad | 201503220111 | http://xxx/xxx/2 | 20150322 | US
3 | u2 | en | desktop | 201503210051 | http://xxx/xxx/3 | 20150321 | CA
4 | u3 | en | iphone | 201503230021 | http://xxx/xxx/4 | 20150323 | HK
If you use Pig to analyze the data, you can store the data in different folders as HDFS files. For instance, we may have the following three folders, each of which stores the impressions data for one day:
$HDFS_PATH/20150321, $HDFS_PATH/20150322 and $HDFS_PATH/20150323
When we use Pig to load the data, we can use the date to determine which part of the data should be loaded, by specifying a path like:
$HDFS_PATH/{20150321,20150322}
In Hive, we can accomplish the same task using the partition feature.
Here is the query to create a partitioned Hive Table:
CREATE TABLE imps_part (
    id INT,
    user_id STRING,
    user_lang STRING,
    user_device STRING,
    time_stamp STRING,
    url STRING
)
PARTITIONED BY (date STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
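To double-check that date and country ended up as partition columns rather than regular data columns, we can describe the table. The exact output varies by Hive version, but roughly:

-- Hive lists the regular columns first, followed by a
-- "# Partition Information" section listing date and country.
DESCRIBE imps_part;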
Be careful: while the partitioning key columns are a part of the table DDL, they are only listed in the PARTITIONED BY clause. In Hive, as data is written to disk, each partition of data is automatically split out into a different folder, e.g. date=20150321/country=US. During a read operation, Hive uses the folder structure to quickly locate the right partitions, and it also returns the partitioning columns as columns in the result set.
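For example, a query that filters on the partition columns lets Hive skip all other folders entirely. A minimal sketch against the table defined above:

-- Because date and country are partition keys, Hive only reads the
-- folder date=20150321/country=US instead of scanning the whole table.
SELECT id, user_id, url
FROM imps_part
WHERE date = '20150321' AND country = 'US';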
This approach saves space on disk, and partition elimination makes queries fast. The downside is that we have to tell Hive which partition we are loading in a query. To add data to the partition for date '20150321' and country 'US', we have to write a query like the one shown below.
First, it is easy to load the data into a non-partitioned table; we can then use that table to populate the partitioned one.
Suppose we have the non-partitioned table imps created using the following query:
CREATE TABLE imps (
    id INT,
    user_id STRING,
    user_lang STRING,
    user_device STRING,
    time_stamp STRING,
    url STRING,
    date STRING,
    country STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
We can load the above data into the Hive table imps using the following commands:
USE students_db;
LOAD DATA LOCAL INPATH 'file:/homes/user/hive/imprs.txt' INTO TABLE imps;
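For reference, imprs.txt would hold the sample rows shown above in comma-delimited form, matching the FIELDS TERMINATED BY ',' clause; something like:

1,u1,en,iphone,201503210011,http://xxx/xxx/1,20150321,US
2,u1,en,ipad,201503220111,http://xxx/xxx/2,20150322,US
3,u2,en,desktop,201503210051,http://xxx/xxx/3,20150321,CA
4,u3,en,iphone,201503230021,http://xxx/xxx/4,20150323,HK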
Then we can load the data for one partition into the partitioned Hive table:
INSERT INTO imps_part PARTITION (date = '20150321', country = 'US')
SELECT id, user_id, user_lang, user_device, time_stamp, url
FROM imps
WHERE date = '20150321' AND country = 'US';
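We can verify that the partition was created with SHOW PARTITIONS, which prints one line per partition directory; at this point it should show something like:

SHOW PARTITIONS imps_part;
-- date=20150321/country=US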
Now we have one row in the table:
hive> select * from imps_part;
OK
1    u1    en    iphone    201503210011    http://xxx/xxx/1    20150321    US
Time taken: 0.076 seconds, Fetched: 1 row(s)
The path for the Hive table looks like this:
-bash-4.1$ hdfs dfs -ls /tmp/user/hivedb/imps_part/date=20150321/country=US
Found 2 items
-rwxrwxrwx 3 /tmp/user/hivedb/imps_part/date=20150321/country=US/000000_0.deflate
-rwxrwxrwx 3 /tmp/user/hivedb/imps_part/date=20150321/country=US/000001_0.deflate
We can see that the two partition keys appear in the path of the data location.
STATIC PARTITIONING
Such a method is called static partitioning. It enables you to control which part of the data is inserted into the table. It works well when you are using Hive in production, but it is tedious to initially load a large data warehouse when you can only write to one partition at a time. There is a better way, called automatic partitioning.
AUTOMATIC PARTITIONING
To enable automatic partitioning, we need to run the following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Then it is easy to change the query to dynamically load partitions; for instance, we can load all the data in one run.
Let’s first clear out the row we inserted earlier. An INSERT OVERWRITE whose SELECT matches no rows empties the partition:
INSERT OVERWRITE TABLE imps_part PARTITION (date = '20150321', country = 'US')
SELECT id, user_id, user_lang, user_device, time_stamp, url
FROM imps_part
WHERE 1 = 0;
Then we run the following command to insert all the data in the non-partitioned Hive table (imps) into the partitioned Hive table (imps_part):
INSERT INTO imps_part PARTITION (date, country)
SELECT id, user_id, user_lang, user_device, time_stamp, url, date, country
FROM imps;
Exception in strict mode: if you have not run SET hive.exec.dynamic.partition.mode = nonstrict;, you will get the following exception:

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
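As the message says, strict mode only requires that at least one partition column be static. So as an alternative to switching the mode, a mixed form like the following sketch would also be accepted (dynamic partition columns must come last in the SELECT list):

-- date is static, country is dynamic; this satisfies strict mode.
INSERT INTO imps_part PARTITION (date = '20150321', country)
SELECT id, user_id, user_lang, user_device, time_stamp, url, country
FROM imps
WHERE date = '20150321';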
After setting the mode to nonstrict, we can see that the data is successfully populated into the partitioned table in just one run:
-bash-4.1$ hls /tmp/user/hivedb/imps_part/*/*/* | grep /tmp/user/hivedb/imps_part | awk '{print $8 }'
/tmp/user/hivedb/imps_part/date=20150321/country=CA/000000_0.deflate
/tmp/user/hivedb/imps_part/date=20150321/country=US/000000_0.deflate
/tmp/user/hivedb/imps_part/date=20150322/country=US/000000_0.deflate
/tmp/user/hivedb/imps_part/date=20150323/country=HK/000000_0.deflate
/tmp/user/hivedb/imps_part/date=date/country=country/000000_0.deflate
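The stray date=date/country=country partition most likely comes from a header row in the input file being loaded as data. Assuming that is the case, it can be cleaned up by dropping that partition:

-- Remove the partition created from the header row (assumption: the
-- input file contained a header line that was loaded as data).
ALTER TABLE imps_part DROP IF EXISTS PARTITION (date = 'date', country = 'country');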
Now we can even overwrite the data for a single partition without touching the rest of the table, which cannot be done with a non-partitioned table:
hive> INSERT OVERWRITE TABLE imps_part PARTITION (date = '20150322', country = 'US')
    > SELECT id, user_id, user_lang, user_device, time_stamp, url
    > FROM imps
    > WHERE date = '20150322' AND country = 'US';