An Example to Create a Partitioned Hive Table

Tags: ,

Partition is a very useful feature of Hive. Without partition, it is hard to reuse the Hive Table if you use HCatalog to store data to Hive table using Apache Pig, as you will get exceptions when you insert data to a non-partitioned Hive Table that is not empty

 In this post, I use an example to show how to create a partitioned table, and populate data into it. 

Let’s suppose you have a dataset for user impressions. For instance, a sample of the data set might be like this:

id user_id user_lang user_device time_stamp url date country
1 u1 en iphone 201503210011 http://xxx/xxx/1 20150321 US
2 u1 en ipad 201503220111 http://xxx/xxx/2 20150322  US
3 u2 en desktop 201503210051 http://xxx/xxx/3  20150321 CA
4 u3 en iphone 201503230021 http://xxx/xxx/4  20150323 HK

If you use Pig to analyze the data, you can store the data into different folders as Hadoop hdfs files. For instance, we may have the following three folders, each folder stores the corresponding impressions data. 

When we use Pig to load the data, we can use the date to determine which part of the data should be loaded by specifying the path like 

In Hive, we can accomplish such a task use the partition feature of Hive. 

Here is the query to create a partitioned Hive Table:

Be careful: While the partitioning key columns are a part of the table DDL, they’re only listed in the PARTITIONED BY clause.  In Hive, as data is written to disk, each partition of data will be automatically split out into different folders, e.g.date=20150321/country=us. During a read operation, Hive will use the folder structure to quickly locate the right partitions and also return the partitioning columns as columns in the result set.

This approach can save space on disk and it can also be fast to perform partition elimination. The downside of this approach is that it’s necessary to tell Hive which partition we’re loading in a query. To add data to the partition for the date ‘20150321’, country with ‘US’, we have to write this query:

It is easy to load the data into a non partitioned table,  then we can use that table to populate the partitioned table. 

Suppose we have the non-partitioned table impr created using the following query:

We can load the above data into the Hive Table imps using the following command:

Then we can load all the data into the partitioned Hive Table:

Now we have one row in the table:

The path for the Hive Table looks like this:

We can see that the two partition names are in the path of the data location. 

Static PARTITIONING

Such a method is call static partitioning.  This method enable you control which part of data will be inserted into the table. It works well when you’re using Hive in production, but it is tedious to initially load a large data warehouse when you can only write to one partition at a time. There is a better way called automatic Partitioning.

AUTOMATIC PARTITIONING

To enable automatic partitioning, we need to run the following commands:

Then it’s easy to change the query to dynamically load partitions, for instance, we can load all the data in one run:

Let’s first delete the rows in the Hive Table:

Then we run the following commands to insert all the data in the non-partitioned Hive table (imps) into the partitioned Hive table (imps_part):

Exception in the strict model:

If you have not run: SET hive.exec.dynamic.partition.mode = nonstrict;You will get the flowing exception:FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

After set the model to nonstrict, we can see that the data are successfully populated into the partitioned-table in just one run.

Now we can even append data to the partitioned Hive table, which can not be done when using the none-partitioned table.