Best resources to learn Hive partitioning

Terminology

  • Static Partition (SP) columns: in DML/DDL involving multiple partitioning columns, the columns whose values are known at COMPILE TIME (given by user).
  • Dynamic Partition (DP) columns: columns whose values are only known at EXECUTION TIME.

Syntax

DP columns are specified the same way as SP columns – in the partition clause. The only difference is that DP columns do not have values, while SP columns do. In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns.

In INSERT … SELECT … queries, the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.

  • all DP columns – only allowed in nonstrict mode. In strict mode, we should throw an error. e.g.,
INSERT OVERWRITE TABLE T PARTITION (ds, hr)
SELECT key, value, ds, hr FROM srcpart WHERE ds is not null and hr>10;
  • mixed SP & DP columns. e.g.,
INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr)
SELECT key, value, /*ds,*/ hr FROM srcpart WHERE ds is not null and hr>10;
  • SP is a subpartition of a DP: should throw an error because partition column order determines directory hierarchy. We cannot change the hierarchy in DML. e.g.,
-- throw an exception
INSERT OVERWRITE TABLE T PARTITION (ds, hr = 11)
SELECT key, value, ds/*, hr*/ FROM srcpart WHERE ds is not null and hr=11;
  • multi-table insert. e.g.,
FROM S
INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr)
SELECT key, value, ds, hr FROM srcpart WHERE ds is not null and hr>10
INSERT OVERWRITE TABLE R PARTITION (ds='2010-03-03', hr=12)
SELECT key, value, ds, hr from srcpart where ds is not null and hr = 12;
  • CTAS – syntax is a little bit different from CTAS on non-partitioned tables, since the schema of the target table is not totally derived from the select-clause. We need to specify the schema including partitioning columns in the create-clause. e.g.,
CREATE TABLE T (key int, value string) PARTITIONED BY (ds string, hr int) AS
SELECT key, value, ds, hr+1 hr1 FROM srcpart WHERE ds is not null and hr>10;

The above example shows the case of all DP columns in CTAS. If you want to put a constant value in some partitioning column, you can specify it in the select-clause. e.g.,

CREATE TABLE T (key int, value string) PARTITIONED BY (ds string, hr int) AS
SELECT key, value, "2010-03-03", hr+1 hr1 FROM srcpart WHERE ds is not null and hr>10;

Design issues

1) Data type of the dynamic partitioning column:
A dynamic partitioning column could be the result of an expression. For example:

-- part_col is partitioning column
create table T as select a, concat("part_", part_col) from S where part_col is not null;

Although currently there is no restriction on the data type of the partitioning column, allowing non-primitive columns to be partitioning columns probably doesn’t make sense. The dynamic partitioning column’s type should be derived from the expression, and the data type has to be convertible to a string so that it can be saved as a directory name in HDFS.

2) Partitioning column value to directory name conversion:
After converting the column value to a string, we still need to convert the string value to a valid directory name. Some reasons are:

  • string length is unlimited in theory, but HDFS/local FS directory name length is limited.
  • the string value could contain special characters that are reserved in FS path names (such as ‘/’ or ‘..’).
  • what should we do for partition column ObjectInspector?

We need to define a UDF (say hive_qname_partition(T.part_col)) to take a primitive typed value and convert it to a qualified partition name.
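For illustration only: assuming a hypothetical table T2 partitioned by (part_col string) and the source table S from the expression example above, the proposed UDF (hive_qname_partition is a suggested name, not an existing Hive function) might be applied explicitly like this:

-- hypothetical: sanitize the expression before it becomes a directory name
insert overwrite table T2 partition (part_col)
select a, hive_qname_partition(concat("part_", part_col))
from S where part_col is not null;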

3) Due to 2), this dynamic partitioning scheme qualifies as a hash-based partitioning scheme, except that we define the hash function to be as close to the input value as possible. We should allow users to plug in their own UDF for the partition hash function. We will file a follow-up JIRA if there is sufficient interest.

4) If there are multiple partitioning columns, their order is significant since that translates to the directory structure in HDFS: partitioned by (ds string, dept int) implies a directory structure of ds=2009-02-26/dept=2. In DML or DDL involving a partitioned table, if only a subset of the partitioning columns are specified statically, we should throw an error whenever a static partitioning column appears below a dynamic one in the hierarchy, as in this example:

create table nzhang_part(a string) partitioned by (ds string, dept int);
-- throw an exception: the static column (dept) is below the dynamic column (ds)
insert overwrite table nzhang_part partition (ds, dept=1)
select a, ds/*, dept*/ from T where dept=1 and ds is not null;
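For contrast, a version that should be accepted keeps the static column above the dynamic one; this rewritten statement is an illustration, not part of the original design document:

-- ok: the static column (ds) is above the dynamic column (dept)
insert overwrite table nzhang_part partition (ds='2009-02-26', dept)
select a, dept from T where dept is not null;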
 https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions

An Introduction to Hive’s Partitioning

You’re probably thinking about building a data warehouse (just about every company is if they haven’t already). After reading “SQL Server Partitioning: Not the Best Practice for Anyone” and “Potential Problems with Partitioning” you’re wondering why anyone would partition their data: it can be harder to tune queries, indexes take up more space, and SQL Server’s partitioning requires Enterprise Edition on top of that expensive SAN you’re adding to cope with the extra space. Anyone who is looking at implementing table partitioning in SQL Server would do well to take a look at using Hive for their partitioned database.

Partitioning Functions

Setting up partitioning functions in SQL Server is a pain. It’s left up to the implementor to decide if the partition function should use range right or range left and how partitions will be swapped in and out. Writing robust partitioning functions is stressful the first time around. What if we didn’t have to define a partition function? What if the database knew how to handle partitioning for us? Hive does just that.

Rather than leave the table partitioning scheme up to the implementor, Hive makes it easy to specify an automatic partition scheme when the table is created:
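The original post’s DDL isn’t reproduced here; as a sketch, a hypothetical weblogs table whose partition columns match the country=US/year=2012/month=12/day=22 layout described below could be declared like this:

-- hypothetical table; the PARTITIONED BY clause is what matters for this discussion
CREATE TABLE weblogs (
  url   STRING,
  bytes INT
)
PARTITIONED BY (country STRING, year INT, month INT, day INT);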

As we load data it is written to the appropriate partition in the table. There’s no need to create partitions in advance or set up any kind of partition maintenance; Hive does the hard work for us. The hardest part is writing queries. It’s a rough life, eh?

You might have noticed that while the partitioning key columns are a part of the table DDL, they’re only listed in the PARTITIONED BY clause. This is very different from SQL Server where the partitioning key must be used everywhere in a partitioned table. In Hive, as data is written to disk, each partition of data will be automatically split out into different folders, e.g. country=US/year=2012/month=12/day=22. During a read operation, Hive will use the folder structure to quickly locate the right partitions and also return the partitioning columns as columns in the result set.

This approach means that we save a considerable amount of space on disk and it can be very fast to perform partition elimination. The downside of this approach is that it’s necessary to tell Hive which partition we’re loading in a query. To add data to the partition for the United States on December 22, 2012 we have to write this query:
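The post’s query isn’t reproduced here; against the hypothetical weblogs table sketched above (and a hypothetical staged_weblogs source table), it would look roughly like this:

-- every partition column is named explicitly (static), so only one partition is written
INSERT OVERWRITE TABLE weblogs
PARTITION (country='US', year=2012, month=12, day=22)
SELECT url, bytes
FROM staged_weblogs
WHERE country='US' AND year=2012 AND month=12 AND day=22;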

This is a somewhat inflexible, but effective, approach. Hive makes it difficult to accidentally create tens of thousands of partitions by forcing users to list the specific partition being loaded. This approach is great once you’re using Hive in production but it can be tedious to initially load a large data warehouse when you can only write to one partition at a time. There is a better way.

Automatic Partitioning

With a few quick changes it’s easy to configure Hive to support dynamic partition creation. Just as SQL Server has a SET command to change database options, Hive lets us change settings for a session using the SET command. Changing these settings permanently would require opening a text file and restarting the Hive cluster – it’s not a difficult change, but it’s outside of our scope.
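The snippet the post refers to isn’t included here; the two session settings that enable dynamic partition creation are:

-- allow dynamic partitions at all
SET hive.exec.dynamic.partition=true;
-- allow every partition column to be dynamic (no leading static column required)
SET hive.exec.dynamic.partition.mode=nonstrict;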

Once both of these settings are in place, it’s easy to change our query to dynamically load partitions. Instead of loading partitions one at a time, we can load an entire month or an entire country in one fell swoop:
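Again sketched against the hypothetical tables above, loading the whole of December 2012 in one statement might look like this, with day left dynamic:

-- country, year and month are static; day is resolved per row at run time
INSERT OVERWRITE TABLE weblogs
PARTITION (country='US', year=2012, month=12, day)
SELECT url, bytes, day
FROM staged_weblogs
WHERE country='US' AND year=2012 AND month=12;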

When inserting data into a partition, it’s necessary to include the partition columns as the last columns in the query. The column names in the source query don’t need to match the partition column names, but they really do need to be last – there’s no way to wire up Hive differently.

Be careful using dynamic partitions. Hive has some built-in limits on the number of partitions that can be dynamically created as well as limits on the total number of files that can exist within Hive. Creating many partitions at once will create a lot of files and creating a lot of files will use up memory in the Hadoop Name Node. All of these settings can be changed from their defaults, but those defaults exist to prevent a single INSERT from taking down your entire Hive cluster.
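The limits in question are ordinary session settings; the names below are the standard ones, while the values are illustrative rather than the defaults:

-- per-statement and per-node caps on dynamically created partitions, plus total files created
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET hive.exec.max.created.files=100000;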

If the number of partitions rises above a certain threshold (in part based on the number of underlying files), you can run into out of memory errors when MapReduce jobs are being generated. In these conditions, even simple SELECT statements can fail. Until the underlying problems are fixed, there are a few workarounds:

  1. Tune Java heap size (not for the faint of heart)
  2. Find ways to reduce the number of underlying files. This can happen by manipulating load processes to use a single reducer (see the sketch after this list).
  3. Modify the partition scheme to use fewer partitions. With a rough threshold of 10,000 partitions, most partition schemes can be accommodated. Remember – each partition should contain about 64MB of data, minimum.
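As a rough illustration of item 2, file counts can often be reduced by asking Hive to merge its output files, or by funnelling the load through a single reducer; these are plain session settings, shown here as one reasonable setup rather than a recipe:

-- merge the small files produced by map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- or, more bluntly, force the load through one reducer
SET mapred.reduce.tasks=1;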

What About Partition Swapping?

Much like SQL Server, Hive makes it possible to swap out partitions. Partition swapping is an important feature that makes it easy to change large amounts of data with a minimal impact on database performance. New aggregations can be prepared in the background and swapped into the live table once they’re ready.

How do we perform a partition swap with Hive? A first guess might be to use the INSERT OVERWRITE PARTITION command to replace all data in a partition. This works but it has the downside of deleting all of the data and then re-inserting it. Although Hive has no transaction log, we’ll still have to wait for the data to be queried and then written to disk. Your second guess might be to load data into a different location, drop the original partition, and then point Hive at the new data like this:
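The post’s example isn’t reproduced here; a sketch of the drop-and-repoint approach against the hypothetical weblogs table (the HDFS path is made up) would be:

-- drop the old partition, then register a new one pointing at the freshly loaded directory
ALTER TABLE weblogs DROP PARTITION (country='US', year=2012, month=12, day=22);
ALTER TABLE weblogs ADD PARTITION (country='US', year=2012, month=12, day=22)
  LOCATION '/data/weblogs/new/us_2012_12_22';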

It’s that easy: we’ve swapped out a partition in Hive and removed the old data in one step. Truthfully, there’s an even easier way using the SET LOCATION clause of ALTER TABLE.
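Sketched the same way (same made-up path), the ALTER TABLE … SET LOCATION form repoints the existing partition in a single statement:

-- only the metastore entry changes; the old files stay behind in HDFS
ALTER TABLE weblogs PARTITION (country='US', year=2012, month=12, day=22)
  SET LOCATION '/data/weblogs/new/us_2012_12_22';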

Just like that, the new partition will be used. There’s one downside to this approach – the old data will still exist in Hadoop; only the metadata will be changed. If we want to clear out the old data, it’s going to be necessary to drop down to HDFS commands and delete the old data out of Hadoop itself.

Is Hive Partitioning Right For You?

If you’re thinking about partitioning a relational database, you should give serious consideration to using partitioned tables in Hive. One of the advantages of Hive is that storage and performance can be scaled horizontally by adding more servers to the cluster – if you need more space, just add a server; if you need more computing power, just add a server. Hive’s approach to data skips some of the necessary costs of partitioning in SQL Server – there’s no Enterprise Edition to purchase, minimal query tuning involved (hint: you should almost always partition your data in Hive), and no expensive SAN to purchase. Keep in mind – you don’t need to use partitioning to get started with Hive, either. You can always partition after the fact.

For better or for worse – if you’re thinking about partitioning a data warehouse in SQL Server, you should think about using Hive instead. 

https://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/