Hive partitioning vs Bucketing
Hive Bucketing and Partitioning
To better understand how partitioning and bucketing works, please take a look at how data is stored in hive. Let’s say you have a table
CREATE TABLE mytable (
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
You insert some data into a partition for 2015-12-02. Hive will then store data in a directory hierarchy, such as:
As such, it is important to be careful when partitioning. As a general rule of thumb, when choosing a field for partitioning, the field should not have a high cardinality – the term ‘cardinality‘ refers to the number of possible values a field can have. For instance, if you have a ‘country’ field, the countries in the world are about 300, so cardinality would be ~300. For a field like ‘timestamp_ms’, which changes every millisecond, cardinality can be billions. The cardinality of the field relates to the number of directories that could be created on the file system. As an example, if you partition by employee_id and you have millions of employees, you may end up having millions of directories in your file system.
Clustering, aka bucketing, on the other hand, will result in a fixed number of files, since you specify the number of buckets. What hive will do is to take the field, calculate a hash and assign a record to that bucket.
What happens if you use e.g. 256 buckets and the field you’re bucketing on has a low cardinality (for instance, it’s a US state, so can be only 50 different values?
You’ll have 50 buckets with data, and 206 buckets with no data.
Can partitions dramatically cut the amount of data that is being queried?
In the example table, if you want to query only from a certain date forward, the partitioning by year/month/day is going to dramatically cut the amount of IO.
Can bucketing can speed up joins with other tables that have exactly the same bucketing?
In the above example, if you’re joining two tables on the same employee_id, hive can do the join bucket by bucket (even better if they’re already sorted by employee_id since it’s going to do a mergesort which works in linear time).
So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
Also, you can partition on multiple fields, with an order (year/month/day is a good example), while you can bucket on only one field.
Problem: How do I determine number of Buckets in Hive?
One of the things about buckets is that 1 bucket = at least 1 file in HDFS. So if you have a lot of small buckets, you have very inefficient storage of data resulting in a lot of unnecessary disk I/O.
Therefore, you will want your number of buckets to result in files that are about the same size as your HDFS block size, or bigger.
In addition, when performing INSERT INTO SELECT operations, each bucket will result in 1 reducer job. This is important to keep in mind with regards to performance.
Partitions & Buckets in #Hive
This post will introduce partitioning and bucketing as method for segmenting large data sets to improve query performance.
If you have previous experience working in the relational database world then inevitably the concept of partitions and partitioning is not new. Partitions are fundamentally horizontal slices of data which allow larges sets of data to be segmented into more manageable chunks.
Partitioning in this manner takes many different forms with boundaries defined on either a single or range of values for the one or more columns that act as a splitter. This is commonly seen in a data warehouse environments using dates (such as transaction or order date) and occasionally geography to partition large fact tables.
In SQL Server, this partitioning support (for single columns) is built in through the use of partitioning schemas and functions. For Hive, partitioning is also built into for both managed and external tables through the table definition as seen below.
CREATE TABLE customer (
PARTITION BY (
A couple of interesting things become apparent as you digest the sample above:
- Multiple ‘slicer’ or partitioning columns are supported (i.e. Region/Country)
- The partitioning columns do not need to be included in the table definition as the function both as the slicer and column. You will still be able to include them in the query projection.
Behind the scenes, the partitioning statement alters the way managed tables are stored. If you browse HDFS or whatever underlying file system you are using the normal file structure for a given database and table looks like .db/
or in the case of our example ERP.db/Customer. Data files are then written directly to the Customer directory.
When you define the table with partitions, the underlying structure is changed such that sub-directories are created for each slicer or partitioning column. If multiple slicers are specified, the sub-directories will be nested based on the order of of the columns in the definition statement. To provide a better visual picture, lets circle back to our Customer table with Region and Country partitions and see how the directories now enumerate:
Now when a Hive query is issued that contains either one or both of our partition slicers in the criteria or predicate (the WHERE clause) what effectively happens is partition elimination where only those data directories that are needed are scanned. In the event that the slicers are not used all directories will be scanned (i.e. a table scan) and the partitioning has no effect.
A few final important notes on partitioning:
- Be careful not to over partition your data within Hive. If your partitions are relatively small then the expense of recursing directories becomes more expensive than simply scanning the data. Likewise, partitions should be roughly similar in size to prevent a single long running thread from holding things up (see bucketing below).
- If your Hive table is extremely large and table scans (i.e. queries without partition slicers) need to be blocked you can put Hive into strict mode using the command below. In this mode, when a querying that would result in a table scan is issued an error is thrown back to the user.
- Partitioning works both managed and external tables.
- When loading partitioned data, you must explicitly or dynamically (dynamic partitions must be enabled, disabled by default) define the targeted partition.
Although you the term Bucketing may not be familiar to you, you are already familiar with the concept behind it. Much like partitioning, bucketing is a technique that allows you to cluster or segment large sets of data to optimize query performance. The syntax to create a table with bucketing is listed below:
CREATE TABLE order (
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
Now that you’ve see it let’s talk about how it works. Unlike partitioning where each value for the slicer or partition key gets its own space, in clustering a hash is taken for the field value and then distributed across buckets. In the example above, we created 25 buckets and are clustering on username. Each bucket then would contain multiple users, while restricting each username to a single bucket.
There’s a danger with this technique though. Bucketing is entirely dependent on data correctly being loaded to the table. To properly load data to a table utilizing this technique you need to either:
- Set the maximum number of reducers to the same number of buckets specified in the table metadata (i.e. 25)
1set map.reduce.tasks = 25
- Use the following command to enforce bucketing:
1set hive.enforce.bucketing = true