Partitioning and Bucketing in Hive: Which and when?
Lately, I've been getting my feet wet with Apache Hive. Two of the more interesting features I've come across so far have been partitioning and bucketing. These are two different ways of physically grouping data together in order to speed up later processing. In this article, we'll go over what exactly these operations do, what the differences are, and what impact they can have. All of this will be in the context of using Spark to perform some calculations on the data.
Partitioning in Hive is conceptually very simple: We define one or more columns to partition the data on, and then for each unique combination of values in those columns, Hive will create a subdirectory to store the relevant data in. The effect is similar to what can be achieved through indexing (providing an easy way to locate rows with a particular (combination of) value(s)), with the added bonus of having the data co-located in the same file for even faster and more granular access. When using Spark, partitioning also provides an easy and efficient way to distribute data to worker nodes, since the partitions already form (presumably) logical groups. This should cut down on the amount of data that will have to be shuffled around between workers, and therefore improve performance.
Clearly, the effectiveness of partitioning is heavily dependent on how the partition column(s) will be used. You may want to consider partitioning based on columns that you will be regularly grouping your data by, since you are effectively doing that work in advance by partitioning. Partitioning on columns you won't be grouping by doesn't make much sense, and will probably hurt performance because Spark will need to shuffle more data around.
In addition, partitioning is most effective when the number of rows in each partition is roughly similar. Highly skewed partitions will certain workers to have far more data to process than others, which is an inefficient use of resources.
Bucketing is a very similar concept, with some important differences. Here, we split the data into a fixed number of "buckets", according to a hash function over some set of columns. (When using both partitioning and bucketing, each partition will be split into an equal number of buckets.) Hive will guarantee that all rows which have the same hash will end up in the same bucket, but a single bucket may contain multiple such groups.
So why does that matter? The key observation is that because the number of buckets is fixed (per partition), having a large number of distinct values in the "bucketing columns" is not a problem, unlike when partitioning would be used. In fact, it will only cause data to be spread more evenly among buckets. At the same time, related data (as defined by the hash function, presumably) will still be co-located within the same bucket, preserving the performance benefit.
Highly skewed data is still an issue, although that can be mitigated somewhat by reducing the number of buckets. For example, if half of the data results in the same hash, you might consider using only two buckets: One for that very common value, and one for everything else.
Why bother? Well, let's take a look. I reused the same public dataset from Amazon containing book reviews which I have previously used in other articles. I stored three copies of this data, and registered each of them in the Hive metastore. The first was stored as a 'plain' table, without any partitioning or bucketing, just like in the previous articles. The second copy was partitioned by the rating the review gave (1–5 stars), and the final one was additionally bucketed by the review date.
So for the second version, this clause gets added to the table definition:
PARTITIONED BY (star_rating STRING)
And for the third:
PARTITIONED BY (star_rating STRING) CLUSTERED BY (review_date) INTO 100 BUCKETS
Next, I wrote a tiny little Spark job that would perform an aggregate computation by grouping over the rating and review date for each of the three tables.
df.groupBy("star_rating","review_date") .avg("product_parent") .collect()
On average, it took my laptop 37 seconds of cpu-melting work to do this for the 'plain' data. With the data partitioned by rating, it only took 30 seconds. That's already a pretty decent performance gain. With the data that was both partitioned by rating and bucketed by date, the average time was down to only 8 seconds.
So, optimising the way the data is stored for this specific query cut the computation time by almost 80%. Of course, this is a rather idealised scenario, and only a single benchmark. Nevertheless, such significant gains are nothing to scoff at.
So which is best? This shouldn't come as a surprise: it depends. Do you expect to have a relatively low number of distinct values in a column you will be using a lot in filters, especially when there should be a roughly similar number of rows for each value? Partitioning would be the best choice. If, instead, there will be a lot of distinct values which might not be as evenly distributed, consider bucketing instead. As we have seen, using these tools in the right way can provide massive boosts to performance, so it's well worth taking some time to consider how best to leverage them.