How to control the number of files of a Hive table using the INSERT command

The following article shows how to control the number of files of a Hive table when inserting data:

 

How to control the file numbers of hive table after inserting data on MapR-FS.

 http://www.openkb.info/2014/12/how-to-control-file-numbers-of-hive.html
A Hive table consists of files in HDFS (or MapR-FS); if one table or one partition has too many small files, HiveQL performance may be impacted.
It can take a long time to prepare a MapReduce job before submitting it, because Hive needs to fetch the metadata from each file.
This article explains how to control the number of files of a Hive table after inserting data on MapR-FS; put simply, it explains how many files will be generated for the “target” table by the HiveQL below.
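For illustration, assume a statement of the following form (the table names “target” and “source” are just placeholders):

    insert overwrite table target select * from source;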
The HiveQL above involves the following 2 major steps:

1. A MapReduce job (in this example, map only) reads the data from the “source” table.

The number of mappers determines the number of intermediate files, and the number of mappers is determined by the following 3 factors:

a. hive.input.format

Different input formats may start different numbers of mappers in this step.
The default value in Hive 0.13 is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
It combines all the files together and then tries to split, which improves performance when the table has many small files.
An older default value is org.apache.hadoop.hive.ql.io.HiveInputFormat, which splits each file separately. For example, if you have 10 small files and each file has only 1 row, Hive may spawn 10 mappers to read the whole table.
This article uses the default CombineHiveInputFormat as the example.
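The input format can be checked or set from the Hive session; for example (the value shown is the Hive 0.13 default):

    -- Show the current setting
    SET hive.input.format;
    -- Use the combining input format (the Hive 0.13 default)
    SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;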

b. File split size

mapred.max.split.size and mapred.min.split.size control the “target” file split size.
(In recent Hadoop 2.x releases those 2 parameters are deprecated; the new names are mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize.)
For example, if a Hive table has one 1GB file and the target split size is set to 100MB, 10 mappers MAY be spawned in this step. The reason for “MAY” is factor c below.
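For example, a 100MB target split size could be requested like this (values are in bytes; the mapreduce.* names apply to newer Hadoop 2.x releases):

    -- 100MB = 104857600 bytes
    SET mapred.max.split.size=104857600;
    SET mapred.min.split.size=104857600;
    -- Equivalent settings on newer Hadoop 2.x
    SET mapreduce.input.fileinputformat.split.maxsize=104857600;
    SET mapreduce.input.fileinputformat.split.minsize=104857600;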

c. MapR-FS chunk size

Files in MapR-FS are split into chunks (similar to Hadoop blocks) that are normally 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size.
The actual split size is max(target split size, chunk size).
Take the above 1GB file with a 100MB “target” split size as an example: if the chunk size is 200MB, the actual split size is 200MB and 5 mappers are spawned; if the chunk size is 50MB, the actual split size is 100MB and 10 mappers are spawned.
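On MapR-FS the chunk size is a directory attribute that applies to files written afterwards; it can be inspected and changed with the hadoop mfs command, for example (the warehouse path is illustrative):

    # Show files and their chunk sizes
    hadoop mfs -ls /user/hive/warehouse/source
    # Set the chunk size of the directory to 256MB (value in bytes)
    hadoop mfs -setchunksize 268435456 /user/hive/warehouse/source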

Lab time:

Imagine we have prepared 3 Hive tables of the same size, 644MB, with only 1 file per table.
The only difference is the chunk size of the 3 Hive tables:
source  — chunk size=8GB.
source2 — chunk size=256MB (the MapR-FS default).
source3 — chunk size=64KB (the minimum).

Then the question is: how many mappers will be spawned for the INSERT below, after setting the target split size to 100MB?
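(The INSERT in question would look roughly like the following, run against each of source, source2 and source3; the byte value requests the 100MB target split size.)

    SET mapred.max.split.size=104857600;
    SET mapred.min.split.size=104857600;
    insert overwrite table target select * from source;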

Results:
1. Table “source”
The whole 644MB table is in 1 chunk (8GB chunk size), so only 1 mapper is spawned.
2. Table “source2”
The whole 644MB table is in 3 chunks (256MB each), so 3 mappers are spawned.
3. Table “source3”
The whole 644MB table is in more than 10000 chunks (64KB each), and the target split size (100MB) is larger than the 64KB chunk size, so the actual split size is 100MB and 7 mappers are spawned.

Thinking along the same lines, if the target split size is 10GB, only 1 mapper will be spawned in the 1st step for all 3 tables above; if the target split size is 1MB, the mapper counts are source (1), source2 (3) and source3 (645).
After figuring out the 1st step, let’s move to the 2nd step.

2. Small file merge MapReduce job

After the 1st MapReduce job finishes, Hive decides whether it needs to start another MapReduce job to merge the intermediate files. If small file merge is disabled, the number of files in the target table is the same as the number of mappers from the 1st MapReduce job. The following 4 parameters determine whether and how Hive merges small files.

  • hive.merge.mapfiles — Merge small files at the end of a map-only job.
  • hive.merge.mapredfiles — Merge small files at the end of a map-reduce job.
  • hive.merge.size.per.task — Size of merged files at the end of the job.
  • hive.merge.smallfiles.avgsize — When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.

By default hive.merge.smallfiles.avgsize=16000000 and hive.merge.size.per.task=256000000, so if the average file size is about 17MB, the merge job will not be triggered. If we really want only 1 file to be generated in the end, we need to increase hive.merge.smallfiles.avgsize to a value large enough to trigger the merge, and also increase hive.merge.size.per.task to get the desired number of files in the end.
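For example, to force the merge job and end up with a single large file, one might raise both values well above the expected table size (1GB here is only an illustration):

    -- Merge after map-only and map-reduce jobs
    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;
    -- Trigger the merge whenever the average output file is below 1GB
    SET hive.merge.smallfiles.avgsize=1073741824;
    -- Aim for roughly 1GB merged files
    SET hive.merge.size.per.task=1073741824;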

Quiz time:

In Hive 0.13 on MapR-FS with otherwise default configurations, how many files will be generated in the end for each of the HiveQL examples below? Can you also guess the final file sizes?
Reminder: the chunk size for “source3” is 64KB.
1.
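(The original HiveQL is not reproduced here; judging from the answer below it would have been along these lines, with the exact byte values being assumptions.)

    -- 65MB target split size
    SET mapred.max.split.size=68157440;
    SET mapred.min.split.size=68157440;
    -- 100MB
    SET hive.merge.smallfiles.avgsize=104857600;
    -- 200MB
    SET hive.merge.size.per.task=209715200;
    insert overwrite table target select * from source3;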

2.
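(Again a reconstruction based on the answer below; the only difference from the first quiz is the larger hive.merge.smallfiles.avgsize.)

    -- 65MB target split size
    SET mapred.max.split.size=68157440;
    SET mapred.min.split.size=68157440;
    -- 270MB
    SET hive.merge.smallfiles.avgsize=283115520;
    -- 200MB
    SET hive.merge.size.per.task=209715200;
    insert overwrite table target select * from source3;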

Answers:
1. The target split size is 65MB and the chunk size is only 64KB, so the 1st job will spawn 10 mappers and each mapper will generate one intermediate file of roughly 65MB.
The merge job will be triggered because the average file size from the previous job (about 64MB) is less than 100MB (hive.merge.smallfiles.avgsize).
For each merge task, to reach the 200MB target (hive.merge.size.per.task), 4 x 65MB files will be merged into one 260MB file.
So in the end, 3 files will be generated for the target table: 644MB = 260MB + 260MB + 124MB.

2. The target split size is 65MB and the chunk size is only 64KB, so the 1st job will spawn 10 mappers and each mapper will generate one intermediate file of roughly 65MB.
The merge job will be triggered because the average file size from the previous job is less than 270MB (hive.merge.smallfiles.avgsize).
For each merge task, to reach the 200MB target (hive.merge.size.per.task), 4 x 65MB files *should* be merged into one 260MB file. However, the average file size would then still be less than 270MB (hive.merge.smallfiles.avgsize), so the merged files would still be considered “small files”.
In this case, 5 x 65MB files are merged into one 325MB file.
So in the end, 2 files will be generated for the target table: 644MB = 325MB + 319MB.

Key takeaways:

1. The MapR-FS chunk size and the target split size determine the number of mappers and the number of intermediate files.
2. The small file merge job controls the final number of files for the target table.
3. Too many small files in one table may introduce job performance overhead.

 
Here are some configurations that control the final number of files for Hive.
hive.optimize.sort.dynamic.partition
  • Default Value: true in Hive 0.13.0 and 0.13.1; false in Hive 0.14.0 and later (HIVE-8151)
  • Added In: Hive 0.13.0 with HIVE-6455

When enabled, the dynamic partitioning columns will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer, thereby reducing the memory pressure on reducers.
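For example, enabling it for a dynamic-partition insert might look like this (table, column and partition names are illustrative):

    SET hive.optimize.sort.dynamic.partition=true;
    -- Allow dynamic partitioning without a static partition column
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    insert overwrite table target partition (dt)
    select col1, col2, dt from source;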

hive.merge.smallfiles.avgsize
  • Default Value: 16000000
  • Added In: Hive 0.5.0

When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.

hive.exec.max.created.files
  • Default Value: 100000
  • Added In: Hive 0.7.0

Maximum number of HDFS files created by all mappers/reducers in a MapReduce job.

hive.merge.size.per.task
  • Default Value: 256000000
  • Added In: Hive 0.4.0

Size of merged files at the end of the job.

 

hive.merge.mapredfiles
  • Default Value: false
  • Added In: Hive 0.4.0

Merge small files at the end of a map-reduce job.
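As a recap, these settings might be combined as follows so that an insert ends up with fewer, larger files (all values and table names are illustrative):

    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;
    -- 256MB average-size threshold and merged-file target
    SET hive.merge.smallfiles.avgsize=268435456;
    SET hive.merge.size.per.task=268435456;
    -- Default cap on the number of files a single job may create
    SET hive.exec.max.created.files=100000;
    insert overwrite table target select * from source;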

Reference:

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties