• Best articles to learn Hive

    Hive Architectural Overview

    SQL queries are submitted to Hive and they are executed as follows:

    1. Hive compiles the query.

    2. An execution engine, such as Tez or MapReduce, executes the compiled query.

    3. The resource manager, YARN, allocates resources for applications across the cluster.

    4. The data that the query acts upon resides in HDFS (Hadoop Distributed File System). Supported data formats are ORC, Avro, Parquet, and text.

    5. Query results are then returned over a JDBC/ODBC connection.
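
    The engine in step 2 can be chosen per session. As a minimal sketch, assuming a hypothetical employees table with a city column, the following selects the engine and prints the plan Hive compiles in step 1 without running the query:

    ```sql
    -- hive.execution.engine accepts values such as mr or tez,
    -- depending on what the installation provides.
    SET hive.execution.engine=tez;

    -- EXPLAIN prints the compiled plan; the query itself is not executed.
    -- employees and city are hypothetical names used only for illustration.
    EXPLAIN
    SELECT city, COUNT(*) AS headcount
    FROM employees
    GROUP BY city;
    ```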

    A simplified view of this process is shown in the following figure.

    [Read More...]
  • Best resources to learn Hive partition


    • Static Partition (SP) columns: in DML/DDL involving multiple partitioning columns, the columns whose values are known at COMPILE TIME (given by user).
    • Dynamic Partition (DP) columns: columns whose values are only known at EXECUTION TIME.


    DP columns are specified the same way as SP columns: in the partition clause. The only difference is that DP columns do not have values, while SP columns do. In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns.
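
    As a minimal sketch of an insert that mixes SP and DP columns, assuming a hypothetical table page_views partitioned by (dt, country) and a hypothetical source table staging_page_views:

    ```sql
    -- Enable dynamic partitioning for the session.
    SET hive.exec.dynamic.partition=true;
    -- Strict mode requires at least one SP column; nonstrict also allows all-DP inserts.
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- dt is an SP column (its value is given in the clause).
    -- country is a DP column (no value); it is selected last, in the
    -- order in which the DP columns appear in the PARTITION clause.
    INSERT OVERWRITE TABLE page_views PARTITION (dt='2016-06-05', country)
    SELECT user_id, url, country
    FROM staging_page_views
    WHERE dt = '2016-06-05';
    ```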

    In INSERT …

    [Read More...]
  • How to set the file numbers of hive table using insert command

    Here are some articles that show how to control the number of files in a Hive table when inserting data:


    How to control the file numbers of hive table after inserting data on MapR-FS.
    A Hive table consists of files in HDFS. If a table or a partition has too many small files, HiveQL performance may suffer.
    Preparing a MapReduce job before submitting it can also take a long time, since Hive needs to get the metadata from each file.
    This article explains how to control the number of files in a Hive table after inserting data on MapR-FS.
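
    One common way to limit small files, not necessarily the method the linked article uses, is to let Hive merge small output files at the end of a job. A sketch using Hive's standard merge settings; the byte values are illustrative assumptions:

    ```sql
    -- Merge small files produced by map-only jobs.
    SET hive.merge.mapfiles=true;
    -- Merge small files produced by map-reduce jobs.
    SET hive.merge.mapredfiles=true;
    -- Start a merge when the average output file size is below this many bytes.
    SET hive.merge.smallfiles.avgsize=134217728;
    -- Target size, in bytes, of the merged files.
    SET hive.merge.size.per.task=268435456;
    ```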

    [Read More...]
  • Hive partitioning vs Bucketing

    Hive Bucketing and Partitioning

    To better understand how partitioning and bucketing work, take a look at how data is stored in Hive. Let’s say you have a table:

    CREATE TABLE mytable (
      name string,
      city string,
      employee_id int )
    -- (a line is missing from the excerpt here; the path below implies partition columns y, m, d)
    CLUSTERED BY (employee_id) INTO 256 BUCKETS

    You insert some data into a partition for 2015-12-02. Hive will then store data in a directory hierarchy, such as:

    /user/hive/warehouse/mytable/y=2015/m=12/d=02

    As such, it is important to be careful when partitioning.
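
    A minimal sketch of writing to and reading from that partition, assuming mytable is partitioned by string columns y, m and d (as the path suggests) and that staging_employees is a hypothetical source table:

    ```sql
    -- On Hive versions before 2.0, bucketing is enforced on insert only with:
    -- SET hive.enforce.bucketing=true;

    -- Static partition insert: rows land under .../mytable/y=2015/m=12/d=02,
    -- split into bucket files by the hash of employee_id.
    INSERT INTO TABLE mytable PARTITION (y='2015', m='12', d='02')
    SELECT name, city, employee_id
    FROM staging_employees;

    -- Filtering on the partition columns lets Hive read only that directory.
    SELECT name, city
    FROM mytable
    WHERE y = '2015' AND m = '12' AND d = '02';
    ```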

    [Read More...]
  • Move Hive Table from One Cluster to Another

    This tutorial uses examples to describe how to move a Hive table from one cluster to another. The basic idea is to use the EXPORT and IMPORT commands.

    The EXPORT command exports the data of a table or partition, along with the metadata, into a specified output location. This output location can then be moved over to a different Hadoop or Hive instance and imported from there with the IMPORT command.

    Export Syntax

    EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
      TO 'export_target_path' [ FOR replication('eventid') ]

    Import Syntax

    IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
      FROM 'source_path'
      [LOCATION 'import_target_path']

    Examples to Move Hive Table from one cluster (grid) to another

    Suppose you have two clusters: cluster A and cluster B.
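
    A minimal sketch of the flow between the two clusters, assuming a hypothetical table named employee, a hypothetical staging path /tmp/hive_export/employee, and hypothetical NameNode addresses nn-a and nn-b:

    ```sql
    -- On cluster A: write the table's data and metadata to the export directory.
    EXPORT TABLE employee TO '/tmp/hive_export/employee';

    -- Copy the export directory to cluster B, for example from a shell:
    --   hadoop distcp hdfs://nn-a:8020/tmp/hive_export/employee hdfs://nn-b:8020/tmp/hive_export/employee

    -- On cluster B: recreate the table from the copied directory.
    IMPORT TABLE employee FROM '/tmp/hive_export/employee';
    ```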

    [Read More...]
  • Append to a Hive partition from Pig

    When we use Hive, we can append data to a table easily. When we use Pig (through HCatalog) to insert data into a Hive table, however, we are not allowed to append data to a partition that already contains data.

    In this post, I describe a method that lets you append data to an existing partition by using a dummy partition named run. Its value is the run number, which identifies each append to the partition.

    For example, we create the following partitioned Hive table:

    The Pig script then looks like the following:

    Now we can run the Pig script using the following command:

    Then we have the following content in the table:

    Each time you want to append data to the partition DATE=20160605,

    [Read More...]
  • Set variable for hive script

    When we run Hive scripts, such as one that loads data into a Hive table, we often need to pass parameters to the scripts by defining our own variables.

    Here are some examples that show how to pass parameters or user-defined variables to Hive.

    Use hiveconf for variable substitution

    For example, you can define a variable DATE, then use it as ${hiveconf:DATE}.

    You can even pass the variable from the command line:
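
    A minimal sketch of both forms, assuming a hypothetical table impressions with a string column dt and a hypothetical script file query.hql:

    ```sql
    -- Define the variable inside the session or script...
    SET DATE=20160605;

    -- ...or leave out the SET line and supply the value when launching
    -- the script from a shell:
    --   hive --hiveconf DATE=20160605 -f query.hql

    -- Hive substitutes the variable textually before compiling the statement.
    SELECT *
    FROM impressions
    WHERE dt = '${hiveconf:DATE}';
    ```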

    Use env and system variables

    You can also use env and system variables, like this: ${env:USER}.

    You can run the following command to see all the available variables:

    If you are at the Hive prompt,

    [Read More...]
  • An Example to Create a Partitioned Hive Table

    Partitioning is a very useful feature of Hive. Without partitions, it is hard to reuse a Hive table if you use HCatalog to store data into it from Apache Pig, because you get exceptions when you insert data into a non-partitioned Hive table that is not empty.

    In this post, I use an example to show how to create a partitioned table and populate it with data.
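
    A minimal sketch of such a table, assuming a hypothetical table name user_impressions, hypothetical columns user_id and ad_id, a hypothetical partition column dt, tab-delimited input, and a hypothetical HDFS path:

    ```sql
    -- The partition column is declared in PARTITIONED BY, not in the column list.
    CREATE TABLE user_impressions (
      user_id STRING,
      ad_id   STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Load one day's file into its own partition; the file is moved
    -- into the table's dt=20160605 directory.
    LOAD DATA INPATH '/data/impressions/20160605.tsv'
    INTO TABLE user_impressions PARTITION (dt='20160605');
    ```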

    Let’s suppose you have a dataset of user impressions. For instance, a sample of the dataset might look like this:






    If you use Pig to analyze the data,

    [Read More...]
  • Exceptions When Delete rows from Hive Table

    It’s straightforward to delete data from a traditional relational table using SQL. However, deleting rows from a Hive table can cause several exceptions.

    For example, suppose we have an imps_part table and want to delete rows from it.

    When you run a simple DELETE command, you get: FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations
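
    Error 10294 means the session's transaction manager does not support row-level changes. A minimal sketch of a setup in which DELETE is accepted, assuming a hypothetical table imps_acid with hypothetical columns; the metastore also needs its compactor configured, which is not shown here:

    ```sql
    -- Use the transaction manager that supports UPDATE and DELETE.
    SET hive.support.concurrency=true;
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

    -- Transactional tables are stored as ORC; bucketing is required before Hive 3.
    CREATE TABLE imps_acid (
      user_id STRING,
      ad_id   STRING
    )
    CLUSTERED BY (user_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    DELETE FROM imps_acid WHERE user_id = 'u123';
    ```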


    Some people suggest using the following command:

    This will result in the following exception:
    FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned.
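
    Hive raises this error when an INSERT targets a partitioned table without a PARTITION clause. A minimal sketch of a rewrite that names the partition column, assuming imps_part has hypothetical columns user_id and ad_id and a single hypothetical partition column dt:

    ```sql
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Rewrite the partitions, keeping every row except the ones to delete.
    -- The partition column dt is named in the PARTITION clause and selected last.
    INSERT OVERWRITE TABLE imps_part PARTITION (dt)
    SELECT user_id, ad_id, dt
    FROM imps_part
    WHERE user_id <> 'u123';
    ```

    As far as I know, a partition whose rows are all filtered out produces no output and is left unchanged by this statement, so it would have to be dropped separately.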

    [Read More...]
  • Save data to Hive table Using Apache Pig

    We have described how to load data from a Hive table using Apache Pig. In this post, I will use an example to show how to save data to a Hive table using Pig.

    Before saving data to Hive, you first need to create a Hive table. Please refer to this post on how to create a Hive table.

    Suppose we use Apache Pig to load some data from a text file. We can then save the data to the Hive table using the following script.

    The store_student.pig script looks like this:

    Note: You must specify the table name in single quotes: STORE data into 'tablename'.

    [Read More...]