SQL, Hive, HBase and Impala

What is Impala?

Impala is an open source, massively parallel processing (MPP) query engine that runs on top of clustered systems like Apache Hadoop. It was created based on Google's Dremel paper. It is an interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS), which Impala uses as its underlying storage.

Impala integrates with the Hive metastore to share table information between the two components. It makes use of the existing Apache Hive infrastructure (initiated by Facebook and open sourced to Apache) that many Hadoop users already have in place for batch-oriented, long-running jobs in the form of SQL queries. Impala's main goal is to make SQL-on-Hadoop operations fast and efficient, to appeal to new categories of users and open up Hadoop to new types of use cases.

The Impala–Hive integration makes it possible to use either Hive or Impala for processing, and to create tables under the single shared file system (HDFS) without any changes to the table definitions. However, not all SQL queries are supported by Impala; there can be a few syntactical differences. Impala's SQL dialect is a subset of HiveQL, with some functional limitations such as transforms. Impala can also query HBase tables.

Like Hadoop, Impala is distributed across the cluster. It runs different types of daemons on specific hosts of the cluster, such as the Impala daemon (impalad), the statestore, and the catalog service, which we will discuss in the coming sections.


Relational Databases and Impala

Impala uses a query language that is similar to SQL and HiveQL. The following table describes some of the key differences between the SQL of relational databases and the Impala query language.

| Impala | Relational databases |
| --- | --- |
| Uses an SQL-like query language that is similar to HiveQL. | Use the SQL language. |
| Individual records cannot be updated or deleted. | It is possible to update or delete individual records. |
| Does not support transactions. | Support transactions. |
| Does not support indexing. | Support indexing. |
| Stores and manages large amounts of data (petabytes). | Handle smaller amounts of data (terabytes) when compared to Impala. |


Hive, HBase, and Impala

Though Cloudera Impala uses the same query language, metastore, and user interface as Hive, it differs from Hive and HBase in certain aspects. The following table presents a comparative analysis of HBase, Hive, and Impala.

| HBase | Hive | Impala |
| --- | --- | --- |
| A wide-column store database based on Apache Hadoop. It uses the concepts of BigTable. | Data warehouse software, built on Hadoop, for accessing and managing large distributed datasets. | A tool to manage and analyze data that is stored on Hadoop. |
| Data model is a wide column store. | Follows the relational model. | Follows the relational model. |
| Developed in Java. | Developed in Java. | Developed in C++. |
| Data model is schema-free. | Data model is schema-based. | Data model is schema-based. |
| Provides Java, RESTful, and Thrift APIs. | Provides JDBC, ODBC, and Thrift APIs. | Provides JDBC and ODBC APIs. |
| Supports programming languages like C, C#, C++, Groovy, Java, PHP, Python, and Scala. | Supports programming languages like C++, Java, PHP, and Python. | Supports all languages that have a JDBC/ODBC driver. |
| Provides support for triggers. | Does not provide any support for triggers. | Does not provide any support for triggers. |

All three of these systems −

  • Are NoSQL databases.
  • Are available as open source.
  • Support server-side scripting.
  • Follow ACID properties like Durability and Consistency.
  • Use sharding for partitioning.

Difference between Hive and Impala – Impala vs Hive

Benchmarks from both Cloudera (Impala's vendor) and AMPLab have shown Impala to have a performance lead over Hive. Benchmarks are notoriously prone to bias from minor software tricks and hardware settings, but this difference is observed consistently enough to be worth a deeper look. The following reasons come to the fore as possible causes:

  1. Cloudera Impala, being a native query engine, avoids the startup overhead commonly seen in MapReduce/Tez-based jobs (MapReduce programs take time before all nodes are running at full capacity). In Hive, every query suffers from this “cold start” problem, whereas Impala daemon processes are started at boot time and are always ready to process a query.
  2. Hadoop reuses JVM instances to partially reduce startup overhead, but this introduces another problem when large heaps are in use. The Cloudera benchmark machines have 384 GB of memory, which is a big challenge for the garbage collector of the reused JVM instances.
  3. MapReduce materializes all intermediate results, which enables better scalability and fault tolerance (while slowing down data processing). Impala streams intermediate results between executors (trading off scalability).
  4. Hive generates query expressions at compile time whereas Impala does runtime code generation for “big loops”.
  5. Apache Hive might not be ideal for interactive computing whereas Impala is meant for interactive computing.

  6. Hive is batch-oriented, based on Hadoop MapReduce, whereas Impala behaves more like an MPP database.

  7. Hive supports complex types but Impala does not.

  8. Apache Hive is fault tolerant whereas Impala is not. If a DataNode goes down while a Hive query is executing, the query still produces output, because Hive retries the failed work. That is not the case with Impala: if query execution fails in Impala, the query has to be started all over again.

  9. Hive transforms SQL queries into Apache Spark or Apache Hadoop jobs, making it a good choice for long-running ETL jobs where fault tolerance is desirable, because developers do not want to re-run a job that has already been executing for several hours.

Drawbacks of Impala

Some of the drawbacks of using Impala are as follows −

  • Impala does not provide any support for custom serialization and deserialization libraries (SerDes).
  • Impala can only read text files, not custom binary files.
  • Whenever new records or files are added to a table's data directory in HDFS, the table needs to be refreshed before the new data is visible.


Basic Commands and Syntax in Impala

Impala provides a command line interface (impala-shell) that returns interactive query results. Impala supports DDL and most DML operations (but no UPDATE). Any change made through Impala is reflected in the Hive metastore immediately with the help of the catalogd daemon. The reverse is not automatic: if a change is made directly in the Hive metastore or in the HDFS file system, the metadata must be reloaded by manually executing the “INVALIDATE METADATA” command.
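As a sketch of the metadata-reload commands described above (the table name is a hypothetical placeholder):

```sql
-- Pick up new data files added to an existing table from outside Impala
REFRESH my_db.my_table;

-- Reload table metadata after schema/DDL changes made directly in the Hive metastore
INVALIDATE METADATA my_db.my_table;
```

REFRESH is the cheaper operation and is sufficient when only data files changed; INVALIDATE METADATA discards and reloads the cached metadata entirely.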

Some of the basic command syntax is given below:

  • Impala-supported datatypes

Impala supports the common numeric datatypes (such as INT, BIGINT, FLOAT, DOUBLE, and DECIMAL), character datatypes (STRING, CHAR, VARCHAR), and the TIMESTAMP datatype for date/time values.

  • Impala table queries
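The original CREATE TABLE statement is not reproduced in the text; a minimal sketch of what it might look like is below. Only the table name DEZYRE_USER_INFO and the Parquet format come from the text; the five column names and types are hypothetical:

```sql
-- Hypothetical reconstruction: a five-column table stored as Parquet
CREATE TABLE DEZYRE_USER_INFO (
  user_id      BIGINT,
  user_name    STRING,
  topic        STRING,
  access_date  TIMESTAMP,
  time_spent   INT
)
STORED AS PARQUET;
```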

The above statement creates a table called “DEZYRE_USER_INFO” with five columns in it and the underlying file format is specified as PARQUET.
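A sampling query of the kind described next might look like this (table name from the text, otherwise hypothetical):

```sql
-- Return a sample of 10 rows from the table
SELECT * FROM DEZYRE_USER_INFO LIMIT 10;
```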

The above select clause returns a sample of 10 records from the Impala table.
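A per-user topic count with a date filter, as described next, might be sketched as follows (column names are hypothetical):

```sql
-- Distinct topics (subjects) followed or accessed per user in a given period
SELECT user_id,
       COUNT(DISTINCT topic) AS topics_followed
FROM DEZYRE_USER_INFO
WHERE access_date BETWEEN '2016-01-01' AND '2016-01-31'  -- filter condition
GROUP BY user_id;
```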

The above query gives the total number of topics (subjects) followed or accessed by each user (GROUP BY clause) for the given period (filter condition).
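The days-per-topic query described next might be sketched as follows (column names are hypothetical):

```sql
-- Number of distinct days each user spent on each topic
SELECT user_id,
       topic,
       COUNT(DISTINCT TO_DATE(access_date)) AS days_spent
FROM DEZYRE_USER_INFO
GROUP BY user_id, topic;
```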

The above query gives the number of days each user spent on each topic.
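The access-count query described next might be sketched as follows (column names are hypothetical):

```sql
-- Total accesses per user, most active users first
SELECT user_id,
       COUNT(*) AS total_accesses
FROM DEZYRE_USER_INFO
GROUP BY user_id
ORDER BY total_accesses DESC;
```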

The above query gives the total number of times each user accessed the dezyre server, sorted in descending order, so the most active users appear on top.

Impala has built-in support for all the common aggregation functions like SUM, AVG, MIN, MAX, and COUNT. It offers the flexibility of changing a value's datatype at runtime using the CAST function, and it has conditional functions for making decisions based on values dynamically. All the JOIN operations can be performed, though joining very large tables is not recommended because Impala performs operations in memory.
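The aggregate, CAST, and conditional functions just mentioned can be combined as in the following sketch (the table and column names are hypothetical):

```sql
-- Aggregates, a runtime CAST, and a conditional CASE expression in one query
SELECT user_id,
       SUM(time_spent)                 AS total_time,
       AVG(CAST(time_spent AS DOUBLE)) AS avg_time,
       CASE WHEN SUM(time_spent) > 1000 THEN 'heavy' ELSE 'light' END AS usage_class
FROM DEZYRE_USER_INFO
GROUP BY user_id;
```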

Impala also supports VIEWS, which act as named aliases for defined queries. If there is a complicated query that needs to run multiple times, a view can be created once and then queried directly, rather than executing the complicated query each time. The general syntax is:

CREATE VIEW view_name AS select_statement;
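For example, a view wrapping a complicated query might be created and used like this (view, table, and column names are hypothetical):

```sql
-- Wrap a complicated query in a view, then query the view instead
CREATE VIEW active_users AS
SELECT user_id, COUNT(*) AS total_accesses
FROM DEZYRE_USER_INFO
GROUP BY user_id
HAVING COUNT(*) > 100;

SELECT * FROM active_users ORDER BY total_accesses DESC;
```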