SQL on Hadoop The Differences and Making the Right Choice

SQL on hadoop grey

SQL on Hadoop : The Differences and Making  the Right Choice

Whether you are on Hadoop now or considering making the leap, this article will help you understand the various SQL on Hadoop tools, how they compare, and how they stack up against each other.  The powerful (and ever growing) ecosystem of Hadoop is, and has been, a clear leader in operational and exploratory data computation and analytics – especially when that data is Big Data. A clear and growing need has arisen out of this — that being, how to capitalize on the well known and expressive legacy of SQL and bring that into the new and modern Hadoop approach to data. This has been an evolving initiative over the last six years, originally beginning with Apache Hive and expanding outwards into different vendors: the combination of SQL and increased performance.

The strategies for working with Big Data are as numerous as the ways that data can manifest itself, and often times that is pretty specific to who is collecting the data, how it is stored, how it grows, and how it can be consumed and explored.

When approaching “SQL on Hadoop”, a lot of development has occurred in the last five or so years. There are really two perhaps distinct, though not exclusive camps: technologies that improve on Hive and those that can potentially augment or replace it. What follows in this post is a high level overview of three promising solutions of each variety — touching on the major questions that often come up when evaluating new solutions:

  1. What is the burden for onboarding and installation/upgrade?
  2. What does the architecture look like, in brief?
  3. What kind of performance boost can be expected from Hadoop native?
  4. What key features should I know about?
  5. What does the simple, great-fit use case look like?

Building on Hive

Thanks to the open back-end of Hive, a lot of solutions can simple capitalize on the HiveSQL query language by improving the execution engine and performance.

Hive on Apache Tez

Many of us are familiar with Hive, the original SQL solution for Hadoop. It’s simplistic, easy to use, and provided that much desired expression of SQL against Hadoop data. One of the more common complaints against Hive was that it is not highly performant — enter Tez. Tez is a developer API and framework allowing for native YARN applications to express computations as highly optimized and performant dataflow graphs. When embedded in Hive, the Hive + Tez combination gives us the well known Hive SQL expressive query language with the powerful Tez execution engine.

Installation / Onboarding
  • Minimal to moderate.
  • In existing Hive implementations, it’s merely a matter of enabling Tez in supported distributions, otherwise upgrade could be required.
  • The architecture of Hive on Tez vs that of Hive native, is mainly just the wholesale trade of expressing HiveSQL as distinct MapReduce jobs for that of a DAG (Directed Acyclical Graph) workflow, thus allowing for higher levels of parallelization, data routing, and DAG component reuse within and between computational executions.
  • Thanks to the alternative DAG workflow and other integral execution improvements, significant performance increases can be expected with Hive on Tez. Comparisons against equivalents such as Spark are conflicting as to which is more highly performing, though Tez is a more established solution to Hive execution at the current moment.
Key Features
  • Full Hive compatibility.
  • Built on YARN for resource management
  • Access to optimum Tez API for custom, non-SQL tasks
  • Can augment or replace existing Hadoop, Hive, and Pig systems
Use Case
  • New or migratable Hive systems where maintaining the use of Hive SQL is desired with considerable performance improvements
  • Additionally, existing and busy Hive systems where additional performance boosts and effective resource management benefits are desirable.

Hive on Apache Spark

The combination of SparkSQL and Apache Spark has become a growing interest and standard in the realm of distributed computing as a fast in-memory processing engine. Spark, and likewise SparkSQL, also maintain full compatibility with Hive. By installing Spark alongside Hive, you can gain access to all available Hive data while being able to reuse Hive queries and UDFs, executing them within the Spark execution engine. As noted before, one of the early complaints to a basic Hive setup was performance — Spark has demonstrable performance benefits allowing an improved Hive infrastructure augmented by the idiosyncratic and powerful computational paradigm which only Spark can give.

Installation / Onboarding
  • Moderate.
  • Installation is straight-forward, particularly alongside existing Hive and/or YARN based infrastructures.
  • Onboarding can be minimal at first (especially with Hive), though eventually mandating more in depth understanding of Spark execution and development APIs
  • Similar to the Hive on Tez solution, the Hive on Spark solution is, simplistically, a wholesale trade of the Hive native execution engine for that of Spark. Unlike Tez, however, Spark is not specifically tied to any component within the Hadoop ecosystem and can run all on its own. Hive SQL is translated into Spark jobs by way of a Hive specific SparkContext which coordinates fragment execution across independent worker nodes distributed in the cluster. In contrast to Hive native, Spark computation is mostly (or exclusively) in memory for significantly higher throughput.
  • As noted in the architecture, the advantage of the Spark execution engine is most pointedly demonstrated by the use of in memory computation whenever possible. Significant performance increases from Hive native can be expected. Comparisons between closest alternative, Hive on Tez, are conflicting as to which would be faster.
Key Features
  • Hive SQL compliant and easy reuse of Hive queries and UDFs
  • UDF support
  • SparkSQL alternative and/or extension
  • Multi-language API support (java, python, scala, etc)
  • High memory, low disk consumption
  • YARN compatible
  • MADlib integrations
Use Case
  • Existing or new Hive installations where considerable performance increases are desirable or where memory-mostly data computation is advantageous to disk-mostly data.
  • Additionally, infrastructures where combinations of Hive specific and Spark specific jobs will be advantageous.

Apache Impala (incubating)

Impala is an open source expansion of Hive SQL on top of the Hadoop system with particular emphasis on optimized data-local execution and concurrent , multi-user performance. Though not technically a back-end replacement for Hive, if you’re already using a Hive-powered system, making the switch to Impala is very streamlined and produces fairly immediate performance boons, especially in highly concurrent, multi-user systems. Impala supports querying data within HDFS, HBase, and other widely used file formats such as Avro, Parquet, RCFile, etc.

Installation / Onboarding
  • Minimal to moderate.
  • Does not necessitate additional services, though requires additional services: impalad (one per Hadoop datanode), catalogd and statestored (one per cluster).
  • Onboarding is relatively moderate requiring fluency in Hive SQL and Impala installation and management standards.
  • Impala, tightly integrated with HiveSQL, accepts queries via a JDBC or ODBC driver, and submits them to a daemon within the cluster for query planning and cost-based optimization against cached table statistics in a unified metadata catalogue and statestore. Query fragments are then distributed to daemons alongside HDFS and HBase data for local and native execution, circumventing the need for MapReduce and ETL.
  • Given the way that Impala will sidestep normal MapReduce and ETL for local and native scanning, claimed to be “near hardware speed” significant improvements to performance can be expected. Based on the comparisons reviewed with Drill, performance is fairly comparable.
Key Features
  • Hive SQL equivalent
  • Supports adding data by moving underlying files, INSERT … SELECT, or with LOAD DATA
  • Full transactional support is on the roadmap
  • UDFs (Java and C++) and UDAs (C++)
Use Case
  • Existing and/or new Hadoop specific systems where data is predominantly exploratory (read-mostly), requires high multi-user and concurrent performance, and desires high performance benefits with minimal installation or learning curve burdens.
  • Additionally, for existing Hive installations.

Augmenting or Replacing Hive

For implementations that may not need the full breadth of Hive functionality, or that may need to extend beyond it, there has also been some great work done for these use-cases.

Apache Phoenix

Apache Phoenix, initially begun with members of Salesforce and graduating to a full Apache project, is a lightweight Relational SQL Database layer over the HBase data store. If your data is already exclusively in HBase, there is almost no reason not to make the switch to using Phoenix, particularly if leveraging the expressive SQL syntax is an end goal. Due to the architecture of Phoenix, you can also expect an immediate performance boon over strict use of HBase native.

Installation / Onboarding
  • Minimal with no requirement for additional servers or services.
  • Prepackaged with HDP 2.1+ and Amazon EMR, otherwise Phoenix jar just needs to be in HBase lib and client classpath
  • Onboarding requires familiarity with Phoenix supported SQL
  • Phoenix, using a client-side parallelization and embedded JDBC Driver, parses, optimizes, fragments and distributes a query (and subsequently its query plan) as a series of low-level HBase scans locally to each node, thus abstracting much of HBase’s complexity and optimizing performance.
  • Due to the way that Phoenix will fragment and scan against HBase locally, performance increases can be expected when compared to HBase native as well as significant decreases in network bottlenecking.
Key Features
  • Follows ANSI SQL standards whenever possible.
  • Full read/write transactional support.
  • Supports composite keys and secondary indices.
  • Integrates with Apache Pig, Flume, Sqoop, Hive, and SparkSQL
Use Case
  • Existing and/or new HBase exclusive data where desire is for implementation, installation, and upgrade burden to be minimal as well as desired performance boons as compared to native HBase.
  • Additionally, where usage of Hadoop is operational. i.e. need for insert, update, and transactional support.

Apache HAWQ (incubating)

Apache HAWQ (incubating), recently brought into the Apache Foundation from the well-established Pivotal HAWQ project, is a parallel SQL query engine built on top of the HDFS, claiming to be the ‘worlds fastest SQL engine on Hadoop’. Additionally, HAWQ can run atop Pivotal’s HD and Hortonworks HDP, and by leveraging the Pivotal Extension Framework (PXF), HAWQ can be extended to integrate with data in HBase, Hive tables, Text, Avro, and Parquet data sources, as well as an exposed plugin API to query other parallel data stores or engines. The flexibility and expression of the highly SQL compliant HAWQ SQL execution engine coupled with the power of PXF seems to really push HAWQ HD to a total Hadoop SQL solution. Performance in the HAWQ approach is paramount, and a large amount of effort was put into designing an exceptionally performing distributed system.

Installation / Onboarding
  • Moderate to intensive.
  • Apache HAWQ will run atop existing Hadoop, Pivotal HD and Hortonworks HDP implementations and can also be managed by Ambari.
  • Requires additional configuration as HAWQ maintains its own master and metadata replication systems
  • Onboarding requires knowledge of HAWQ installation, setup, and management as well as familiarity with HAWQ supported SQL and PXF usage and configuration
  • HAWQ has one of the more elaborate architectures on this list. Queries are executed via a jdbc, odbc, or psql client and are sent to a HAWQ master. There they are parsed, optimized (including cost-based and resource-based), fragmented and dispatched to HAWQ segments across the nodes (or subset of nodes) for execution — all with tight integrations with YARN and its own fine-grained resource managements for overall performance and cluster stability and engrained fault tolerance systems.
  • Due to the high emphasis on performance in HAWQ’s design and a years of evolution as a Pivotal HD component, significant performance increases can be expected. Comparisons done by the Pivotal team show HAWQ well out-performing Impala and Hive.
Key Features
  • Full ANSI SQL compliance
  • Full read/write transactional support
  • JDBC/ODBC and client interfacing for most programming languages
  • High level of engrained fault-tolerance systems
  • MADlib native integrations
  • PXF extensions
Use Case
  • Advanced existing and/or new Hadoop (esp. Pivotal HD and Hortonworks HDP) mostly data where HDFS and HBase could need to be queried together, optimum performance is ideal, and extensibility into some external data sources is needed or anticipated.
  • Additionally, where usage of Hadoop is operational. i.e. need for insert, update, and transactional support.

Apache Drill

Apache Drill, inspired by Google’s Dremel (BigQuery), is a well known, distributed, data-source agnostic query engine capable of being overlaid on top of a myriad of structured, semi-structured, and nested data. While not exclusively tied to the Hadoop ecosystem, Drill provides a robust, low-latency SQL interface for accessing your data — the agnosticism and extensibility of its design giving the added bonus of spanning beyond Hadoop distributions into datastores such as MongoDB, Amazon S3, Google Cloud Storage, Azure Blog Storage, Swift and others via custom plugins. If you’re looking for a universal solution to access a wide variety of data in exclusive stores, Drill will provide the flexibility you need.

Installation / Onboarding
  • Moderate to intensive.
  • Can run in embedded modes or alongside existing Hadoop infrastructures. The configuration requires zookeeper connection details.
  • Complexity of installation and onboarding is proportional to the scope of it’s use — i.e. Hadoop only integration vs cross-platform integrations.
  • Unlike the others on this list, Drill is truly platform agnostic. However, that agnosticism can be expressed well in the Hadoop environment. Queries are accepted via a client (JDBC, ODBC, Command Line, or REST API) and delegated to a “drilling” Drillbit. The Drillbit then parses, plans, optimizes the query, fragmenting its execution components to individual nodes based on locality — this accomplished via a storage engine interface specific to the storage format or platform of the data at hand. The results are then streamed back to the client.
  • Due to the idiosynchratic execution design and the optimization of Hadoop storage interfaces, significant improvements to performance when compared to Hadoop native or comparable to improved performance to similar solutions such as Impala.
Key Features
  • Datasource agnostic and dynamic schema discovery
  • Open ended storage interfacing
  • ANSI SQL compliance
  • UDFs and UDAs
  • Support for structured, semi-structured, and nested data
  • No requirement for centralized metadata
Use Case
  • Advanced existing or new Hadoop installations where a high level of extensibility, “plugability”, flexibility of data format, and overall performance are desired
  • Additionally, data systems that may already extend beyond or could foreseeably grow beyond an exclusively Hadoop infrastructure.

Making the Choice

The right solution for your Hadoop SQL, Big Data needs is out there. The trick comes in in choosing the right solution for your current infrastructure, the application and the data. If the current infrastructure is built on Hive, and the main goal is performance, then Hive on Tez or Hive on Spark is going to be a great way to go. If the current infrastructure is not on Hive, or needs less than or greater than the functionality that Hive provides, then Phoenix is a great and simple solution, likewise HAWQ and Drill will be excellent fix for more advanced systems. Or, finally, if your use case is more custom or you need help deciding consider calling on an objective consultancy like AgilData that can determine the best execution venue for your needs.

Big Data Services

Need Help?

Contact us if you need help choosing or implementing the best solution.

    (check all that apply)


Share This Post

Welcome to the AgilData Blog!

Find the latest news about AgilData, Big Data and tips and tricks straight from our engineering team.