Administering Hadoop and Hive for Tableau Connectivity
Tableau supports visualizing large, complex data stored in Cloudera, Hortonworks, MapR, and, starting in 9.0, Amazon EMR (Elastic MapReduce) distributions using Hive and the data source's ODBC driver. This article is a quick reference for developers, IT administrators, and advanced Tableau users who are comfortable configuring Hadoop software components. Review the following sections so that you can use Hive effectively with Tableau.
- Prerequisites and external resources
- Set up Hive for use with Tableau
- Hive concepts for new admins
- Known limitations
Prerequisites and external resources
You must have one of the following distributions:
- Cloudera. Refer to the Cloudera Downloads page and click the CDH Download Now button for more information: http://www.cloudera.com/content/cloudera/en/downloads.html
- Hortonworks. Refer to the Hortonworks Downloads page for more information: http://hortonworks.com/download
- MapR Enterprise Edition (M5). Refer to the Download MapR Apache Hadoop M5 Enterprise Edition page for more information: http://www.mapr.com/products/mapr-editions/m5-edition
- Amazon EMR. Refer to the Amazon EMR page for more information: http://aws.amazon.com/elasticmapreduce
You may find the following external sources of information helpful when administering Hive for use with Tableau:
- Cloudera documentation: https://www.cloudera.com/content/cloudera/en/documentation.html
- Hortonworks documentation: http://docs.hortonworks.com
- MapR documentation: http://www.mapr.com/doc/display/MapR/Home
- Amazon EMR documentation: http://aws.amazon.com/documentation/elastic-mapreduce
- The Apache Hive Wiki provides a language manual and includes many important aspects of linking Hive to the data in your Hadoop cluster: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Set up Hive for use with Tableau
Setting up Hive for use with Tableau is straightforward because Tableau does not rely on the more complex aspects of running Hive in a long-term production environment. Before you begin, Hive must be installed on your cluster.
Starting the Hive service
Using a terminal interface to your Hadoop cluster, type the following command:
hive --service hiveserver
Move the Hive service to the background
The above command terminates when you exit your Hadoop terminal session, so you may need to keep the Hive service running persistently. To move the Hive service into the background, type the following command:
nohup HIVE_PORT=10000 hive --service hiveserver &
Note: For long-term usage, you will want to configure an automatic process for starting Hive along with the cluster itself.
Derby as the default metadata store
Hive metadata describes the structure and location of Hive tables. This information must be stored somewhere with persistent read/write access. By default, Hive holds this metadata in Derby, a simple embedded database.
While Derby cannot support concurrent use by multiple instances of Hive, the Hive service operates as the single point of access for external clients like Tableau. The Hive service supports concurrent access by multiple external clients while only operating on a single instance of the Derby metadata database. If you anticipate long-term production use of Hive, you may consider a multi-user metadata repository such as a PostgreSQL database. This will not impact the way Tableau interacts with Hive, and can be pursued independently of quickly getting your users started with Hive.
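As a sketch, pointing the metastore at PostgreSQL involves setting the JDBC connection properties in hive-site.xml. The host name, database name, and credentials below are placeholders; substitute values for your environment:

```xml
<!-- hive-site.xml: illustrative metastore settings for a PostgreSQL backend -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://metastore-host:5432/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```

The PostgreSQL JDBC driver JAR must also be available on the Hive classpath for these settings to take effect.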
Hive concepts for new administrators
This section covers some aspects of Hive that will help you learn how Hive relates to Tableau.
SQL concepts in Hive: schemas, tables, and metadata
HiveQL is a partial implementation of the SQL standard that provides a familiar language for analysis and acts as an integration point for BI tools like Tableau. This includes a traditional way to view data sets as organized by a namespace (interchangeably called schemas or databases in Hive) and by tables. A namespace corresponds to a directory in the Hadoop file system, and the tables are subdirectories under their respective namespace. Unless otherwise indicated, new tables are created in the "default" namespace in Hive. The tables may contain data blocks, or they may be linked to separate file system data as described in the next section.
A table in Hive operates as the boundary between how a data set should be parsed and processed, and how a data set should be presented as a typed, relational table. The relational presentation of data is covered by HiveQL syntax, which borrows from SQL a series of column names and data types. The metadata describing a table is stored in the metadata repository.
Linking Hive to file system data sets
If you have data in the Hadoop file system but are just starting with Hive, it is easy to pair the two. Hive has a powerful feature called external tables, which links to existing data in the file system without migrating the data into the Hive table structure. You can rewrite the table definition to accommodate an evolving schema as new data is added to the external location. When paired with complete control over how the data should be parsed and processed, this strategy offers the full flexibility of Hadoop's schema-last benefits, while still imposing structure on the data so it can be used with tools like Tableau.
Refer to the Hive Language Manual on DDL for more information about the
CREATE EXTERNAL TABLE ... LOCATION hdfs_path
syntax: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Extensibility via UDFs, UDAFs, and custom Map/Reduce
Hive enables users to work with user-defined functions (UDFs) for scalar data and user-defined aggregate functions (UDAFs) for aggregate data similar to other SQL data management systems. These custom computations can be implemented as scripts or as Java code compiled into JAR files. Refer to the Extra Capabilities for Hadoop Hive article for more information about how analysts may use these directly within tables.
The Map/Reduce underpinnings of Hadoop are also available through Hive, and are described in the Hive Language Manual on Transform and Map/Reduce: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
Known limitations
The following list contains known issues with Hive. These limitations and suggested workarounds are also discussed from the Tableau user perspective in the Connecting to Hadoop Hive article.
Authentication
The Hive ODBC driver does not offer the means to pass credentials through to the Hive service to authenticate the user for the connection. Refer to the Connecting to Hadoop Hive article to understand how Tableau offers data security management despite this limitation.
Query optimizations
While Hive has a number of sophisticated query optimizations, most are turned off by default. This is largely because these optimizations can degrade functionality or performance in certain circumstances, and should only be turned on when they are applicable to a given task. The Tableau feature called Initial SQL will execute a series of user-provided statements at the beginning of a connection, so your users can set Hive configuration variables to enable optimizations as needed.
Query cancellation and progress
The Hive service does not relay information about query progress to Tableau, nor does it provide an interface for canceling queries. While your users can choose to abandon a query in order to accomplish urgent tasks (such as saving their work), the query will continue to execute until it is complete on the Hadoop cluster. Although this feature gap may be addressed by the Hadoop community in the future, your users may need an interface for canceling the queries that they initiated.