Hadoop – Hive – What is the Version # of Hive Service and Clients that you are running?

Introduction

Hadoop is a speeding bullet.  You look online, Google for things, try it out, and sometimes you hit, but often you miss.

What do I mean by that?

Well this evening I was trying to play with Hive; specifically using Sqoop to import a table from MS SQL Server into Hive.

A bit of background, my MS SQL Server table has a couple of columns declared as datetime.

Upon running the Sqoop statement pasted below:



--connect "jdbc:sqlserver://sqlServerLab;database=DEMO" \
--username "dadeniji" \
--password "l1c0na" \
--driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
-m 1 \
--hive-import \
--hive-table "customer" \
--table "dbo.customer" \
--split-by "customerID"

 

The above command basically gives the following instruction set:

  • Via JDBC Driver (jdbc:sqlserver) connect to SQL Instance (sqlServerLab) and database Demo
  • Use the following SQL Server credentials — username – dadeniji, password – l1c0na
  • JDBC Driver’s Class name – com.microsoft.sqlserver.jdbc.SQLServerDriver
  • Number of Map Reduce Jobs (m 1)
  • Sqoop Operation — hive-import
  • Hive Table — customer
  • SQL Server Table — dbo.customer
  • Split-by — customerID

I noticed in the Sqoop console log output statements a couple of warnings:



INFO manager.SqlManager: Executing SQL statement: 
SELECT t.* FROM dbo.customer AS t WHERE 1=0

WARN hive.TableDefWriter: Column InsertTime had to be cast to a less 
precise type in Hive

WARN hive.TableDefWriter: Column salesDate had to be cast to a less 
precise type in Hive

Processing

Explore MS SQL Server

So I quick went back and looked at my SQL Server Table:

  use [Demo];
  exec sp_help 'dbo.customer';

Output:

Hadoop - Sqoop - MS SQL Server - dbo.customer

The output is congruent with my thoughts:

  • The InsertTime is a datetime column
  • The salesDate is a datetime column

Explore Hive

Launch Hive:

In shell, issue “hive” to initiate Hive Shell:


hive

List all tables:

To confirm that a corresponding table has been created in Hive, uses list


show tables;

Output:

Hadoop - Sqoop - Client - Show tables

Display Table Structure (customer):

Display table structure using describe:


Syntax:
    describe <table-name>;

Sample:

    describe customer;

Output:

Hadoop - Sqoop - Client - Describe -- customer

Explanation:

  • So it is obvious that our two original MS SQL Server Date columns (Inserttime and salesdate) were not brought in as Datetime, but String

So I am thinking why?

Hive Datatype Support

I know that the Timestamp column was not one of the original datatypes supported by Hive.  It was added per Hive version 0.8.0

This is noted in:

HortonWorks – Hive – Language Manual – Datatypes
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/language_manual/datatypes.html

Determine Hive Version

There are a couple of ways to get the Hive’s Server and Client Version Number

Determine Hive Version – Command Shell – Using ps

issue ps -aux



ps -aux | grep -i "Hive"

Output (Screen shot):

Hadoop - Hive - Version -- ps --aux

Output (Text):



hive     13767  0.0  1.9 841080 159768 ?       Sl   Apr15  17:00 /usr/java/default/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx268435456 -Djava.net.preferIPv4Stack=true -Xmx268435456 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/lib/hive-service-0.10.0-cdh4.2.0.jar org.apache.hadoop.hive.metastore.HiveMetaStore -p 9083

410      13853  0.0  0.2 2207844 22824 ?       Ss   Apr15   0:00 postgres: hive hive 10.0.4.1(56963) idle            

410      13854  0.0  0.1 2206552 8388 ?        Ss   Apr15   0:00 postgres: hive hive 10.0.4.1(56964) idle            

dadeniji 18749  0.0  1.8 814332 152732 pts/0   Sl+  May10   0:21 /usr/java/default/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx268435456 -Xmx268435456 -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/bin/../lib/hive/lib/hive-cli-0.10.0-cdh4.2.0.jar org.apache.hadoop.hive.cli.CliDriver

  • We have 4 processes bearing the “hive” name

Service Process

  • It is identifiable as a Hive Service via its name hive-service*.jar
  • It is running under the “hive” account name.  Its Process ID is 13767.  One of the Jar files referenced is /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/lib/hive-service-0.10.0-cdh4.2.0.jar 
  • The Cloudera Version# is 4.2 and Hive Version# is 0.10

Client Process

  • It is identifiable as a Hive Client via its name hive-cli*.jar
  • It is running under my username (dadeniji), as I kicked it off.  Its Process ID is 18749.  One of the Jar files referenced is /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/bin/../lib/hive/lib/hive-cli-0.10.0-cdh4.2.0.jar
  • The Cloudera Version# is 4.2 and Hive Version# is 0.10

“Postgress” Process

  • Hive’s uses an embedded postgress database
  • The processes are running under account 410

Determine Hive Version – Cloudera Manager Admin Console & Command Shell

  • Launch Web Browser
  • Connect to Admin console ( http://<clouderaManagerServices>:<port>).  In our case http://hadoopCMS:7180; as Cloudera Manager Service is running on a machine named hadoopCMS and we kept the default port# of 7180
  • The initial screen displayed in the Service Status page (/cmf/services/status)
  • Click on the service we are interested in (hive1)
  • The service’s specific “Status and Health Summary” screen is displayed.  In this case “Hive1 – Services and Health Summary” page
  • In the row labelled “Hive MetaStore Server” Click on the link underneath the “Status” column
  • This will bring you to the “hivemetastore” summary page.
  • For each Hive host, Hive process information and links the Hive Logs are displayed
  • On the “Show Recent Logs” row, click on “Full Stdout” log
  • The stdout.log appears – Here is break of what is provided

stdout.log


Mon Apr 15 21:06:24 UTC 2013
using /usr/java/default as JAVA_HOME

using 4 as CDH_VERSION

using /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive 
    as HIVE_HOME

using /var/run/cloudera-scm-agent/process/22-hive-HIVEMETASTORE 
    as HIVE_CONF_DIR

using /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop 
    as HADOOP_HOME

using /var/run/cloudera-scm-agent/process/22-hive-HIVEMETASTORE/hadoop-conf as HADOOP_CONF_DIR

Starting Hive Metastore Server

Java version

We quickly see that JAVA_HOME is defined as /usr/java/default.

To see what files constitute /usr/java/default

  ls /usr/java/default

Output:

Hadoop - Clopudera Manager - Java Version

Explanation:

  • /usr/java/default is symbolically linked to /usr/java/latest
  • /usr/java/latest is symbolically linked to /usr/java/jdk1.7.0_17 
Cloudera Distribution version

Based on the screen shot below, the CDH Version is 4

using 4 as CDH_VERSION
Hive Home

Based on the screen shot below, the Hive Home is /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive

using /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive as HIVE_HOME

Again, let us return to the command shell and see what files are in /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive

Please add /lib suffix to get to the Jar files and only get only jar files that have hive in their names.



ls /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/lib/hive*.jar

Output:

Hadoop - Cloudera Manager - Server - ls Sqoop Jar files

Cloudera Manager Admin Console – Service Status

Hadoop - Cloudera Manager - Services - Status

Cloudera Manager Admin Console – Hive1 – Status and Health Summary

Hadoop - Cloudera Manager - Services - Status and Health Summary

Cloudera Manager Admin Console – Hive1 – Status Summary

Hadoop - Cloudera Manager - Hive - Status

Cloudera Manager Admin Console – Hive1 – Status Summary – Log – Stdout.log 

Hadoop - Cloudera Manager - Hive - Status - Log - stdout

Conclusion

It thus appears that we are running a version of Hive (0.10) in this case that it did not support the TimeStamp datatype.

The problem can also be with the version of Sqoop we have running or Sqoop’s ability to detect SQL Server’s datetime datatype or datetime data representation in general.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s