Technical: Hadoop – Sqoop on Cloudera (CDH) – Is Sqoop Set up and Configured for MS SQL Server

Technical: Hadoop – Sqoop on Cloudera (CDH) – Is Sqoop Setup and configured for MS SQL Server?

Introduction

As part of my roadmap towards Hadoop understanding, I am looking to how to use Sqoop.

What is Sqoop?

The name Sqoop is an acronym for “SQL to Hadoop“.

 http://sqoop.apache.org/
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Document Current Configuration

Launch web browser and connect to your services hub:

http://cloudera-manager/cmf/services/status

Screen Shot

Cloudera - CDH4 - Manager

So here are the services:

  • hbase
  • hdfs
  • hive
  • hue
  • impala
  • mapreduce
  • oozie
  • zookeeper

No sqoop.  what to to?

Delivery Mechanism

The items listed above are all services. But, Sqoop is not a service – Services are always running.  Sqoop only needs to be activated when you need to import\export data.

Sqoop Packaging

To figure out how SQLOOP is packaged for our Hadoop version we consulted with:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/PkgVer/3.25.2013/CDH-Version-and-Packaging-Information/cdhvd_topic_3_1.html?scroll=topic_3_1

CDH4 Project Package Version Tarball Version
Apache Sqoop sqoop-1.4.2+59 sqoop-1.4.2-cdh4.2.1.tar.gz
Apache Sqoop2 sqoop2-1.99.1+32 sqoop2-1.99.1-cdh4.2.1.tar.gz

Is Sqoop installed?

Is Sqoop installed? The truth is I do not know.

So let us try the brute force approach.

Sqoop Version

Issue a command to check the Sqoop Version#



    Syntax:
      sqoop version

    Output:
      Sqoop 1.4.2-cdh4.2.0
      git commit id 8885ea41b504f0b8f16f56b43b083df592b192da
      Compiled by jenkins on Fri Feb 15 12:07:28 PST 2013

Screen shot:

Apache - sqoop - version

What does all this mean?

  • The version# is Sqoop 1.4.2-cdh4.2.0
  • The compile date is 2013-02-15 

So from everything, it looks like we are good.  The most recent sqoop for our distribution is installed and runnable.

Identify – Database Drivers

Background

Depending on which database we want to communicate with, we have to install specific Database Drivers.

As Hadoop is mostly Java based, JDBC Drivers in this case.

 

Identify – Database Drivers – MS SQL Server

Microsoft SQL Server JDBC Drivers are quite backward compatible; that is irrespective of the version of MS SQL Server you want to communicate with, you can usually get and install the latest drivers.

The latest Microsoft SQL Server is Microsoft JDBC Driver 4.0 for SQL Server.

It is available @ http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774

Here is a screen shot:

MS SQLServerDriver 4

Nice. One for Microsoft Windows (sqljdbc_4.0.2206.100_enu.exe), and another for everyone else (sqljdbc_4.0.2206.100_enu.tar.gz).

This one time we are the everyone else.

Download via web browser or mget, etc.

Copy Database Drivers to Hadoop Node

 

Copy database driver to working Hadoop Node



Syntax:

   --create new folder on remote host
   ssh -l <user> <host-name> "mkdir <remote-folder-name>"

  --scp file to remote host; into remote folder
  scp local-file-name hadoop-node:<remote-folder-name>

Sample:

   --make new folder /tmp/sqljdbc4
   ssh -l dadeniji hadoop-node "mkdir -p /tmp/sqljdbc4"

   --copy downloaded file to hadoop-node 
   scp /Users/daniel_adeniji/Downloads/sqljdbc_4.0.2206.100_enu.tar.gz \
     dadeniji@hadoop-node:/tmp/sqljdbc4

Extract compressed file (tar.gz):

extract compressed file.



Syntax:

   --make tar folder, the new current folder
  cd <dest-folder-name>

  --untar file, and place contents in current folder
  tar xvfz <file-name>

Sample:

   --cd folder
   cd /tmp/sqljdbc4

   --copy downloaded file to hadoop-node 
   tar xvfz  sqljdbc_4.0.2206.100_enu.tar.gz

Review extracted files:

review the extracted files.


Syntax:

   --review extracted folders/files
  ls -la sqljdbc_4.0/enu

Sample:
   ls -la sqljdbc_4.0/enu

 

File Listing:

Here is what our folder and file structure looks like for sqljdbc_4.0/enu:

sqljdbc files

Which file shall we choose:

As we have two files (sqljdbc4.jar and sqljdbc.jar), which one shall we choose.

Well it depends on which JRE is installed?

JDBC Version JRE Version File name
jdbc 3.0 JRE v5.0, v6.0, v7.0 sqljdbc.jar
jdbc 4.0 JRE v6.0, v7.0 sqljdbc4.jar

So what does all this mean?

  • Well JDBC is a specification. MS SQL JDBC4 supports both JDBC 3.0 and JDBC 4.0 specifications
  • Our choice of which file (sqljdbc.jar or sqljdbc4.jar) rests upon which JRE we are using

Let us go find out which JRE we are using?


Syntax:

  java -version

 

java jre -version

Well, easy enough, our version# is 1.70_17-b02

But, keep me honest here, let us draw this out.

Microsoft’s compatibility matrix refers specifically to JRE v5.0, v6.0, v7.0

And, so to be sure I want to know how version#1.7xx maps to JRE v 5, 6, or 7.

The best answer came via:

Java™ Platform Standard Edition 7 Names and Versions
http://www.oracle.com/technetwork/java/javase/jdk7-naming-418744.html

The web page says to take the number(s) following the first period (.) and take that has the Product Number (#); ie the marketing#

So we are running JRE Version 7 and thus we use either sqljdbc.jar or sqljdbc4.jar

We will go with sqljdbc4.0 and revert to sqljdbc if we encounter errors.

 

Identify Hadoop\sqoop Connection Points

Introduction

Now that we have the jar file, we need to know where to place it.

SQOOP User Guide
http://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html

You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL’s Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.

Where is $SQOOP_HOME/lib?

All this talk about Carmen Santiago is making me dizzy?  The toughest thing with me about Linux is there are so many ways to do the same thing…

If it is not symbolic links, it is alternatives

Let us go look for sqoop

which sqoop:



Syntax:

   which <find-name>

Sample:

   which sqoop

which sqoop

       /usr/bin/sqoop

list /usr/bin/sqoop



Syntax:

   ls -la <find-name>

Sample:

   ls -la /usr/bin/sqoop

Screen print:

sqoop - ls

       /usr/bin/sqoop is /etc/alternatives/sqoop

Listing /etc/alternatives/sqoop

get Listing for /etc/alternatives/sqoop



Syntax:

   ls -la <alternative-name>

Sample:

   ls -la /etc/alternatives/sqoop

Screen print:

sqoop - etc--alternatives--sqoop
      /etc/alternatives/sqoop 

—- is —

       /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/bin/sqoop

Listing /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/bin

I already know that I probably should not bother to look for the lib folder ( or jar files) in a bin folder.  But, I am in learning mode, and so I issue:


ls -la /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/bin


sqoop - ls parcels-sqoop-bin

So there you go, no lib, no jar files

 

Google for Sqoop Libraries

Googled for help and found “where-is-the-sqoop-library-directory” ( http://stackoverflow.com/questions/15502220/where-is-the-sqoop-library-directory )

Mad Echet answered his own question:

/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib

Confirm Sqoop Library Folder

Get folder listing for /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib



ls -la /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib

Listing for /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib folder listing

Folder Listing 

sqoop - ls parcels-sqoop-lib

Configure Hadoop\sqoop Connection Points

 

Configure Hadoop\sqloop Connection Points – Microsoft SQL Server

Copy Microsoft SQL Server Files

The Microsoft SQL Server files are primarily Jar files. In our case, a lone Jar file; named sqljdbc4.jar

Let us go copy it:

syntax:
sudo cp <sql-jdbc-extracted-file> <sqoop-lib

sudo  cp /tmp/sqljdbc4/sqljdbc_4.0/enu/sqljdbc4.jar \
        /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib

Nomenclature

Nomenclature – Microsoft SQL Server

Database / Driver Specific

SQL Server:

Item Syntax Sample
JDBC-URL jdbc:sqlserver://[serverName[\instanceName][:portNumber] jdbc:sqlserver://hrdb
Connection Manager com.microsoft.sqlserver.jdbc.SQLServerDriver
Driver com.microsoft.sqlserver.jdbc.SQLServerDriver com.microsoft.sqlserver.jdbc.SQLServerDriver

Sample JDBC-URL:

  • Specifies specific database name (hrdb)
    “jdbc:sqlserver://hrdb;database=demo”
    ( be sure to enclose in double-quotes)

Processing

Processing – Microsoft SQL Server

Command – List Databases

 

Syntax 
      sqoop list-databases --connect jdbc:sqlserver://sqlserver-name 
         / --username <username> 
         / --password <password> 
         / --driver <driver-manager-class> 

Sample 
      sqoop list-databases --connect jdbc:sqlserver://labDB
          / --username sqoop 
          / --password simp1e 
          / --driver com.microsoft.sqlserver.jdbc.SQLServerDriver

Output:


WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

WARN sqoop.ConnFactory: Parameter --driver is set to an explicit driver however appropriate connection manager is not being set (via --connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly which connection manager should be used next time.

INFO manager.SqlManager: Using default fetchSize of 1000

ERROR manager.SqlManager: Generic SqlManager.listDatabases() not 
implemented.
Could not retrieve database list from server

ERROR tool.ListDatabasesTool: manager.listDatabases() returned null

Command – List tables

List the tables

 

Syntax 
      sqoop list-tables --connect jdbc:sqlserver://sqlserver-name 
         / --username <username> 
         / --password <password> 
         / --driver <driver-manager-class> 

Sample 
      sqoop list-tables --connect jdbc:sqlserver://labDB
          / --username sqoop 
          / --password simp1e 
          / --driver com.microsoft.sqlserver.jdbc.SQLServerDriver

Command – eval

Execute SQL Command

 

Syntax 
      sqoop eval 
          --connect jdbc:sqlserver://sqlserver-name /
          --username <username> /
          --password <password> /
          --driver   <driver-manager-class> / 
          --query    <query>

Sample 
      sqoop eval 
          --connect jdbc:sqlserver://labDB /
          --username sqoop /
          --password simp1e / 
          --driver com.microsoft.sqlserver.jdbc.SQLServerDriver /
          --query /
          " select schema_name(uid) as schemaName, name, crdate /
          from dbo.sysobjects where type = 'U'  "

Output:

Hadoop - sqloop - Execute Query - sys.objects

5 thoughts on “Technical: Hadoop – Sqoop on Cloudera (CDH) – Is Sqoop Set up and Configured for MS SQL Server

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s