Hadoop – Sqoop – Importing Data (from Microsoft SQL Server)

Pre-requisites

Hopefully, you have already installed Hadoop and Sqoop and validated that they are set up and running properly.

If not, please read Technical: Hadoop – Sqoop on Cloudera (CDH) – Is Sqoop Set up and Configured for MS SQL Server (https://danieladeniji.wordpress.com/2013/05/03/technical-hadoop-sqoop-on-cloudera-cdh/ )

Command – import

Copy data from Database Table to HDFS File System

In the example below, our database and HDFS configuration is:

  • Server name: labDB
  • Database name: demo
  • SQL user name: sqoop
  • SQL password: simp1e
  • Driver class name: com.microsoft.sqlserver.jdbc.SQLServerDriver
  • Table: dbo.customer
  • Target directory: /tmp/dbo-customer (HDFS folder name)
 

Syntax:
      sqoop import --connect "jdbc:sqlserver://<server-name>;database=<database-name>" \
          --username <username> \
          --password <password> \
          --driver   <driver-manager-class> \
          --table    <table-name> \
          --target-dir  <target-folder-name>

Sample: 
      sqoop import --connect "jdbc:sqlserver://labDB;database=demo" \
           --username sqoop  \
           --password simp1e \
           --driver   com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --table    "dbo.customer" \
           --target-dir  "/tmp/dbo-customer"
          

Error Messages:

  • ERROR tool.ImportTool: Error during import: No primary key could be found for table dbo.customer.


13/05/04 04:14:37 ERROR tool.ImportTool: Error during import: No primary 
key could be found for table dbo.customer. Please specify one 
with --split-by or perform a sequential import with '-m 1'.
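
As the message states, the quickest remedies are either a sequential import with a single mapper, or naming a split column explicitly. A sketch of both, reusing the connection values from the sample above (the customerID split column is the one used later in this post):

      # Option 1: sequential import with a single mapper
      sqoop import --connect "jdbc:sqlserver://labDB;database=demo" \
           --username sqoop \
           --password simp1e \
           --driver   com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --table    "dbo.customer" \
           --target-dir  "/tmp/dbo-customer" \
           -m 1

      # Option 2: parallel import with an explicitly named split column
      sqoop import --connect "jdbc:sqlserver://labDB;database=demo" \
           --username sqoop \
           --password simp1e \
           --driver   com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --table    "dbo.customer" \
           --target-dir  "/tmp/dbo-customer" \
           --split-by "customerID"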

Screen Shot:

Hadoop -sqoop - Import - Error - No primary key

Captured SQL Profiler:

Hadoop -sqoop - Import - Error - No primary key

Interpretation:

  • From the SQL Server Profiler output, it seems that Sqoop is not even trying to determine whether primary keys exist

 

Microsoft – SQL Server – Documented path of determining Primary Keys:

The approach for determining a table's primary key, and the columns that are part of that key, is discussed here:

https://danieladeniji.wordpress.com/2013/05/08/technical-microsoft-sql-server-get-primary-key-columns/
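
One quick way to confirm, from the Sqoop host itself, whether SQL Server reports a primary key for the table is sqoop eval, which runs a statement against the source database and prints the result. A sketch against the sample table; the INFORMATION_SCHEMA query below is a generic primary-key lookup of my own, not necessarily the exact query discussed in the linked post:

      sqoop eval \
           --connect "jdbc:sqlserver://labDB;database=demo" \
           --username sqoop \
           --password simp1e \
           --driver   com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --query "SELECT kcu.COLUMN_NAME
                    FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
                    JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
                      ON kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
                     AND kcu.TABLE_NAME = tc.TABLE_NAME
                    WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
                      AND tc.TABLE_SCHEMA = 'dbo'
                      AND tc.TABLE_NAME = 'customer'"

If the query returns no rows, the table has no primary key, and the --split-by or -m 1 workaround above applies.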

Permission Error

Depending on the user you're running Sqoop as, and which operations you're trying to perform, you might experience permission problems like the one pasted below:



ERROR security.UserGroupInformation: PriviledgedActionException as:
dadeniji (auth:SIMPLE) 
cause:org.apache.hadoop.security.AccessControlException: 
Permission denied: user=dadeniji, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check

ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.security.AccessControlException: Permission denied: user=dadeniji, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:205)

Resolution:

  • Sqoop communicates directly with the recipient data store, in this case HDFS, so we use sudo to run the import as the HDFS superuser (hdfs).  Obviously, your Cloudera distribution's default account, or your installation choices, might mean that HDFS runs under a different account name.


Syntax 
    sudo -u hdfs  sqoop import  \
          --connect "jdbc:sqlserver://<server-name>;database=<database-name>" \
          --username <username> \
          --password <password> \
          --driver   <driver-manager-class> \
          --table    <table-name> \
          --target-dir  <target-folder-name> \
          --split-by <column-name>

Sample 
      sudo -u hdfs sqoop import \
           --connect "jdbc:sqlserver://labDB;database=demo" \
           --username sqoop \
           --password simp1e \
           --driver   com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --table    "dbo.customer" \
           --target-dir  "/tmp/dbo-customer" \
           --split-by "customerID"

Parallel Tasks

To drive efficiency, Sqoop can be configured to run multiple child processes (map tasks).  When asked to run this way, Sqoop examines the targeted data, determines the highest and lowest values of the split column, and places dividing markers along that range.
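
As an illustration (the boundary numbers below are made up, but the shape matches the Output sections later in this post), suppose we split dbo.Orders on orderID with four mappers:

   # Sqoop first asks the database for the split column's boundaries:
   #     SELECT MIN(orderID), MAX(orderID) FROM dbo.[Orders]
   # Suppose that returns MIN = 1 and MAX = 1000.  With -m 4, each map
   # task is then handed roughly one quarter of that range:
   #     mapper 1:  orderID >= 1    AND  orderID < 251
   #     mapper 2:  orderID >= 251  AND  orderID < 501
   #     mapper 3:  orderID >= 501  AND  orderID < 751
   #     mapper 4:  orderID >= 751  AND  orderID <= 1000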

$CONDITIONS

It is a bit tricky for the system to know where to place the parallelism SQL stub in a submitted query statement.

SQL statements can be notoriously complex and can contain JOINs and subqueries.  To allow the system to know precisely where to place the code that supports parallelism, the Sqoop framework introduces the $CONDITIONS token: Sqoop substitutes each map task's split condition in place of $CONDITIONS in your query.

The Dollar ($) Sign

Unfortunately, the dollar ($) sign is a special character in Bash scripts: inside double quotes, the shell will try to expand $CONDITIONS as a shell variable before Sqoop ever sees it.

 

  • If you are using double quotes, place a backslash in front of $CONDITIONS.  The backslash tells the shell not to expand the $ as a variable reference, so the literal text $CONDITIONS reaches Sqoop.
  • Single quotes, on the other hand, do not need the backslash, because the shell does not expand variables inside single quotes.

# single quotes: $CONDITIONS does not need to be escaped
--query ' select top 100 * from dbo.[Orders] where $CONDITIONS ' \
   --split-by callID -m 10

# double quotes: $CONDITIONS needs to be escaped
--query " select top 100 * from dbo.[Orders] where \$CONDITIONS " \
   --split-by callID -m 10
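
Putting the pieces together, a full free-form-query import might look like the sketch below (double-quoted, so $CONDITIONS is escaped). The labDB/demo connection values are reused from the first sample, dbo.[Orders] and orderID match the later examples, and the target directory name is my own:

      sudo -u hdfs sqoop import \
           --connect "jdbc:sqlserver://labDB;database=demo" \
           --username sqoop \
           --password simp1e \
           --driver   com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --query    "select top 100 * from dbo.[Orders] where \$CONDITIONS" \
           --split-by "orderID" \
           -m 10 \
           --target-dir "/tmp/dbo-orders"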


--num-mappers (or -m)

One can specify the number of parallel processes (map tasks) by using the -m or --num-mappers argument.

The system goes out and gets the minimum and maximum values for the “Split By” column.

Conceptually, to arrive at the number of records in each split, a formula like the one below is used:

    split size = ( MAX(split-by column) - MIN(split-by column) ) / number of mappers

Parallel Tasks – Table



Syntax:

   sudo -u hdfs sqoop import \
        --connect <jdbc:sqlserver://DBServer;database=database> \
        --username <sqlUser> \
        --password <sqlPassword> \ 
        --driver  com.microsoft.sqlserver.jdbc.SQLServerDriver \
        --verbose \
        --table <table-name> \
        --split-by <split-column> \ 
        --where <where-clause> \
        -m <# of buckets> \
        --target-dir <hdfs-url>

Sample:

     sudo -u hdfs sqoop import \
           --connect "jdbc:sqlserver://DBServer;database=database" \
           --username "sammie" \
           --password "p1smith" \
           --driver  com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --verbose  \
           --table "dbo.Orders" \
           --split-by "orderID" \
           --where "1=1" \
           -m 10  \
           --target-dir  "/tmp/dbo__order__20130504"

Output:


INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(orderID), 
MAX(orderID) FROM dbo.[Orders] WHERE ( 1=1 )

WARN db.TextSplitter: Generating splits for a textual index column.

WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

WARN db.TextSplitter: You are strongly encouraged to choose an integral 
split column.

DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound 'orderID >= 'minimumID' and upper bound 'orderID < 'maximumID' ''

Parallel Tasks – Query



Syntax:

   sudo -u hdfs sqoop import \
        --connect <jdbc:sqlserver://DBServer;database=database> \
        --username sqlUser \
        --password sqlPassword \ 
        --driver  com.microsoft.sqlserver.jdbc.SQLServerDriver \
        --verbose  \
        --query <query clause where $CONDITIONS>   \
        --split-by <split-column> \
        --where <where-clause> \
        -m 10  \
        --target-dir <hdfs-url>

Sample:

     sudo -u hdfs sqoop import \
           --connect "jdbc:sqlserver://hadoopHRLab;database=dbHR" \
           --username "sammie" \
           --password "p1smith" \
           --driver  com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --verbose  \
           --query \
              "select * from dbo.orders where \$CONDITIONS" \
           --split-by "orderID" \
           --where " state='CA' " \
           -m 10  \
           --target-dir  "/tmp/dbo__order__20130504"

Output:



WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(orderID), MAX(orderID) FROM (select top 100 * from dbo.[Orders] where (1 = 1) ) AS t1

WARN db.TextSplitter: Generating splits for a textual index column.

WARN db.TextSplitter: If your database sorts in a 
case-insensitive order, this may result in a partial import or duplicate 
records.

WARN db.TextSplitter: You are strongly encouraged to choose an integral 
split column.

Important Notes:

  • Using textual (string) values as the split column is disfavored, and justifiably so.
  • The results of splitting on a string column can vary depending on the database's collation and sort order.

Command – Import (HBase)

Insert into HBase Table



Syntax:

   sudo -u hdfs sqoop import \
        --connect <jdbc:sqlserver://DBServer;database=database> \
        --username <sqlUser> \
        --password <sqlPassword> \
        --driver  com.microsoft.sqlserver.jdbc.SQLServerDriver \ 
        -m <# of parallelism> \ 
        --hbase-table <hbase-tableName> \ 
        --column-family <column-family> \
        --table <database-tableName> \
        --split-by <database-column-to-split-by>

Sample:

      sudo -u hdfs sqoop import \
           --connect  "jdbc:sqlserver://labDB;database=DEMO"  \
           --username "sammie" \
           --password "p1smith" \
           --driver "com.microsoft.sqlserver.jdbc.SQLServerDriver" \
           -m 1 \
           --hbase-table "customer" \
           --column-family "cf1" \
           --table "dbo.customer"  \
           --split-by "customerID" 

Validation

Validation – HDFS

If you're importing into HDFS, here are some commands you can use to clear your destination folder, list your output folder structure, and display the resulting output files.

Validation – HDFS – FS (-rmr or -rm -r)

Remove the existing HDFS destination folder, if it exists:



Syntax 
    sudo -u hdfs hadoop fs -rm -r <folder>

Sample 
    sudo -u hdfs hadoop fs -rm -r /tmp/db-customer--201305060127PM

Validation – HDFS – FS (-ls)



Syntax 
    sudo -u hdfs  hadoop fs -ls <folder>

Sample 
    sudo -u hdfs hadoop fs -ls /tmp

Output:

hadoop-fs-import-hdfs-ls

Validation – HDFS – FS (-cat)



Syntax 
    sudo -u hdfs  hadoop fs -cat <file>

Sample 
      sudo -u hdfs hadoop fs -cat /tmp/db-customer--201305060101PM/part-m-00000

Output:

hadoop-fs-import-hdfs
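
A simple sanity check is to compare the number of lines across the imported part files with the row count in SQL Server (for example, SELECT COUNT(*) FROM dbo.customer). A sketch, assuming the import landed in the folder used above:

      # total record count across all map-task output files
      sudo -u hdfs hadoop fs -cat /tmp/db-customer--201305060101PM/part-m-* | wc -l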

Bugs

Splitting by String Column (with special characters)

When we attempted to split-by column “customerID”:

Hadoop - sqoop - java.lang.ArrayIndexOutOfBoundsException (v2) -- InputData

we received the error listed below:

Sqoop: java.lang.ArrayIndexOutOfBoundsException

Screen Shot:

Hadoop - sqoop - java.lang.ArrayIndexOutOfBoundsException (v2)

Text Output:



INFO mapreduce.ImportJobBase: Beginning import of dbo.customer
13/05/06 20:38:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM dbo.customer AS t WHERE 1=0

WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(customerID), MAX(customerID) FROM dbo.customer

WARN db.TextSplitter: Generating splits for a textual index column.

WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

WARN db.BigDecimalSplitter: Set BigDecimal splitSize to MIN_INCREMENT
13/05/06 20:38:40

INFO mapred.JobClient: Cleaning up the staging area hdfs://hbDesk:8020/user/hdfs/.staging/job_201304152105_0042

ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.ArrayIndexOutOfBoundsException: -1
java.lang.ArrayIndexOutOfBoundsException: -1
	at java.util.ArrayList.elementData(ArrayList.java:371)
	at java.util.ArrayList.get(ArrayList.java:384)
	at org.apache.sqoop.mapreduce.db.BigDecimalSplitter.split(BigDecimalSplitter.java:143)
	at org.apache.sqoop.mapreduce.db.TextSplitter.split(TextSplitter.java:151)
	at org.apache.sqoop.mapreduce.db.TextSplitter.split(TextSplitter.java:113)
	at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:164)
	at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1064)
	at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1081)
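
The simplest way around this is the sequential import (-m 1) shown earlier, which bypasses the splitter entirely. If parallelism is still wanted, another approach is to have SQL Server supply a numeric surrogate to split on. The sketch below is my own illustration (the rowNum alias and the derived-table wrapper are not from the original post); wrapping the table in a derived query lets the generated split conditions reference the numeric column:

      sudo -u hdfs sqoop import \
           --connect "jdbc:sqlserver://labDB;database=demo" \
           --username sqoop \
           --password simp1e \
           --driver   com.microsoft.sqlserver.jdbc.SQLServerDriver \
           --query    "select * from ( select c.*, ROW_NUMBER() OVER (ORDER BY customerID) AS rowNum from dbo.customer c ) x where \$CONDITIONS" \
           --split-by "rowNum" \
           --target-dir "/tmp/dbo-customer" \
           -m 4

      # Note: rowNum is imported as an extra column, and the ORDER BY column
      # should be unique so that row numbering is stable across map tasks.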

On the other hand, if our database table's "customerID" values are free of special characters, as in the data shown below, splitting by "customerID" succeeds:

Hadoop - sqoop - Input Data (good)

Good Split (Screen Shot):

Hadoop - Good Splits

Good Split (Textual)



INFO mapreduce.ImportJobBase: Beginning import of dbo.customer

INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM dbo.customer AS t WHERE 1=0

WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(customerID), MAX(customerID) FROM dbo.customer

WARN db.TextSplitter: Generating splits for a textual index column.

WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

WARN db.TextSplitter: You are strongly encouraged to choose an integral 
split column.

Tips

In console mode, hide benign output:

Quite a few console output entries show up when one issues a Sqoop command.  To hide some of the extraneous data, try grepping for WARN, ERROR, EXCEPTION, or java.io.IOException.

TLDP.org — 3 file descriptors, stdin, stdout and stderr (std=standard)

http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html



Syntax:
    command 2>&1 | egrep  -i  "<entry1>|<entry2>|<entry3>"

Sample:
   sh exportHBASE.sh 2>&1 | egrep -i \
      "WARN|ERROR|EXCEPTION|java.io.IOException:"


5 thoughts on “Hadoop – Sqoop – Importing Data (from Microsoft SQL Server)”

  1. Hi, first let me personally thank you for this outstanding post; it helped a lot. A few observations:
    1) At times it did not work if I gave the table name as dbo.tablename, so I removed the "dbo." prefix and then it worked.
    2) If a column has a null value it is not handled; maybe some patch is required, I guess.

    • Gonnes:

      Sorry, I am a bit busy at this time and unfortunately do not have access to a Linux box.

      But how are you designating null columns in Hadoop\HBase?

      Is the column always null, or only in some cases?

      Also, does Sqoop support importing from views? If so, maybe you can create a view and import from the view instead.

      You might also consider creating a staging table: import into the staging table, insert\select from staging into the main table, then truncate\drop the staging table.

      Sorry, but I am not currently actively using Hadoop\Sqoop.

      Daniel

  2. Hi Daniel,

    Your tutorial was really helpful in getting me started on Sqoop.
    I am still getting the “Sqoop: java.lang.ArrayIndexOutOfBoundsException” as long as “special characters” appear in the record. In the example posted, you changed “[” to “A” to resolve the error; however, what would you do in the case of a GUID field of VARCHAR type? Is there a general/better solution in which we can keep the special characters as part of the record?
    Thanks,

  3. Hi Danieladenji,

    I am trying to connect MS SQL Server to Hadoop and I am getting an error. I have spent many days on this error; would you please help me find a solution? When I use a Sqoop command to list the databases in SQL Server, I get the error given below:

    com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host 192.168.1.26, port 1433 has failed. Error: “connect timed out. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.”.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:191)
    at com.microsoft.sqlserver.jdbc.SQLServerException.ConvertConnectExceptionToSQLServerException(SQLServerException.java:242)
    at com.microsoft.sqlserver.jdbc.SocketFinder.findSocket(IOBuffer.java:2280)
    at com.microsoft.sqlserver.jdbc.TDSChannel.open(IOBuffer.java:493)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1387)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:1068)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:904)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:451)
    at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:1014)
    at java.sql.DriverManager.getConnection(DriverManager.java:571)
    at java.sql.DriverManager.getConnection(DriverManager.java:233)
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:883)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.CatalogQueryManager.listDatabases(CatalogQueryManager.java:57)
    at org.apache.sqoop.tool.ListDatabasesTool.run(ListDatabasesTool.java:49)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
    16/07/27 16:35:24 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host 192.168.1.26, port 1433 has failed. Error: “connect timed out. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.”.
    java.lang.RuntimeException: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host 192.168.1.26, port 1433 has failed. Error: “connect timed out. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall

    • Mani:

      I will suggest the following.

      1) From the host where Hadoop is running, try to telnet to the MS SQL Server Host
      telnet 192.168.1.26 1433

      2) Write a quick Java application that references the same JDBC jar file (com.microsoft.sqlserver.jdbc) to connect to SQL Server

      3) Check your MS SQL Server error log and see whether connection errors are appearing. First, make sure that your SQL Server's "Login auditing" setting is configured for "Failed logins only" or "Both failed and successful logins"

      4) Run SQL Server Profiler on the SQL Server Host

      With thanks,

      Daniel
