Technical: Hadoop – Hive – Loading Data

Introduction

As a data-warehouse system, Hive does not support the traditional single-row INSERT statements found in transactional databases.

Data inserts have to be performed in bulk.  One of those avenues is the built-in LOAD DATA statement, whose general form is sketched below.
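
For orientation, here is the general shape of that statement; the path and table name are placeholders, not files from this walkthrough:

    -- General form of the Hive bulk-load statement.
    -- LOCAL reads the file from the client filesystem; without it, the path is taken from HDFS.
    -- OVERWRITE replaces the existing table contents instead of appending.
    LOAD DATA [LOCAL] INPATH '/path/to/file' [OVERWRITE] INTO TABLE target_table;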

Data Model – Logical

I looked online for sample data models that represent familiar data entities, and the one I finally settled on is a book entity.

It happens to be an XML representation:

http://msdn.microsoft.com/en-us/library/windows/desktop/ms762271(v=vs.85).aspx

I think we all remember when XML was going to take over the world, as it crosses OS and cultural boundaries due to its preciseness.

I am not sure how all that argument is now being jettisoned in favor of NoSQL.

Nevertheless, here is the Microsoft definition of a book entity:

Item          Data Type
author        String
title         String
edition       String
genre         String
price         Decimal
publishDate   TimeStamp
description   String
haveit        Bit

Data Model – Physical

Here is Hive’s implementation of our logical data model.

Item          Data Type
author        string
title         string
edition       string
genre         string
price         decimal (Hive >= 0.11); float (Hive < 0.11)
publishDate   timestamp (Hive >= 0.8.0); string (Hive < 0.8.0)
description   string
haveit        binary (Hive >= 0.8); tinyint (Hive < 0.8)
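
As a side note (not used in the tables below), Hive 0.13 and later also allow an explicit precision and scale on DECIMAL; a minimal fragment:

    -- Hive 0.13+: DECIMAL with explicit precision and scale.
    -- Hive 0.11/0.12 accept only a bare DECIMAL; before 0.11, FLOAT is the usual fallback.
    price  decimal(10,2)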

Hive – Create Table

Prepare – Create Table Statement

It is a bit easier to just use an editor to enter your DDL statements.

We used vi to create /tmp/hive/helloWorld/dbBook__CreateTable__Delimited.sql.


      drop table bookCharacterDelimited;

      create table bookCharacterDelimited
      (
           author      string
         , title        string
         , edition      string
         , genre        string
         , price        decimal
         , publishDate  timestamp
         , description  string
         , haveit       binary
         , acquireDate  timestamp

    )

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '~'
    STORED AS TEXTFILE 
    ;

    quit;

Actual – Execute Create Table Statement

Once we have a file that contains our create statement, we invoke the Hive client and indicate that we would like to submit a SQL file via -i.



   hive -i /tmp/hive/helloWorld/dbBook__CreateTable__Delimited.sql
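
To confirm the table actually exists (assuming the default database is being used), a quick check with the same client:

    hive -e "show tables;"
    hive -e "describe bookCharacterDelimited;"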

Hive – Table – Data

Create a text file named /tmp/book.txt with the rows below.  Note that many of the rows intentionally leave off trailing fields; Hive simply returns NULL for the missing columns when the table is queried.



Mohammed Khaled~Valley of the doors~1~Fiction1~10.45~2013-01-01~Afghan~0
Jackie Collins~Accident~2~12.56~TECH~27.87~2012-01-17~Lovely~1
Uwem Akpan~Say you're one of them~Fiction
Elizabeth Berg~Open House~Fiction
Kaye Gibbons~A Virtuos Woman
Khaled Hosseini~And the mountains echoed
Valorie Schaefer~The Care and Keeping of you~1
Malcom Gladwell~Outliers
Carmen Reinhart & Kenneth Rogoff~This time is different: Eight Centuries of financial folly
Steven Pinker~The Better Angels of our Nature: Why Violence has declined
Vivien Stewart~A World-Class Education

Hive – Table – Data – Load

Invoke the Hive client (hive) to load the data:


hive -e "load data local inpath '/tmp/book.txt' into table bookCharacterDelimited;"
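
For reference, LOCAL reads the file from the client's local filesystem; without it, Hive moves the file from HDFS, and OVERWRITE replaces any rows already in the table.  The HDFS path below is purely illustrative:

    # Illustrative HDFS path; LOAD DATA without LOCAL moves the file within HDFS into the table's directory.
    hive -e "load data inpath '/user/hive/staging/book.txt' overwrite into table bookCharacterDelimited;"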

Hive – Table – View Data

Use the Hive query tool to view the data:


hive -e "select * from bookCharacterDelimited;"
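
Any other HiveQL query works against the table as well, for example pulling just the author and title columns for the first few rows:

    hive -e "select author, title from bookCharacterDelimited limit 5;"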

Hive – Create Table – Problem Areas

Introduction

Let us try using a different delimiter (;)

Prepare – Create Table Statement

The create statements are placed in /tmp/hive/helloWorld/dbBook__CreateTable__Delimited__SemiColon.sql:


      drop table bookCharacterDelimited__SemiColon;

      create table bookCharacterDelimited__SemiColon
      (
           author      string
         , title        string
         , edition      string
         , genre        string
         , price        decimal
         , publishDate  timestamp
         , description  string
         , haveit       binary
         , acquireDate  timestamp

    )

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE 
    ;

    quit;

Actual – Execute Create Table Statement

Once we have a file that contains our create statement, we invoke the Hive client and indicate that we would like to submit a SQL file via -i.



hive -i /tmp/hive/helloWorld/dbBook__CreateTable__Delimited__SemiColon.sql

Output (Textual):

  • FAILED: ParseException line 19:22 mismatched input '<EOF>' expecting StringLiteral near 'BY' in table row format's field separator

Output (Pictorial):

[Screenshot: Hive -- CreateTable -- Expecting StringLiteral]

Explanation:

  • It seems that Hive is quite picky about which delimiters are allowed; in particular, the Hive CLI treats a literal semicolon as a statement terminator, so the CREATE TABLE statement is cut off before the parser ever reaches the delimiter string (hence "expecting StringLiteral near 'BY'").  A commonly suggested workaround is sketched below.
  • If no delimiter is specified, the default field separator is Ctrl-A (hex 01)
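
A workaround that is often suggested (not verified here) is to spell the semicolon by its octal code, the same way Ctrl-A can be written as '\001', so that the CLI never sees a literal semicolon inside the delimiter string:

       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY '\073'    -- octal 073 is the semicolon character
       STORED AS TEXTFILE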

 

Hive – Create Table – Native

  • We created a new file /tmp/hive/helloWorld/dbBook__CreateTable__delimited.sql
  • A delimiter is not specified, so we assume the default of Ctrl-A (hex 01)

      drop table booknative;

      create table booknative
      (
           author      string
         , title        string
         , edition      string
         , genre        string
         , price        decimal
         , publishDate  timestamp
         , description  string
         , haveit       binary
         , acquireDate  timestamp

    )

    ROW FORMAT DELIMITED
    STORED AS TEXTFILE 
    ;

    quit;

Replace original tilde (~) with CTRL-A

Using sed or tr, replace the original tilde (~) with Ctrl-A (hex 01).

In this case we chose to use tr:


cat dbBook_Data_in.txt | tr '~' $'\x01' > dbBook_Data_in_native.txt

Explanation:

  • The original file name is dbBook_Data_in.txt
  • tr replaces every ~ with \x01 (Ctrl-A)
  • The resulting file name is dbBook_Data_in_native.txt
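
If sed is preferred over tr, a GNU sed equivalent (assuming a GNU sed that understands \xNN escapes) would be:

    # GNU sed: replace every tilde with the Ctrl-A (hex 01) character
    sed 's/~/\x01/g' dbBook_Data_in.txt > dbBook_Data_in_native.txt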

Load data into Hive Table

Load the data, overwriting any existing contents:

hive -e "load data local inpath '/tmp/hive/helloWorld/dbBook_Data_in_native.txt' overwrite into table booknative;"

Validate data

Validate the data:


hive -e "select * from booknative;"
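
A quick row-count sanity check (on classic Hive this launches a MapReduce job) could be:

    hive -e "select count(*) from booknative;"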

