Microsoft – SQL Server – Bulk Copying Data Using an XML Template File

Background

Preparing to perform a bit of load testing against MS SQL Server, we downloaded Spawner, an MS Windows desktop application that allows one to generate test data.

We were able to generate about 1 million records.

 

Generate test data

 

Spawner

Here is what the Spawner desktop looks like:

(Screenshot: the Spawner desktop application)

 

 

Create Table


use [DBLabInMemory]
go

set noexec off
go


/*
 drop table [dbo].[personTraditionalClustered]
*/
/* If the table already exists, turn noexec on so the create table
   batch below is parsed but not executed */
if object_id('[dbo].[personTraditionalClustered]') is not null
begin
 set noexec on
end
go



create table [dbo].[personTraditionalClustered]
(

 [firstname] [varchar](40) 
      COLLATE Latin1_General_100_BIN2 NOT NULL

 , [lastname] [varchar](40) 
      COLLATE Latin1_General_100_BIN2 NOT NULL

 , [emailAddress] [varchar](40) 
      COLLATE Latin1_General_100_BIN2 NOT NULL

 , [phoneNumber] [varchar](40) 
      COLLATE Latin1_General_100_BIN2 NOT NULL

 , [addressStreet] [varchar](100) 
      COLLATE Latin1_General_100_BIN2 NOT NULL

 , [addressCity] [varchar](40) 
      COLLATE Latin1_General_100_BIN2 NOT NULL

 , [addressState] [varchar](40) 
      COLLATE Latin1_General_100_BIN2 NOT NULL

 , [addressPostalCode] [varchar](40) 
      COLLATE Latin1_General_100_BIN2 NOT NULL

 , [dateofBirth] datetime null


 , [uniqueID] uniqueIdentifier not null
     constraint defaultPersonTraditionalUniqueID 
        default newid() 



 , constraint PK_PersonTraditional
     primary key ([uniqueID])


 , index idx_Firstname nonclustered
     ([firstname])

 , index idx_Lastname nonclustered
     ([lastname])

 , index idx_AddressCity nonclustered
     ([addressCity])


 , index idx_AddressState nonclustered
     ([addressState])

 , index idx_AddressPostalCode nonclustered
     ([addressPostalCode])

 , index idx_EmailAddress nonclustered
     ([emailAddress])

 , index idx_DateofBirth nonclustered
     ([dateofBirth])
 

)
go

set noexec off
go

Recommendation

As a recommendation, I suggest that one follow this strategy:

  • Create the table
  • Insert test data into the table
  • Create a template (format) file
  • Customize the template file (separators, column ordering, number of columns)
  • Export a few trial records
  • Experiment with importing data

 

Create Table

Create the database table; in our case, we created the table earlier in this post.

Insert test data

Inserting test data is useful because it lets us confirm that we have the correct data types, column lengths, and nullability.

 

Create XML Template file

Syntax

bcp <table> format nul -x -c -f <format-file> -S <server-name> -d <database-name> -T -q

 

Sample

bcp dbo.personTraditionalClustered format nul -x -c 
   -f targetFileTemplate.xml -S (local) -d DBLabInMemory -T -q 

This is sufficient when the incoming data file contains all of the columns in the destination table, in the same order.  If some columns are missing, or they appear in a different order, I suggest opening the generated format file in an editor and adjusting it.

The error message we are trying to avoid is shown below.

When SQL Server expects more columns than the data file supplies:

Unexpected EOF encountered in BCP data-file
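This kind of column-count mismatch can often be caught before running bcp by tallying the field count per line of the data file. A quick sketch (the file name datafile.txt is hypothetical, and we assume a tab-delimited file):

```shell
# Print the distinct field counts found in the file; a clean, uniform
# file shows exactly one value.
awk -F'\t' '{ print NF }' datafile.txt | sort -nu
```

If more than one count appears, some rows are short or long, and bcp will likely fail partway through the file.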

 

Customize format file

Here is what our format file looks like:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
 <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="100" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="6" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="8" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="9" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="24"/>
 <FIELD ID="10" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="37"/>
 </RECORD>
 <ROW>
 <COLUMN SOURCE="1" NAME="firstname" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="2" NAME="lastname" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="3" NAME="emailAddress" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="4" NAME="phoneNumber" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="5" NAME="addressStreet" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="6" NAME="addressCity" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="7" NAME="addressState" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="8" NAME="addressPostalCode" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="9" NAME="dateofBirth" xsi:type="SQLDATETIME"/>
 <COLUMN SOURCE="10" NAME="uniqueID" xsi:type="SQLUNIQUEID"/>
 </ROW>
</BCPFORMAT>


 

There are a couple of changes we will be making to our template file:

  • Because our input data file does not have the uniqueID column in it, we will remove the corresponding FIELD and COLUMN lines from the generated format file (the \r\n row terminator then moves to the last remaining field).
  • We will also change our field terminator from \t (tab) to ||

Here is what our new format file looks like:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
 <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="||" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="||" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="||" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="||" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="||" MAX_LENGTH="100" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="6" xsi:type="CharTerm" TERMINATOR="||" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="7" xsi:type="CharTerm" TERMINATOR="||" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="8" xsi:type="CharTerm" TERMINATOR="||" MAX_LENGTH="40" COLLATION="Latin1_General_100_BIN2"/>
 <FIELD ID="9" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="24"/>
 </RECORD>
 <ROW>
 <COLUMN SOURCE="1" NAME="firstname" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="2" NAME="lastname" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="3" NAME="emailAddress" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="4" NAME="phoneNumber" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="5" NAME="addressStreet" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="6" NAME="addressCity" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="7" NAME="addressState" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="8" NAME="addressPostalCode" xsi:type="SQLVARYCHAR"/>
 <COLUMN SOURCE="9" NAME="dateofBirth" xsi:type="SQLDATETIME"/>
 </ROW>
</BCPFORMAT>
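The terminator change can be scripted rather than done by hand. A minimal sketch with sed, using the file names from this post (removing the uniqueID FIELD and COLUMN lines is still easiest in an editor, since the row terminator has to move as well):

```shell
# Replace the literal \t field terminators with ||; the \r\n row
# terminator does not match the pattern and is left alone.
sed 's/TERMINATOR="\\t"/TERMINATOR="||"/g' targetFileTemplate.xml > fileTemplate_v4.xml
```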


 

Export a few trial records

Let us export a few records:

bcp "select top 10 [firstname],[lastname],[emailAddress],[phoneNumber],[addressStreet],[addressCity],[addressState],[addressPostalCode],[dateofBirth] from [dbo].[personTraditionalClustered] with (nolock)" queryout extractedData_v4.txt -c -S (local) -d DBLabInMemory -T

 

In the example above, there are a couple of things to note:

  • We are only taking out some data; just the first 10 records
  • We have also restricted the columns to the ones present in our input file

 

Insert test data

Let us perform some trial runs:

sqlcmd -S (local) -d DBLabInMemory -E -Q "truncate table [dbo].[personTraditionalClustered]"

bcp [dbo].[personTraditionalClustered] in datafile_v4.txt -f fileTemplate_v4.xml -S (local) -d DBLabInMemory -e errfile.log -T -F 2 -L 4

 

  • In our sample data file, we have added a header line, so we need -F 2 to indicate that raw data starts at line 2
  • We have also added -L 4 to indicate that the import should end at line 4
  • So, in essence, we are only importing 3 lines
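The header-line situation is easy to reproduce locally. A sketch that builds a tiny ||-delimited data file with a header row (the file name and contents are made up for illustration):

```shell
# Build a sample file: one header line plus data rows, || as the
# field terminator.
{
  echo 'firstname||lastname||emailAddress'
  echo 'John||Doe||jdoe@example.com'
  echo 'Jane||Roe||jroe@example.com'
} > sample.txt

# -F 2 makes bcp skip line 1; preview what bcp would actually see:
tail -n +2 sample.txt
```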

 

Actual Run

A couple of changes we will be making in the actual run:

  • Dropped the -L 4 cap so that the entire file is imported
  • Added -b 50000 so that rows are committed in batches of 50,000

 

sqlcmd -S (local) -d DBLabInMemory -E -Q "truncate table [dbo].[personTraditionalClustered]"

bcp [dbo].[personTraditionalClustered] in datafile_v4.txt -f fileTemplate_v4.xml -S (local) -d DBLabInMemory -e errfile.log -T -F 2 -b 50000

 

Output:

Starting copy...
50000 rows sent to SQL Server. Total sent: 50000
50000 rows sent to SQL Server. Total sent: 100000
50000 rows sent to SQL Server. Total sent: 150000
50000 rows sent to SQL Server. Total sent: 200000
50000 rows sent to SQL Server. Total sent: 250000
50000 rows sent to SQL Server. Total sent: 300000
50000 rows sent to SQL Server. Total sent: 350000
50000 rows sent to SQL Server. Total sent: 400000
50000 rows sent to SQL Server. Total sent: 450000
50000 rows sent to SQL Server. Total sent: 500000
50000 rows sent to SQL Server. Total sent: 550000
50000 rows sent to SQL Server. Total sent: 600000
50000 rows sent to SQL Server. Total sent: 650000
50000 rows sent to SQL Server. Total sent: 700000
50000 rows sent to SQL Server. Total sent: 750000
50000 rows sent to SQL Server. Total sent: 800000
50000 rows sent to SQL Server. Total sent: 850000
50000 rows sent to SQL Server. Total sent: 900000
50000 rows sent to SQL Server. Total sent: 950000
50000 rows sent to SQL Server. Total sent: 1000000
50000 rows sent to SQL Server. Total sent: 1050000
50000 rows sent to SQL Server. Total sent: 1100000
50000 rows sent to SQL Server. Total sent: 1150000

1160001 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 160759 Average : (7215.78 rows per sec.)

 

Version Control

Version Control – GitHub

Files posted to GitHub @ https://github.com/DanielAdeniji/MSSSQLBCPFormatFileXML

 

We had a little hiccup, as our data file exceeded GitHub's 100 MB file size limit.
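A generic pre-check (not part of the original workflow) can flag oversized files before a push is even attempted; find's -size test works off the file's length:

```shell
# List any files larger than GitHub's 100 MB hard limit, skipping
# git's own object store.
find . -type f -size +100M -not -path './.git/*'
```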



[dadeniji@adelia XML]$ git push origin master
Counting objects: 10, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 52.08 MiB | 692 KiB/s, done.
Total 10 (delta 1), reused 0 (delta 0)
remote: error: GH001: Large files detected.
remote: error: Trace: b9a986aaae560141bf94f2213cdb5c18
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File datagenSimple.txt is 120.97 MB; this exceeds GitHub's file size limit of 100 MB
To git@github.com:DanielAdeniji/MSSSQLBCPFormatFileXML.git
! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'git@github.com:DanielAdeniji/MSSSQLBCPFormatFileXML.git'


 

To address this, we tarred the file and removed the original from our local git repository.

Tarred file

tar -zcvf datagenSimple.tar.gz datagenSimple.txt
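Before removing the original, it is worth confirming that the archive is intact and actually lists the file:

```shell
# -t lists the archive contents without extracting; a corrupt archive
# exits non-zero.
tar -tzf datagenSimple.tar.gz
```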

Remove file

git rm --cached datagenSimple.txt

 

Added file (datagenSimple.txt) to .gitignore

vi .gitignore

 

Added file (datagenSimple.tar.gz) to local git

git add datagenSimple.tar.gz

 

Commit file

git commit

 

Output:

[dadeniji@adelia XML]$ git commit -a -m 'First commit'
[master ca1d17a] First commit
2 files changed, 0 insertions(+), 1160002 deletions(-)
create mode 100644 datagenSimple.tar.gz
delete mode 100644 datagenSimple.txt

 

Push to remote repository

git push origin master

 

But the push was still trying to include the removed file.

 

Remove file from history

We used David Underhill's script (git-remove-history.sh) to permanently remove the oversized file.  BTW, the shell script is available @ http://dound.com/2009/04/git-forever-remove-files-or-folders-from-history/

 

sh /tmp/git-remove-history.sh datagenSimple.txt

Output:

[dadeniji@adelia XML]$ sh /tmp/git-remove-history.sh datagenSimple.txt
Rewrite 32420099006159bf2413542475cdc556167503d7 (1/2)rm 'datagenSimple.txt'
Rewrite ca1d17a66e273b7ceca218ff2e24392913b36d22 (2/2)
Ref 'refs/heads/master' was rewritten
Counting objects: 16, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (16/16), done.
Writing objects: 100% (16/16), done.
Total 16 (delta 5), reused 0 (delta 0)

 

Push to git

Syntax:


[dadeniji@adelia XML]$ git push origin master

Output:


[dadeniji@adelia XML]$ git push origin master
Counting objects: 12, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (12/12), 52.04 MiB | 691 KiB/s, done.
Total 12 (delta 2), reused 11 (delta 2)
remote: warning: GH001: Large files detected.
remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File datagenSimple.tar.gz is 52.02 MB; 
this is larger than GitHub's recommended maximum file size of 50 MB
To git@github.com:DanielAdeniji/MSSSQLBCPFormatFileXML.git
* [new branch] master -> master

 

References

BCP Reference

BCP Sample

Git Commands

 
