C#, .Net, Sql Server, Salesforce, Dynamics CRM etc,
Charity and Not-for-Profit sector

Sunday, January 25, 2015

Automating Salesforce Data Imports with Powershell

If you're a programmer and you're doing lots of Salesforce Data Imports, at some point you'll get fed up of messing around with Apex Data Loader and CSV files and look for better options. You might be doing incremental migration and so want to re-run and evolve the migration as the target system evolves. Or you might just want to get the migration automated for scheduling reasons.

There are various 'integration' systems that could help with this: JitterBit, Pentaho, Talend, and so on. But if you're fond of writing code you might want to work with something a bit more low-level, where you can write specific code to get the exact transformations you want without learning the vagaries of a new integration package.

Salesforce has a Bulk Data API, so you might be tempted to look at that. But it's quite heavy-duty: if you try to use it you'll soon have to deal with batches and async callbacks, and you'll essentially end up re-creating quite a lot of the code of the Apex Data Loader. The Bulk Data API is really meant for people writing generalised integration tools like those mentioned above, rather than people doing specific migrations into a specific system.

But the Apex Data Loader itself can be automated, so one way to get the precise control of 'code' while importing into Salesforce is to drive the Apex Data Loader from Powershell.

The first part of this post deals with configuring the Data Loader for Automation, then the second part deals with the Powershell angle.

Setting Up Automation with the Data Loader

When you automate the Data Loader lots of files are needed and generated, so it helps to have a good folder structure. Here's what I used:

  • Data is where the source data (CSV) files will end up
  • DataLoaderConfig is where the two main config files for the data loader will be stored
  • Logs is where the Data Loader will output its logs to
  • Maps is where the mapping files for each CSV/object will be stored
  • The root folder (ExampleMigration in this case) will hold the Powershell scripts

Data Loader Config files

The Data Loader needs two files. The simpler of the two is config.properties, and it just contains the login info and default settings that you want to use:

# Data Loader Config
# defaults for all runs

# login settings:
# password encrypted by C:\Program Files\\Data Loader\bin\encrypt.bat 
# run encrypt -e passwordandtoken 
sfdc.endpoint=https://test.salesforce.com
sfdc.username=someuser@example.com.sandbox
sfdc.password=(encrypted password string - see below)

# other settings
process.useEuropeanDates=true
process.lastRunOutput=false

# for UTF8 files:
#dataAccess.readUTF8=true
The endpoint and username settings are as with all uses of the Salesforce API - use https://test.salesforce.com for sandboxes and https://login.salesforce.com for live. The password needs to be encrypted by a command line tool that comes with the Data Loader. This can be done as follows:
  • On the server go to Start/Run and type cmd to get the dos prompt
  • type "cd C:\Program Files\\Data Loader\bin"
  • type "encrypt -e (password)(securitytoken)"
    where password and securitytoken are concatenated together, e.g.
    "encrypt -e tomato2FGTN3RE26yP0GYdfXD3A4Ln"
  • The command will output an encrypted password string.
  • To get that from the DOS window, right-click, choose "Select All" and then "Copy", then paste the result into Notepad
  • In Notepad you will then have the text of the whole DOS session - extract the encrypted password string and paste it into the config file as above
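Since we're automating everything else, this copy-paste-from-DOS step can itself be scripted: Powershell can run encrypt.bat and capture its output directly. This is just a rough sketch - the install path and password/token are examples, and it assumes the encrypted string appears on the last non-blank line of the tool's output:

```powershell
# sketch: capture the output of encrypt.bat instead of copy-pasting from the DOS window
# NB: adjust this path to wherever your Data Loader is installed
$encryptBat = "C:\Program Files (x86)\salesforce.com\Data Loader\bin\encrypt.bat"

# password and security token concatenated, as before (example values)
$output = & $encryptBat -e "tomato2FGTN3RE26yP0GYdfXD3A4Ln"

# assumption: the encrypted string is the last non-blank line of the output
$encrypted = ($output | Where-Object { $_ -match '\S' } | Select-Object -Last 1).Trim()
$encrypted
```

You can then paste $encrypted into config.properties, or write it there from the same script.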
 Most of the other settings are fairly obvious, except perhaps:
  •  process.useEuropeanDates will read dates from the CSV in dd/MM/yyyy format instead of USA MM/dd/yyyy
  •  process.lastRunOutput = false prevents the Data Loader from creating temporary files that it uses to track last run dates etc. 
  • dataAccess.readUTF8 (commented out above) can be used if you want to have UTF8 unicode chars in your CSV file. See also notes about generating UTF8 CSV files below.
The other config file is process-conf.xml, and it is used to keep info about the actual 'jobs' that you want to automate, each job being given a distinct name. It is based on the Java Spring framework and looks like this:


<!-- NB, endpoint and login details set in config.properties -->
<beans>
   <bean id="myJobName" class="com.salesforce.dataloader.process.ProcessRunner" singleton="false">
        <property name="name" value="myJobName"/>
        <property name="configOverrideMap">
            <map>
                <entry key="sfdc.entity" value="MyCustomObject__c"/>
                <entry key="sfdc.externalIdField" value="Something_Id__c"/>
                <entry key="process.operation" value="upsert"/>
                <entry key="process.mappingFile" value="C:\ExampleMigration\Maps\map_mycustomobject__c.sdl"/>
                <entry key="process.outputError"   value="C:\ExampleMigration\Logs\error_myJobName.csv"/>
                <entry key="process.outputSuccess" value="C:\ExampleMigration\Logs\success_myJobName.csv"/>
                <entry key="dataAccess.name" value="C:\ExampleMigration\Data\mycustomobject__c.csv" /> 
                <entry key="dataAccess.type" value="csvRead" /> 
            </map>
        </property>
   </bean>
   <bean id="myOtherJobName" class="com.salesforce.dataloader.process.ProcessRunner" singleton="false">
        <property name="name" value="myOtherJobName"/>
        <property name="configOverrideMap">
            <map>
                <entry key="sfdc.entity" value="MyOtherCustomObject__c"/>
                <entry key="sfdc.externalIdField" value="SomethingElse_Id__c"/>
                <entry key="process.operation" value="upsert"/>
                <entry key="process.mappingFile" value="C:\ExampleMigration\Maps\map_myothercustomobject__c.sdl"/>
                <entry key="process.outputError"   value="C:\ExampleMigration\Logs\error_myOtherJobName.csv"/>
                <entry key="process.outputSuccess" value="C:\ExampleMigration\Logs\success_myOtherJobName.csv"/>
                <entry key="dataAccess.name" value="C:\ExampleMigration\Data\myothercustomobject__c.csv" /> 
                <entry key="dataAccess.type" value="csvRead" /> 
            </map>
        </property>
   </bean>
</beans>

Both jobs in this example are upserts. Each job is identified by its name (note that you have to give the name twice - once as the bean id and once in the 'name' property). Quite a lot of it is boilerplate; it's only the settings within the <map> tag that are actually important.

  • sfdc.entity: the entity you are importing into
  • sfdc.externalIdField: if you are doing an upsert, the name of the external id field to use
  • process.operation: upsert, insert, update or extract
  • process.mappingFile: the location of an .sdl file (see below) that specifies the exact mapping from CSV fields to Salesforce fields
  • dataAccess.name: location of the source CSV file
  • dataAccess.type: csvRead when reading a file for import, csvWrite if exporting
  • process.outputError and process.outputSuccess: location of the two CSV files that Data Loader creates to document successes and failures
Note that if you are trying to export data instead of import, the <map> section looks more like this:

        <entry key="sfdc.entity" value="user"/>
        <entry key="process.operation" value="extract"/>
 <entry key="sfdc.extractionSOQL" value="Select id, username, name, email, profileID, IsActive From user" />
        <entry key="dataAccess.name" value="C:\ExampleMigration\Data\extract_user.csv" /> 
        <entry key="dataAccess.type" value="csvWrite" /> 

Mapping Files

For each CSV file you import, you'll need a mapping file.

You can generate mapping files by running the actual Data Loader, creating a mapping in the UI and saving it as a file. However, the format is very simple, so you can also write them yourself if you want.

example mapping (.sdl) file

#Mapping values
SOMETHING_ID=Something_Id__c
NAME=Name
CONTACT_ID=Contact__r\:Contact_Id__c

Basically you put the CSV column names on the left and the Salesforce field names on the right (the column and field names here are just examples).

If you are setting a relationship using an external ID on the other object, then the format to use is:

CSVCOLUMN=RelationshipName__r\:ExternalIdField__c

(the colon is escaped with a backslash because .sdl files use Java properties syntax). See the Contact__c example above.

If there is something wrong with your mapping file, you will get error messages but probably not very intelligible ones. So watch out for trailing spaces or commas, and note that the Salesforce field names (on the right of the equals sign) are case sensitive.
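Those trailing spaces and commas are easy to miss by eye, so a quick sanity check can save some head-scratching. Here's a sketch (assuming the folder structure from above) that scans the Maps folder for suspect lines:

```powershell
# flag lines ending in spaces or commas in all .sdl mapping files
Get-ChildItem "C:\ExampleMigration\Maps\*.sdl" | ForEach-Object {
    $mapFile = $_.FullName
    $lineNumber = 0
    Get-Content $mapFile | ForEach-Object {
        $lineNumber++
        if ($_ -match '[ ,]+$') {
            Write-Warning ("{0} line {1} has a trailing space or comma" -f $mapFile, $lineNumber)
        }
    }
}
```

It won't catch wrong case in field names (only Salesforce knows that), but it kills off the most common silent failures.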

Running the Data Loader from Powershell

So if you have everything set up as above (you've sorted out your folder structure, done the two config files, encrypted your password, created a mapping file, put your CSV file in the right place, set up some jobs in process-config.xml and given them names) you can now run the Data Loader from powershell.

If you haven't used Powershell before: Powershell usually comes installed on any recent Windows machine, but initially it is (usually) set to only run signed scripts as a security measure. So before running any scripts of your own you need to relax that restriction a bit by telling it that it's OK to run local scripts that aren't signed. This can be done by:
  • Starting "Windows PowerShell ISE"
  • executing this command: "Set-ExecutionPolicy RemoteSigned"
The Data Loader itself can be run from the CMD command line by using the process.bat script that comes with it.  This is well documented elsewhere. To run the Data Loader from Powershell, we need to execute that same process.bat script. This can be done as follows:

$dataloader = "C:\Program Files\\Data Loader\bin\process.bat"
$dlconfig = "C:\ExampleMigration\DataLoaderConfig"
set-location "C:\Program Files\\Data Loader\bin\"

& $dataloader $dlconfig myJobName

Line by line, what this does is:
  • put the location of process.bat into the $dataloader variable
  • put the folder with the config files (see above) into the $dlconfig variable
  • set the working directory to the data loader bin folder
  • the ampersand on the fourth line tells powershell to execute the following command. The command that follows is process.bat and two command line arguments - the config location and the name of the job to run
Note that 'myJobName' must be properly defined in process-conf.xml (see above)
Also note that on 64-bit systems you'll need to change 'Program Files' to 'Program Files (x86)' in the file paths.

If you put the above into a script in PowerShell ISE and then run it you should then see a stream of output from the Data Loader as it runs (it is very verbose in its log output). To see how the import went, go and look at the success and error .csv log files that get created.

So why bother running the Data Loader from Powershell when all we're doing is running an old batch file anyway? Well the point is, now we are in the Powershell environment we can also do a load of other things, like automatically checking for import errors, and pulling SQL data into CSV files.

Powershell: Checking the error log output

After each Data Loader run, success and error csv files are created. There'll be a log entry in the powershell output telling you how many rows failed, but there's so much other output that it gets a bit drowned out. Here's a Powershell function that checks the error.csv file to see if there's anything in it, and if there is, prints a red warning in the powershell output:

# function to check the CSV files generated by data loader for errors
function CheckErrorFile {

    Param ([string]$perrorfilepath)

    $csvlines = (Get-Content $perrorfilepath | Measure-Object -Line).Lines
    # header is always 1 row so we look for > 1 rows
    if ($csvlines -gt 1) {
        $errorCount = $csvlines - 1
        $errorMessage = [string]::Format("** errors ** {0} failed rows in {1}", $errorCount, $perrorfilepath)
        # this will print a red error message but not stop execution
        Write-Error $errorMessage
    }
}

To use it, add it to the top of your Powershell script and then call it after each run (or at the end of a series of runs) like this:

$dataloader = "C:\Program Files\\Data Loader\bin\process.bat"
$dlconfig = "C:\ExampleMigration\DataLoaderConfig"
set-location "C:\Program Files\\Data Loader\bin\"
& $dataloader $dlconfig myJobName

$dlLogFolder = "C:\ExampleMigration\Logs\" 
CheckErrorFile ("{0}error_myJobName.csv" -f $dlLogFolder)
Note that you have to tell it exactly where to look for the CSV error file, because the location of it is determined by what you previously set up in process-conf.xml.

Powershell: Defining functions in a separate file

In Powershell you can define a function then call it in the same file but that quickly gets annoying. You'll probably want to share functions between multiple scripts, etc.

A quick way to share functions in Powershell is to define all your commonly-used functions in one powershell file, let's call it 'CommonFunctions.ps1'. You can then 'include' that file in other files by using the Powershell 'dot include' (dot-sourcing) syntax:

. C:\ExampleMigration\CommonFunctions.ps1

# we can then call functions in this script ...

Note that powershell files that are 'dot included' in this way can also add variables into your script as well as functions. One use for that is to define connection strings in one script and then 'dot include' that script elsewhere so you don't have to keep defining the same variables.
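So for this migration, a minimal CommonFunctions.ps1 might look something like this - the shared variables plus the functions from this post (the Get-ErrorFilePath helper is just a hypothetical example of the sort of convenience function you might add):

```powershell
# CommonFunctions.ps1 - functions and variables shared by the migration scripts

# shared variables, visible to any script that dot-includes this file
$connectionString = "Data Source=SomeSqlServer;Initial Catalog=SomeDatabase;User Id=SqlUsername;Password=SqlPassword;"
$dlLogFolder = "C:\ExampleMigration\Logs\"

# hypothetical helper: build the error file path for a job by convention,
# matching the process.outputError entries in process-conf.xml
function Get-ErrorFilePath {
    Param ([string]$pjobName)
    return ("{0}error_{1}.csv" -f $dlLogFolder, $pjobName)
}

# CheckErrorFile, SqlToCsv etc. from this post would also be defined here
```

Any script that dot-includes this file can then call CheckErrorFile (Get-ErrorFilePath "myJobName") without repeating the folder paths.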

Powershell: SQL to CSV to Data Loader

CSV files are a necessary evil, but it's likely that the real data you're after is in SQL and the CSV is just a stepping stone. It's possible to get the Apex Data Loader to connect directly to a database, but frankly setting that up looks like a bit of a Java/JDBC/XML nightmare.

With Powershell it's pretty easy to go from SQL Server to CSV. This example uses ADO.NET to fetch the data, so it could probably be adapted easily for databases other than MS Sql:

# function to execute query and write results to CSV
function SqlToCsv {

    Param ([string]$pconnectionString, [string]$pquery, [string]$pextractFolder, [string]$pextractFile)

    $extractPath = [string]::Format("{0}{1}", $pextractFolder, $pextractFile)

    $connection = New-Object System.Data.SqlClient.SqlConnection
    $connection.ConnectionString = $pconnectionString

    $command = New-Object System.Data.SqlClient.SqlCommand
    $command.CommandText = $pquery
    $command.Connection = $connection

    $SqlAdapter = New-Object System.Data.SqlClient.SqlDataAdapter
    $SqlAdapter.SelectCommand = $command
    $DataSet = New-Object System.Data.DataSet

    # run the query and fill the dataset (Fill opens and closes the connection itself)
    $SqlAdapter.Fill($DataSet) | Out-Null

    # dump the data to a csv
    $DataSet.Tables[0] | Export-Csv $extractPath -NoTypeInformation
}

You call it like this:
$connectionString = "Data Source=SomeSqlServer;Initial Catalog=SomeDatabase;User Id=SqlUsername;Password=SqlPassword;"
$extractFolder = "C:\ExampleMigration\Data\"

SqlToCsv $connectionString  "select * from some_table "   $extractFolder  "mycustomobject__c.csv"

You'll want to set the $extractFolder and CSV file name to match the file that the Data Loader is looking for as defined in the process-conf.xml (see above).

Note that the CSV files produced are pretty robust (e.g. all strings are enclosed in speechmarks, speechmarks within the strings are properly escaped, etc).

If you're dealing with unicode data then you'll want to create a UTF8 CSV file; this can be done by modifying the last line of the SqlToCsv function above, like so:

    $DataSet.Tables[0] | Export-Csv $extractPath -NoTypeInformation -Encoding "UTF8"

In this scenario you'll also want to set "dataAccess.readUTF8=true" in the Data Loader config.properties file - see above.

Everything so far, again

So here's a full example tying all the Powershell stuff together.

It assumes that:
  • you've set up your config files and mapping files as described above.
  • you've defined a file called "C:\ExampleMigration\CommonFunctions.ps1" that contains the SqlToCsv and CheckErrorFile functions above
# include our functions 
. C:\ExampleMigration\CommonFunctions.ps1

# set up locations/variables for data loader
$dataloader = "C:\Program Files\\Data Loader\bin\process.bat"
$dlconfig = "C:\ExampleMigration\DataLoaderConfig"
$dlLogFolder = "C:\ExampleMigration\Logs\" 
set-location "C:\Program Files\\Data Loader\bin\"

# set up locations/connection strings for SQL to CSV
$connectionString = "Data Source=SomeSqlServer;Initial Catalog=SomeDatabase;User Id=SqlUsername;Password=SqlPassword;"
$extractFolder = "C:\ExampleMigration\Data\"

# get the SQL data into CSV
SqlToCsv $connectionString  "select * from some_table "   $extractFolder  "mycustomobject__c.csv"

# execute the data loader
& $dataloader $dlconfig myJobName

# check for errors in the error.csv file
CheckErrorFile ("{0}error_myJobName.csv" -f $dlLogFolder) 
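One of the original motivations was scheduling, and once everything is in a single script like the one above, that part is easy: Windows Task Scheduler (or any other scheduler) can invoke powershell.exe directly. The script name here is a hypothetical example:

```powershell
# example scheduled-task command line (script name is hypothetical)
# -ExecutionPolicy Bypass means the task doesn't depend on the machine-wide policy
powershell.exe -NoProfile -ExecutionPolicy Bypass -File "C:\ExampleMigration\RunMigration.ps1"
```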


It's not particularly pretty, and yes, there are a lot of folder locations that need to be defined, but if you go down this route, the advantages are:
  • Free (assuming you have a windows box to hand)
  • Reasonably efficient because the Apex Data Loader can use the Bulk Data API
  • Very fine control over the mappings and external IDs used to import data
  • You can do all your transformations using SQL instead of fiddly visual box things in some integration package
  • Powershell will let you do a multitude of other things if needed
  • Everything is either a config file, a SQL file or a powershell file and so can be checked in to source control
