Skip to main content

Reading a CSV file into a DataTable

If you don't share my opinion that CSV should be banned by an international convention, it might be because you've never had to write code to read a CSV file.

"Ah, but thats easy", you say, "Just use String.Split()". Well, that will work if your CSV file is an academic example that doesn't contain strings that were typed by real human beings. But real-world CSV files generally contain some or all of the following horrors:

- some strings enclosed in speechmarks, but some not
- empty fields
- speechmarks within strings
- commas within strings
- carriage returns within strings

To demonstrate, here is a CSV example that exhibits all of the traits listed above:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
Daisy,Smith,"Flat 6,
Porchester Court,
24 Dragon Road",Newcastle,NE3 4FG
(Note that the last line has two carriage returns in the Address, which is why it sprawls over three lines. Also speechmarks within speechmark-delimited strings have to be escaped somehow; in this example they are escaped by doubling them up.)

It is possible to deal with pathological cases of CSV using carefully tuned Regular Expressions. It is also possible to use an OdbcConnection and the Microsoft Text ODBC Driver.

But needing to read a CSV file is a very common affliction - surely someone has built an open-source framework to deal with it reliably? Surprisingly, such frameworks are quite hard to come by, but I did eventually find a brilliant one:

LumenWorks.Framework.IO.Csv by Sebastien Lorien

Downloadable from The Code Project, here:
A Fast CSV Reader by Sebastien Lorien

You can download the binaries or the source, its much faster than Regex or ODBC, and it can handle cases like the mangled CSV snippet shown above. It is also open-source, under the MIT Licence.

The design is based on the StreamReader class, and Sebastien gives a few example on the CodeProject page. However the examples mostly show how to bind a CSV straight to a DataGrid. So here is an example of how to use the library to read a CSV file into a DataTable:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;

using System.IO;
using LumenWorks.Framework.IO.Csv;

namespace CSVTest
{
    class LumenWorksCSVExample
    {

        /// <summary>
        /// Example code to fetch a CSV file and turn it into a DataTable
        /// using the LumenWorks.Framework.IO.Csv library
        /// </summary>
        /// <param name="filePath">The full path of the CSV file</param>
        /// <returns>A DataTable with all-string columns</returns>
        public DataTable FetchFromCSVFile(string filePath)
        {
            DataTable csvTable = new DataTable();
            using (CsvReader csvReader =
                new CsvReader(new StreamReader(filePath),true))
            {
                csvTable.Load(csvReader);
            }
            return csvTable;
        }
    }
}
So basically, you just use DataTable.Load(). Note that this example assumes that the CSV file has a header line, and that nothing is known about the column types. Hence all columns are stored in the DataTable as strings.

If Load() is a bit too automatic and you need more control over the loading of the rows, maybe to only get certain rows or columns, or to do column type conversions, this longer example might be more useful:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;

using System.IO;
using LumenWorks.Framework.IO.Csv;

namespace CSVTest
{
    class LumenWorksCSVExample2
    {
        /// <summary>
        /// Example code to fetch a CSV file and turn it into a DataTable
        /// using the LumenWorks.Framework.IO.Csv library
        /// Note: This example uses loops instead of DataTable.Load()
        /// </summary>
        /// <param name="filePath">The full path of the CSV file</param>
        /// <returns>A DataTable with all-string columns</returns>
        public DataTable FetchFromCSVFileLong(string filePath)
        {
            char delimiter = ',';
            bool hasHeader = true;
            DataTable csvTable = new DataTable();
            using (CsvReader csvReader =
                new CsvReader(new StreamReader(filePath), hasHeader, delimiter))
            {

                int fieldCount = csvReader.FieldCount;
                string[] headers = csvReader.GetFieldHeaders();

                // this bit could be modified to fine-tune the columns
                foreach (string headerLabel in headers)
                    csvTable.Columns.Add(headerLabel, typeof(string));

                while (csvReader.ReadNextRecord())
                {
                    DataRow newRow = csvTable.NewRow();
                    // this bit could be modified to do type conversions, skip columns, etc
                    for (int i = 0; i < fieldCount; i++)
                        newRow[i] = csvReader[i];

                    csvTable.Rows.Add(newRow);
                    
                }
            }
            return csvTable;
        }
    }
}
Of course, on its own this second example is pretty pointless, but you might be able to use it as a baseline to develop something a bit more tailored.

Comments

Popular posts from this blog

Copying data to Salesforce Sandboxes using TalenD

A common problem with Salesforce Developer Sandboxes is that they are blank. Really you're going to want some data in there, so there are various strategies for copying data from your live instance to the Sandbox. There are some paid-for solutions - SFXOrgData , Salesforce Partial Data Sandboxes - but if you've got a decent ETL tool you can build your own. There are a bunch of free ETL tools for Salesforce: JitterBit Data Loader is good for quick ad-hoc tasks but the free version makes it difficult to manage specific ETL projects or share projects with other users Pentaho Community Edition - an open source edition of the enterprise version Apatar was a free open source Salesforce ETL which still works but development seems to have stopped since 2011 TalenD Open Studio is an open source ETL tool For the task of copying data from live to a Sandbox, either Pentaho or TalenD Open Studio could be used, depending on preference. Here's a good comparison of the dif

SSRS multi-value parameters with less fail

SSRS supports multi-value parameters, which is nice, but there are a few issues with them. This is how I deal with them. Two of the problems with SSRS multi-value parameters are: You have to jump through a few hoops to get them to work with stored procedures The (Select All) option, as shown above The reason the (Select All) option is a problem is that it is a really inelegant way of saying 'this parameter does not matter to me'. If you have a list with hundreds of values, passing all of them as a default option just seems wrong. Also, if your report shows the user which items they selected, printing the whole list when they choose (Select All) is excessive. So in this post I'm going to show my particular way of: Jumping through the hoops to get Multi-Value params in a stored procedure Adding a single '--All--' value that the report interprets as meaning all the options. Getting Multi-Value params to work with Stored Procedures This is

Remote Desktop on High DPI screens

Scott Hanselman wrote a nice blog post back in January about some of the issues you might face running Windows on a High DPI screen like that of a Surface Pro or Lenova Yoga. I'm kindof mystified that he didn't mention Remote Desktop though because thats been the number one problem for me on High DPI screens. That said, if you remote into very recent Server OS's like Windows Server 2012 R2, then apparently Remote Desktop will sort out the DPI scaling automatically. Perhaps Scott hadn't noticed the Remote Desktop issue because he only remotes into Server 2012 R2. Certainly if I was Scott, I wouldn't remote into anything less than that. But, in practice, I regularly have to remote into Server 2008 machines and yes even Server 2003. If you do that from a high DPI screen, the remote desktop is rendered at regular pixel size, which makes everything tiny. Its hard to show screenshots of High DPI screens that correctly convey the pixel size, so I've photoshoppe