code-u-like

C#, .Net, Sql Server, Salesforce, Dynamics CRM etc,
Charity and Not-for-Profit sector

Tuesday, May 19, 2009

Reading a CSV file into a DataTable

If you don't share my opinion that CSV should be banned by an international convention, it might be because you've never had to write code to read a CSV file.

"Ah, but thats easy", you say, "Just use String.Split()". Well, that will work if your CSV file is an academic example that doesn't contain strings that were typed by real human beings. But real-world CSV files generally contain some or all of the following horrors:

- some strings enclosed in speechmarks, but some not
- empty fields
- speechmarks within strings
- commas within strings
- carriage returns within strings

To demonstrate, here is a CSV example that exhibits all of the traits listed above:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
Daisy,Smith,"Flat 6,
Porchester Court,
24 Dragon Road",Newcastle,NE3 4FG
(Note that the last line has two carriage returns in the Address, which is why it sprawls over three lines. Also speechmarks within speechmark-delimited strings have to be escaped somehow; in this example they are escaped by doubling them up.)

It is possible to deal with pathological cases of CSV using carefully tuned Regular Expressions. It is also possible to use an OdbcConnection and the Microsoft Text ODBC Driver.

But needing to read a CSV file is a very common affliction - surely someone has built an open-source framework to deal with it reliably? Surprisingly, such frameworks are quite hard to come by, but I did eventually find a brilliant one:

LumenWorks.Framework.IO.Csv by Sebastien Lorien

Downloadable from The Code Project, here:
A Fast CSV Reader by Sebastien Lorien

You can download the binaries or the source, its much faster than Regex or ODBC, and it can handle cases like the mangled CSV snippet shown above. It is also open-source, under the MIT Licence.

The design is based on the StreamReader class, and Sebastien gives a few example on the CodeProject page. However the examples mostly show how to bind a CSV straight to a DataGrid. So here is an example of how to use the library to read a CSV file into a DataTable:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;

using System.IO;
using LumenWorks.Framework.IO.Csv;

namespace CSVTest
{
    class LumenWorksCSVExample
    {

        /// <summary>
        /// Example code to fetch a CSV file and turn it into a DataTable
        /// using the LumenWorks.Framework.IO.Csv library
        /// </summary>
        /// <param name="filePath">The full path of the CSV file</param>
        /// <returns>A DataTable with all-string columns</returns>
        public DataTable FetchFromCSVFile(string filePath)
        {
            DataTable csvTable = new DataTable();
            using (CsvReader csvReader =
                new CsvReader(new StreamReader(filePath),true))
            {
                csvTable.Load(csvReader);
            }
            return csvTable;
        }
    }
}
So basically, you just use DataTable.Load(). Note that this example assumes that the CSV file has a header line, and that nothing is known about the column types. Hence all columns are stored in the DataTable as strings.

If Load() is a bit too automatic and you need more control over the loading of the rows, maybe to only get certain rows or columns, or to do column type conversions, this longer example might be more useful:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;

using System.IO;
using LumenWorks.Framework.IO.Csv;

namespace CSVTest
{
    class LumenWorksCSVExample2
    {
        /// <summary>
        /// Example code to fetch a CSV file and turn it into a DataTable
        /// using the LumenWorks.Framework.IO.Csv library
        /// Note: This example uses loops instead of DataTable.Load()
        /// </summary>
        /// <param name="filePath">The full path of the CSV file</param>
        /// <returns>A DataTable with all-string columns</returns>
        public DataTable FetchFromCSVFileLong(string filePath)
        {
            char delimiter = ',';
            bool hasHeader = true;
            DataTable csvTable = new DataTable();
            using (CsvReader csvReader =
                new CsvReader(new StreamReader(filePath), hasHeader, delimiter))
            {

                int fieldCount = csvReader.FieldCount;
                string[] headers = csvReader.GetFieldHeaders();

                // this bit could be modified to fine-tune the columns
                foreach (string headerLabel in headers)
                    csvTable.Columns.Add(headerLabel, typeof(string));

                while (csvReader.ReadNextRecord())
                {
                    DataRow newRow = csvTable.NewRow();
                    // this bit could be modified to do type conversions, skip columns, etc
                    for (int i = 0; i < fieldCount; i++)
                        newRow[i] = csvReader[i];

                    csvTable.Rows.Add(newRow);
                    
                }
            }
            return csvTable;
        }
    }
}
Of course, on its own this second example is pretty pointless, but you might be able to use it as a baseline to develop something a bit more tailored.

Labels: ,

Tuesday, May 12, 2009

StackOverflow


I've spent the last couple of weeks hanging out on StackOverflow, and thinking about how successful its design is.

Incase you haven't heard of it, StackOverflow is a question-and-answer site for programmers, put together by two popular programming bloggers, Joel Spolsky and Jeff Atwood. Participants get given 'reputation points' when their questions or answers are voted-up by other people. This makes the site very useful because when you ask a question there are loads of programmers online practically falling over each other to answer your question. I guess programmers all like to get positive feedback from their peers, and thats what the vote system provides.

The reputation system is just one design point out of several that Joel and Jeff obviously thought very carfully about when building the site. A few weeks ago Joel gave a googletech talk about StackOverflow, going into detail about how they designed it:



He makes the interesting point in this video that Human-Computer interaction is now quite well understood, and that Human-to-Human interaction on the web is a more challenging area now. As in the real world, the environment affects people's behaviour, and so the UI choices made on a website will greatly influence how people use the site.

He says that Microsoft estimate that there are about 9 million professional programmers in the world, and as of April 2009 StackOverflow is getting about 3 million distinct visitors a month. So just 6 months after launch they have 33% of their total possible audience in attendance.

Towards the end of the talk he also has some interesting observations on why the StackOvervlow approach might not translate well to non-technical domains. He says
"I dont think this site is going to work for gardening questions, ever". For example, because they could take a certain amount of technical know-how for granted in their userbase, they were able to make great use of features like tagging without having to explain it to anyone.

The video is about an hour long but worth watching if you want some insight into the design of a great site.

My username on StackOverflow is codeulike, and today I achieved the little milestone of 1,000 reputation points, which also inspired me to make this post ; )