CSV Importing in C#

Reading and writing flat files was something that I had considered my bread and butter for quite some time. Many years ago, when I was still in high school and considered myself a master of Visual Basic 6, I was involved in a startup; if you could call it that, the goal of which was to set up an online review site of all things outdoors. Camping, hiking, rock climbing, etc. all included on one site. The idea being to allow people to not only get reviews of potential destinations, but to find out what was nearby at the same time. So, for example, upon selecting a potential campground, a site visitor might be able to see other activities within a certain distance. Nowadays I’m sure someone has pulled this idea off, but this isn’t a post about an outdoor activity review site, this is a post about reading CSV files. At that time, I was tasked with writing data-entry client software that could be used by other staff to input all of the information about a given location in discreet data fields; as well as a general description text area for free text. The application had to be usable offline, and then able to later synchronize their data with our database.

I hardly deserved to call myself a developer at this time, but confident in my knowledge I burst forth and wrote glorious VB6 code. I stored the data entered locally in a CSV file and all was well, until some nefarious (in my opinion) user dared to put a comma in their review. Of course this caused the database sync to fail miserably and all hope was lost. I then went about updating the clients code to perform a character replacement on any commas and put in a replacement character, which I would then scan for during the sync operation. All was fixed and glorious once again.

I tell this story because up until recently I was perfectly happy with this method of addressing what, to me, was a limitation of the CSV format. It was simple, fast, and easy to support and maintain. There are, of course, many solutions to this problem when you have full control over the file that is being written and read. You could use a different delimiter, possibly even one that the user would have a very difficult type typing with a regular keyboard; or a fixed width format – which works great as long as the client enforces the same length restrictions. But I’m wondering off topic again.

More recently, I finally encountered a project that required me to actually read CSV files that I did not create. That is, CSV files that someone might have exported from Excel, or Google Docs, or something similar. I no longer had control over the incoming format. Meaning I would have to load data that already had commas of its own within a field. But I’m getting ahead of myself.

Considering the following simple data set in Excel.
Simple excel dataset

When this data is saved at CSV file, here is what the resulting file contents look like.

Col1,Col2
bland,cake
text,is
is,good
always,especially
easy,new
to,york
import,cheesecake

This sort of data can be easily imported using logic similar to the following.

    class Program
    {
        static void Main(string[] args)
        {
            var MyObjects = new List<MyObject>();

            using (var sr = new System.IO.StreamReader(@".\csv_test1.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();
                    var fields = line.Split(',');

                    var obj = new MyObject(fields[0], fields[1]);

                    MyObjects.Add(obj);
                }

                sr.Close();
            }
        }
    }

    class MyObject
    {
        private string m_Col1;
        private string m_Col2;
        public string Col1 { get { return m_Col1; } }
        public string Col2 { get { return m_Col2; } }

        public MyObject(string col1, string col2)
        {
            this.m_Col1 = col1;
            this.m_Col2 = col2;
        }
    }

This is fairly similar to how I would have handled reading a CSV file in the past. Of course, having the benefit of nice clean data certainly helps. Lets look at another example.
A more complicated excel dataset

Saving this dataset as a CSV file and looking at the contents presents a somewhat different picture from before.

Col1,Col2
"2,500",cake
"$2,500.00 ",is
"""twenty-five hundred""",good
"I really like "" marks",especially
"I REALLY like """""" marks",new
"spreading "" them "" around "" here",york
"fake cell ending "", mid cell",cheesecake
"this cell
spans 2 lines",with strawberries

This demonstrates the finer details of what is acceptable in a CSV.

  • If the field contains no commas or quotation marks there’s nothing special that happens; the text is simply placed directly in the file without modification (See line 1 above)
  • If the field contains a comma the field is enclosed in quotation marks: ” (see line 2 above)
  • If the field contains a quotation mark the field is enclosed in quotation marks and the quotation marks within the field are doubled, this serves as a type of “escaping.” (see line 5 above)
  • If the field contains a line feed the field is enclosed in quotation marks (see lines 9 & 10 above)

Seems simple enough, but when I first looked into this I thought “surely there’s a good code snippet out there for this,” so off to Google I went. I found some interesting articles on using an ODBC text driver; but since I was developing within the context of a Metro Windows 8 Style UI application that idea was out. I found an assortment of pre-compiled libraries with various licensing strategies. I also found some suggestions to add a dependency on the Microsoft.VisualBasic namespace and use their TextParser, this also did not appeal to me. So I sat down at the drawing board and came up with a specification.

  1. If a field does not begin with a quotation mark the field ends once the next comma is located or the end of the line is reached
  2. If a field begins with a ” mark then the field ends once a single ” mark is identified since quotation marks within the field are doubled

My new code had to be more generic, capable of importing a CSV file regardless of the number of columns. It also had to work within the new Windows 8 Style-UI restrictions, which means I had to use Windows.Storage.FileIO rather than a StreamReader, however at the time of this writing I only have access to Visual Studio 2010 so bear with me.

The following code provides a self-contained class that imports the CSV file into a collection of Row objects, which are themselves simply a List of string values representative of each column. This is simply a proof of concept, deploying this code would likely involve a few tweaks; such as allowing for column headings, empty rows, a delimiter other than a comma, etc. However I hope this code will be useful to someone who may need some help getting started with their own CSV importing logic without potentially cumbersome licensing requirements.

    class Program
    {
        static void Main(string[] args)
        {
            var MyFile = new CSVFile(@".\csv_test2.csv");

            for (int row = 0; row < MyFile.Rows.Count; row++)
            {
                Console.WriteLine("Row {0}", row.ToString());

                for (int col = 0; col < MyFile.Rows[row].Fields.Count; col++)
                {
                    Console.WriteLine("Field {0}  -  {1}", col.ToString(), MyFile.Rows[row].Fields[col]);
                }

                Console.WriteLine();
            }


            Console.ReadKey();
        }
    }

    public sealed class CSVFile
    {
        private int m_ColumnCount;
        private List<CSVRow> m_Rows;

        public int ColumnCount { get { return m_ColumnCount; } }
        public System.Collections.ObjectModel.ReadOnlyCollection<CSVRow> Rows { get { return m_Rows.AsReadOnly(); } }


        public CSVFile(string path)
        {
            m_Rows = new List<CSVRow>();

            int curCharVal = 0;
            char curChar;
            bool inQuotes = false;
            var curField = new StringBuilder();
            var curRow = new CSVRow();

            try
            {
                using (var sr = new System.IO.StreamReader(path))
                {
                    curCharVal = sr.Read();

                    while (curCharVal >= 0)
                    {
                        //We can't be sure if the file we've received uses Line Feed (char 10) by itself, or Carriage Return / Line Feed (chars 13 / char 10) to indicate a new line
                        //what we can be sure of, however, is that we really don't care if there's a 13!
                        while (curCharVal == 13)
                        {
                            curCharVal = sr.Read();

                            if (curCharVal == -1)
                                break;
                        }

                        //Sanity check, if we ended up with a -1 due to curCharVal == 13 loop...
                        //Should never happen, but god knows what some people's CSV files look like
                        if (curCharVal == -1)
                        {
                            curRow.Fields.Add(curField.ToString());
                            curField.Clear();
                            this.m_Rows.Add(curRow);

                            break;
                        }

                        curChar = (char)curCharVal;

                        if (inQuotes)
                        {
                            //If we're in a quote enclosed field, we need to identify
                            //  if these new quotes are escaped (doubled)
                            //  and if they are, only add a single set of quotes to our
                            //  current field.  If they are not escaped, then we are
                            //  no longer in a quote enclosed field
                            if (curChar == '"')
                            {
                                curCharVal = sr.Read();

                                if (curCharVal >= 0)
                                {
                                    curChar = (char)curCharVal;

                                    if (curChar != '"')
                                    {
                                        inQuotes = false;
                                        
                                        //The new character we just imported (presumably a comma)
                                        //  will be handled once we fall through into the next if block below
                                    }
                                    else
                                    {
                                        curField.Append(curChar);
                                    }
                                }
                            }
                            else
                            {
                                curField.Append(curChar);
                            }
                        }
                        
                        //This is a separate if statement, rather than an else clause
                        //  because within the if block above, the inQuotes value could be
                        //  set to false, in which case we want to evaluate the logic
                        //  within this code block
                        if(!inQuotes)
                        {
                            if (curField.Length == 0 && curChar == '"')
                            {
                                inQuotes = true;
                            }
                            else if (curChar == ',')
                            {
                                curRow.Fields.Add(curField.ToString());
                                curField.Clear();
                            }
                            else if (curCharVal == 10)
                            {
                                curRow.Fields.Add(curField.ToString());
                                curField.Clear();

                                //We're done with this row, add it to the list and set
                                //  ourselves up for a fresh row.
                                this.m_Rows.Add(curRow);
                                curRow = new CSVRow();
                            }
                            else
                            {
                                curField.Append(curChar);
                            }
                        }


                        curCharVal = sr.Read();

                        //We just reached the end of the file.
                        //  Add the current row to the list of rows before the loop ends
                        if (curCharVal == -1)
                        {
                            curRow.Fields.Add(curField.ToString());
                            curField.Clear();
                        }
                    }
                }
            }
            catch
            {
                m_Rows.Clear();
                m_ColumnCount = 0;
            }
        }


        public sealed class CSVRow
        {
            private List<string> m_Fields;

            public List<string> Fields { get { return m_Fields; } }

            public CSVRow()
            {
                m_Fields = new List<string>();
            }
        }
    }

Here’s the output of the above code:

Row 0
Field 0  -  Col1
Field 1  -  Col2

Row 1
Field 0  -  2,500
Field 1  -  cake

Row 2
Field 0  -  $2,500.00
Field 1  -  is

Row 3
Field 0  -  "twenty-five hundred"
Field 1  -  good

Row 4
Field 0  -  I really like " marks
Field 1  -  especially

Row 5
Field 0  -  I REALLY like """ marks
Field 1  -  new

Row 6
Field 0  -  spreading " them " around " here
Field 1  -  york

Row 7
Field 0  -  fake cell ending ", mid cell
Field 1  -  cheesecake

Row 8
Field 0  -  this cell
spans 2 lines
Field 1  -  with strawberries

I’ve used code very similar to this to import 150+ MB CSV files in a matter of seconds without issue.

As always, if there are any questions; feel free to drop me a line. If you have a more elegant solution, I’d love to see that too!

Until next time.

-Alan

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s