
Quick Post: List<int> Intersect Performance

I noticed an interesting performance difference earlier today, so I thought I’d throw together a quick post in case it’s useful for anyone else.  In a project that I’m working on, I need to perform a good number of intersect operations on separate collections of integers.  I happened to notice that the performance of these operations depends on the order in which the collections are specified.  So let’s jump right in and take a look.  I’m running all of my tests with Visual Studio 2013 in a .NET 4.5 console application compiling for x64.  CPU: Intel Xeon E3-1230 V2 with HT enabled.

using System;
using System.Collections.Generic;
using System.Linq;

namespace CompareSpeed
{
    class Program
    {
        static void Main(string[] args)
        {
            int largeColSize = 10000;
            int smallColSize = 500;
            int iterations = 25000;

            Console.WriteLine("Total Rows: {0}", (largeColSize + smallColSize).ToString("#,0"));
            Console.WriteLine("Ratio of small to large: {0}", (decimal)smallColSize / (decimal)largeColSize);
            Console.WriteLine();

            var rand = new System.Random();
            var largeCollection = new List<int>(largeColSize);
            var smallCollection = new List<int>(smallColSize);

            //Fill large collection
            for (int i = 0; i < largeColSize; i++)
            {
                largeCollection.Add(rand.Next(int.MinValue, int.MaxValue));
            }

            //Fill small collection
            for (int i = 0; i < smallColSize; i++)
            {
                smallCollection.Add(rand.Next(int.MinValue, int.MaxValue));
            }


            var sw = new System.Diagnostics.Stopwatch();

            /*  CODE
             *  GOES
             *  HERE
             *  BASED
             *  ON
             *  TEST
             */

            Console.ReadKey();
        }
    }
}

One of the following code blocks gets placed in the commented section above based on which test I was running.

//Intersect large to small, single-threaded
sw.Start();
for (int i = 0; i < iterations; i++)
    largeCollection.Intersect(smallCollection).ToList();
sw.Stop();
Console.WriteLine("Intersect Lg->Sm Single Thread:\t {0}", sw.ElapsedMilliseconds);

//Intersect small to large, single-threaded
sw.Start();
for (int i = 0; i < iterations; i++)
    smallCollection.Intersect(largeCollection).ToList();
sw.Stop();
Console.WriteLine("Intersect Sm->Lg Single Thread:\t {0}", sw.ElapsedMilliseconds);

//Intersect large to small, parallel
sw.Start();
for (int i = 0; i < iterations; i++)
    largeCollection.AsParallel().Intersect(smallCollection.AsParallel()).ToList();
sw.Stop();
Console.WriteLine("Intersect Lg->Sm Parallel:\t {0}", sw.ElapsedMilliseconds);

//Intersect small to large, parallel
sw.Start();
for (int i = 0; i < iterations; i++)
    smallCollection.AsParallel().Intersect(largeCollection.AsParallel()).ToList();
sw.Stop();
Console.WriteLine("Intersect Sm->Lg Parallel:\t {0}", sw.ElapsedMilliseconds);

Here are the results.  All times are in milliseconds.  Green indicates the shortest duration, orange indicates the longest duration.

Results

Based on this testing, I think there’s a strong case to be made for performing an intersect using the larger collection on the left side.
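Why should the order matter?  My reading of the reference source for Enumerable.Intersect suggests an explanation (this is just my interpretation, and it could differ between framework versions): the lookup set is built from the second (right-hand) sequence, and the first sequence is then streamed against it.  A rough sketch of that shape:

```csharp
using System;
using System.Collections.Generic;

static class IntersectSketch
{
    // Simplified sketch of how Enumerable.Intersect appears to behave (my
    // reading of the reference source, not the actual framework code): the
    // set is built from the SECOND sequence, then the FIRST is streamed.
    public static IEnumerable<T> Intersect<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        var set = new HashSet<T>(second); // cost scales with the right-hand collection
        foreach (var item in first)
        {
            if (set.Remove(item)) // Remove also ensures each match is yielded only once
                yield return item;
        }
    }
}
```

If that reading is right, largeCollection.Intersect(smallCollection) builds a 500-entry set on each iteration and streams 10,000 items through it, while the reverse builds a 10,000-entry set on each iteration, which lines up with the timings above.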

Using the Managed Extensibility Framework (MEF) to implement a third-party plugin solution. Part 3 – Developing the plugins

Now that my application supports plugins, I’m ready to publish my plugin SDK out for other developers to make use of.  If I take a look in my .\bin\Debug folder for the MyPluginApp project (yes, I’m staying in Debug mode for the purposes of this example) I see the following.

Debug Folder Contents

MyPluginApp.exe is, of course, my Windows Forms application executable, but alongside it there is a file called MyPluginApp.Plugins.dll.  This .dll is the compiled version of the Class Library named MyPluginApp.Plugins that I added to my solution in Part 2.  This is the .dll that I will need to make available to 3rd party developers (henceforth I will refer to it as my “SDK”), along with, of course, some thorough documentation outlining exactly what I expect them to implement and what metadata I expect.

Now I am going to create a new solution from the perspective of a 3rd party developer who wants to develop a plugin for my application.  I start by creating a new “Class Library” project, because I want my code to compile to a .dll that I can place in the Plugins directory for my application.  I’m going to call my new solution MyPluginApp_StarterPlugins.

Once my new solution and project have been created, I take the MyPluginApp.Plugins.dll file that I downloaded as part of the application developer’s SDK *wink wink* and place it in my project directory.  I then add the file to my project, making sure to set the Build Action to ‘None’ and Copy to Output Directory to ‘Do Not Copy.’  The step of explicitly adding the .dll file to my solution is not required; however, I think it adds some clarity to the purpose of my project.  Now that I’ve physically gotten the SDK into the appropriate place, I need to add a reference to it within the project.  I do this by adding a reference to the actual .dll file that I just moved into my project’s directory.  Also, just to keep Visual Studio from copying the .dll to my bin\Debug and bin\Release folders, I will set the “Copy Local” property of the reference to false.  I also need to add a reference to System.Windows.Forms, because my intention is to use MessageBox when my plugins perform their tasks.  Finally, I need to add a reference to System.ComponentModel.Composition in order to gain access to the necessary MEF namespaces.

Project Directory

Project Directory Contents

Solution Explorer Containing File

Solution Explorer View

Reference

Solution Explorer References

I’m going to develop two simple plugins within this project.  The first, AdditionPlugin, will add the two numbers together.  The second, MultiplicationPlugin, will multiply them.  Both plugins will implement the IPlugin interface, which we defined in the MyPluginApp.Plugins project and which is available to me as a 3rd party developer within the .dll that I have added to my project.

Now I’m ready to define my first plugin.  First I will create the AdditionPlugin.  I rename the Class1.cs default file to AdditionPlugin.cs and replace its contents with the following:

using System;

using System.ComponentModel.Composition; //MEF namespace for Exporting
using System.Windows.Forms; 

namespace MyPluginApp.Plugins
{
    [Export(typeof(IPlugin))] //**1
    [ExportMetadata("Name", "Addition Plugin")] //**2
    public class AdditionPlugin : IPlugin
    {
        public void HandleNumbers(decimal num1, decimal num2)
        {
            MessageBox.Show(string.Format("{0} + {1} = {2}", num1, num2, num1 + num2), "Addition Plugin");
        }
    }


    
    //**1 - This lets the CompositionContainer in MyPluginApp know that this class is of type IPlugin, which will be loaded 
    //  when we call .GetExports<IPlugin, IPluginMetaData>() in the LoadPlugins() method.

    //**2 - This is how metadata is exported by the plugin.  The exported metadata needs to match the metadata interface we defined in MyPluginApp.
    //  In this case we have only 1 defined field in IPluginMetaData called "Name", so that is the only piece of metadata that we want to export.
}

Now if this plugin is enabled (remember the Checked List Box in MyPluginApp?), its HandleNumbers method will be added to the MyButtonClicked event, which gets Invoked every time the button on my form gets clicked.

Similarly to the AdditionPlugin, I add a new .cs file to the project and name it MultiplicationPlugin.cs.  Here are the contents of that file:

using System;
using System.ComponentModel.Composition; //MEF namespace for Exporting
using System.Windows.Forms;

namespace MyPluginApp.Plugins
{
    [Export(typeof(IPlugin))]
    [ExportMetadata("Name", "Multiplication Plugin")]
    public class MultiplicationPlugin : IPlugin
    {
        public void HandleNumbers(decimal num1, decimal num2)
        {
            MessageBox.Show(string.Format("{0} * {1} = {2}", num1, num2, num1 * num2), "Multiplication Plugin");
        }
    }
}

Now I can build my solution.  After the build is complete, I find the following in MyPluginApp_StarterPlugins bin\Debug folder:

MyPluginApp_StarterPlugins bin\Debug contents

MyPluginApp_StarterPlugins.dll needs only to be placed in the \Plugins directory of MyPluginApp prior to MyPluginApp being started up.  Here is what MyPluginApp looks like after I do just that.

MyPluginApp with loaded plugins

Clicking on MyButton at this point still yields the familiar output from before Plugin support was added.

MyPluginApp Output no plugins enabled

Clicking OK on the MessageBox dialog returns me to the application.  Now I change Number 1 to “8,” check both boxes, and then click MyButton.

MyPluginApp Enabled Plugins

MyPluginApp - App Output

MyPluginApp - Addition Plugin Output

MyPluginApp - Multiplication Plugin Output

Unchecking the box next to a plugin prior to clicking MyButton again causes that plugin to become unsubscribed from the event, so its MessageBox no longer pops up.

I’ve certainly only begun to scratch the surface of MEF in this series of posts.  But I hope that this simple tutorial showing a method of developing plugin .dll files completely independently of your original application’s code/solution will be of some value to someone.

I have posted a .zip file containing both complete solutions Here.

Using the Managed Extensibility Framework (MEF) to implement a third-party plugin solution. Part 2–Plugin Support in the Application

Now that I have a feature-complete application from my perspective, I am ready to add plugin support so that other developers can extend the functionality of my application.  I don’t want to make my application open source, so I want to just expose enough information about my application to them that they are able to create plugins.

In order to give my application access to MEF, I need to add a reference to the appropriate namespace within my Windows Forms application.  The namespace I need to reference is System.ComponentModel.Composition.  Once I’ve done that, I need to add a new project to my solution of type “Class Library.”  This project will be my “Plugin API” and will compile to a .dll file that I can distribute to developers as part of my SDK so that they can create plugins that adhere to my standard.  After creating the new project, I need to add a reference to it from within my Windows Forms project.

At this point, my application solution looks like this:

Solution Explorer

Here are the contents of IPlugin.cs in the MyPluginApp.Plugins project:

using System;

namespace MyPluginApp.Plugins
{
    //This interface defines what we expect a plugin developer to implement in order for our application to make use of their plugin
    public interface IPlugin
    {
        //This method's signature matches the MyButtonClickedEventHandler in our application
        void HandleNumbers(decimal num1, decimal num2);
    }

    //MEF allows for metadata to be exported along with the object itself, so we need a separate interface to receive any metadata that we expect
    // plugin developers to include as well.
    public interface IPluginMetaData
    {
        string Name { get;}
    }
}

Next I need to update my UI so that it can display the list of available plugins and allow the user to enable/disable them at will.  I’ve opted to use a checked list box to achieve that goal.  The plugins will load initially disabled; the user simply checks the box next to a plugin to enable it, and unchecking the box disables it again.

Here is my new UI:

MyPluginApp

And now the code for the form, including the code to enable plugin support.  New lines of code are highlighted for your convenience.

using System;
using System.Windows.Forms;

using System.Collections.Generic;
using MyPluginApp.Plugins;  //Our plugin namespace
using System.ComponentModel.Composition.Hosting; //MEF namespace

namespace MyPluginApp
{
    public partial class Form1 : Form
    {
        private delegate void MyButtonClickedEventHandler(decimal num1, decimal num2);
        private event MyButtonClickedEventHandler MyButtuttonClicked;

        private readonly Dictionary<string, IPlugin> _plugins = new Dictionary<string, IPlugin>();

        public Form1()
        {
            InitializeComponent();

            this.LoadPlugins();
        }

        //the Leave event handler for both text boxes
        private void textNum_Leave(object sender, EventArgs e)
        {
            decimal d;
            var textNum = (TextBox) sender;

            if (!decimal.TryParse(textNum.Text, out d))
            {
                textNum.Text = "5";
            }
        }

        //The KeyDown event handler for both text boxes
        private void txtNum_KeyDown(object sender, KeyEventArgs e)
        {
            if (e.KeyData == Keys.Enter || e.KeyData == Keys.Return)
            {
                btnMyButton.Focus(); //fires the Leave event for the active text box
                btnMyButton.PerformClick();
            }
        }

        //OnClick event handler for the button
        private void btnMyButton_Click(object sender, EventArgs e)
        {
            decimal num1;
            decimal num2;

            MessageBox.Show(string.Format("You entered {0} and {1}", txtNum1.Text, txtNum2.Text));

            if (decimal.TryParse(txtNum1.Text, out num1) && decimal.TryParse(txtNum2.Text, out num2))
            {
                //If our new event has any subscribers, invoke the event.
                if(this.MyButtuttonClicked != null)
                    this.MyButtuttonClicked.Invoke(num1, num2);
            }
        }

        private void LoadPlugins()
        {
            var pluginDir = System.IO.Path.GetDirectoryName(Application.ExecutablePath) + @"\Plugins\";

            //Create plugin directory if it doesn't exist
            System.IO.Directory.CreateDirectory(pluginDir);

            //DirectoryCatalog gets a list of all .dll files in our plugin directory
            var catalog = new DirectoryCatalog(pluginDir, "*.dll");

            //CompositionContainer parses all of the dlls in the catalog and identifies their exports
            var container = new CompositionContainer(catalog);

            //Iterate through each export that matches our IPlugin interface and add it to our plugin collection
            foreach (var plugin in container.GetExports<IPlugin, IPluginMetaData>())
            {
                this._plugins.Add(plugin.Metadata.Name, plugin.Value);
                this.clbPlugins.Items.Add(plugin.Metadata.Name, false);
            }

            //Best practice per MS is to dispose the CompositionContainer as soon as we are finished with it.
            container.Dispose();
            catalog.Dispose();
        }

        //ItemCheck event for clbPlugins (checked list box)
        private void clbPlugins_ItemCheck(object sender, ItemCheckEventArgs e)
        {
            var item = clbPlugins.Items[e.Index].ToString();


            if (e.NewValue == CheckState.Checked)
            {
                //If the user checked the box for the plugin, subscribe the plugin's handler to the event
                this.MyButtuttonClicked += this._plugins[item].HandleNumbers;
            }
            else
            {
                //If the user unchecked the box for the plugin, unsubscribe the plugin's handler to the event
                this.MyButtuttonClicked -= this._plugins[item].HandleNumbers;
            }
        }
    }
}

This concludes post #2.  In Post #3 I will create a few example plugins in their own solution without access to any of the code contained in Form1.cs, just as I would expect a 3rd party developer to be able to do.

Index lookup or hash table? Part 2

I ended my last blog post touting the performance advantages of index lookup versus a hash table for data loading as well as data retrieving, however I added a caveat regarding the issue of removing items from the collection. With the hash table it’s pretty straightforward, Microsoft has provided a built-in method for removing an item based on its key that is relatively inexpensive to use. Removing an item from the index collection requires that we iterate through all of the items that came after the item we just removed and update their index value to represent their new position within the collection. If we have a separate client / server design, we must then also synchronize those index changes back to the client so that future calls from the client end up referencing the correct object.

Just how expensive is this maintenance?  Could index lookup still be a viable solution for collections with a moderate to heavy amount of removals?  Let’s find out.

I’m using the same objects and junk CSV file as before for my data set. See the files page to acquire the JunkData file for your own testing. Recall that this junk data set is 1 million rows of comma-separated data, the first column of which serves as a numeric row identifier, while the subsequent 6 columns contain random string data. Here are the HashObject and IndexObject definitions again for reference.

    //This class can be used to represent a single row
    sealed class HashObject
    {
        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public HashObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

    //This class can be used to represent a single row
    sealed class IndexObject
    {
        //Added int property to record the object's index location within the List
        public int index { get; set; }

        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public IndexObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

And here is the code I am using to test performance of item removal for both the index lookup and hashtable methods.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();
            var index = new List<IndexObject>();

            DateTime start;
            TimeSpan duration;

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);

                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }


            //For the removal speed test we are going to need two collections, since the ID/Index #s do not match up directly
            var idx_removes = new List<int>();
            var hash_removes = new List<int>();

            //Get the list of Keys in the hash table
            var hashkeys = hash.Keys.ToList();


            //Identify set of items to remove

            //End identify set of items to remove

            start = DateTime.Now;
            foreach (var i in hash_removes)
            {
                RemoveHash(i, hash);
            }
            duration = DateTime.Now - start;
            Console.WriteLine("hashtable: {0}", duration.Ticks);


            start = DateTime.Now;
            foreach (var i in idx_removes)
            {
                RemoveIndex(i, index);
            }
            duration = DateTime.Now - start;
            Console.WriteLine("index: {0}", duration.Ticks);



            Console.ReadKey();
        }

        static void RemoveHash(int ID, Dictionary<int, HashObject> hash)
        {
            hash.Remove(ID);
        }

        static void RemoveIndex(int idx, List<IndexObject> index)
        {
            //First we remove the indicated item
            index.RemoveAt(idx);

            //Removing the item will "shift" all of the following items 1 position towards the
            //  "front" of the List.  Every item from position idx onward now sits one slot
            //  earlier, so each of those items needs to have its index value decremented.
            for (int i = idx; i < index.Count; i++)
            {
                index[i].index--;
            }   
        }
    }

Take special note of the commented “Identify set of items to remove” section in the code above. From here on out I will post several different code snippets to place in that section, rather than posting the code for the entire program over and over. These snippets will test the effects of several different scenarios.

  • Removing an item at or near the “end” of the collection
  • Removing an item at or near the “beginning” of the collection
  • Removing an item at or near the “middle” of the collection

Once again, for all of my tests I will be working in Visual Studio 2010 with a .NET 4.0 command-line application in debug mode.

We’ll start with two different tests that remove items near the end of the collection. Test 1 will actually remove the “last” item, and Test 2 will remove an item that is 20000 positions from the end. Here is the code snippet for Test 1.

    idx_removes.Add(index.Count - 1);
    hash_removes.Add(hashkeys[index.Count - 1]);

And here are the results

hashtable: 0
index: 0

Moving right along to test #2, here’s the code.

    idx_removes.Add(index.Count - 20001);
    hash_removes.Add(hashkeys[index.Count - 20001]);

And the results

hashtable: 10032
index: 10032

As long as we’re working with items near the end of the collection, the index collection and hash table seem to be neck and neck.

In the next scenario, we’ll test the reverse of what we just tested. We’ll remove the “first” item, as well as an item that is 20000 positions into the collection from the beginning. Here is the code snippet for the first test.

    idx_removes.Add(0);
    hash_removes.Add(hashkeys[0]);

And here are the results

hashtable: 10032
index: 1003200

Moving right on to the second test of this scenario. Here’s the code.

    idx_removes.Add(19999);
    hash_removes.Add(hashkeys[19999]);

And the results

hashtable: 10032
index: 1003200

It’s becoming clear that the Remove method built into .NET’s hash table implementation has a fairly low cost regardless of the item’s position within the collection, while the index collection suffers from a variable amount of overhead that grows the farther the removed item is from the end of the collection. For our final scenario, we’ll perform a removal from the middle of the collection.
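That overhead follows from how List<T> stores its items: RemoveAt(i) shifts every trailing element one slot toward the front, and my index-fixup loop then walks that same tail a second time. A rough cost model (my own illustration, not derived from the measurements in this post):

```csharp
using System;

static class CostModel
{
    // Back-of-the-envelope cost of removing position i from a List of n items:
    // the (n - i - 1) trailing elements are touched once by List<T>.RemoveAt's
    // internal shift, and once more by the loop that decrements their stored
    // index values.
    public static long RemovalWork(int n, int i)
    {
        long tail = n - i - 1;  // elements after position i
        return tail * 2;        // shifted once, index-fixed once
    }
}
```

For the 1-million-row collection, removing the last item touches nothing, while removing the first touches nearly 2 million elements, which matches the shape of the timings in these tests.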

    idx_removes.Add(499999);
    hash_removes.Add(hashkeys[499999]);

And the results

hashtable: 10012
index: 490588

The hash table is the clear winner when it comes to removing items. While index lookup does have an advantage when it comes to the initial load and lookup of items, if you will be doing any significant number of remove operations you may want to look towards the hash table for your needs. However if item removals are rare, index lookup may still be the better choice because of its clear advantage in lookup speed.

As always, questions or comments are welcome.
-Alan

Index Lookup or hash table?

I recently shared out the link to my blog to a few friends and colleagues. After all, that’s how you get readers, right? Anyhow, my best friend, who happens to be a PowerShell guru, was talking to me this morning about my Index Lookup post; and his question was, “why didn’t you just use a hashtable?”

We don’t, at least I don’t, think of the Hashtable very often anymore, ever since .NET 2.0 came along and swept us off our feet with the wonderful happiness that is generics. So when a PowerShell guy says “Hashtable,” what you probably will think of from a .NET perspective is a Dictionary<TKey, TValue>; this is simply the generic version of the Hashtable of old. In simplest terms, a Hashtable provides the ability to “map” a key to a specific value; and then quickly retrieve the value when you supply the key. For an in-depth discussion of hash tables, see Wikipedia, a good discussion of the Dictionary<TKey, TValue> specifically can be found on MSDN here.
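For anyone who hasn’t touched one since the pre-generics days, here is a tiny illustration (the names are my own) of that map-then-retrieve behavior:

```csharp
using System;
using System.Collections.Generic;

// Map keys to values, then retrieve values by key with a generic Dictionary.
var ages = new Dictionary<string, int>();
ages.Add("Alan", 30);   // map a key to a value
ages["Brenda"] = 28;    // the indexer does the same thing

int alansAge = ages["Alan"]; // fast retrieval by key

int carlsAge;
bool found = ages.TryGetValue("Carl", out carlsAge); // safe lookup: false when the key is absent

Console.WriteLine("{0} {1}", alansAge, found); // prints "30 False"
```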

I have to admit, my confidence in my Index Lookup post was somewhat shaken when he asked me about it so matter-of-factly. Had I made a mistake? Was a hash table a better solution? He and I spent a little over an hour collaborating this morning, testing various scenarios, which are the subject of this post. First, we needed a data set, so I did something I should have done years ago: I generated a CSV file containing 1 million rows of junk data, the first column of which is a semi-sequential integer ID, followed by 6 columns of random string data. I’ll post a link to this dataset on the files page. The file looks like 1 million rows of this:

1,tfxnvwcyycrgth,ggqbrhvgd,ltpdhe,udygxoarwof,aiozgrspzejyyukjdm,ryelqidjspjvsqid
2,xtjloonptahderoju,zvwxqciaapjr,prpljdic,rvxiwfnyv,ycjhyzdgdoo,gbsrlpfffllpdaaigc
4,btwtfgjqlgx,nutpvpe,ysqgmqtexzvzabme,imlcytzifx,klfsbjoidqgpmfsmk,fmhofmdtrzzryhsytm
5,cxqndbtcxyxggokz,nvfsphsojlr,wolgsizoyenlkfvttsd,hglyganpyzuq,dbjla,fklnjiuijmyv
8,idrpdnzsujpzjx,gdlhtexdw,bpmrddaddhyllgya,lbvnzllbgbqclkauw,xnaq,pvfpkvs
10,jsjlbbf,ebqxc,bsxyqpfbnv,ugstwagnwhjcyytq,smksdetnsledzhfoqhv,iiajjgdsqya
11,yfiblcqbzgepckwzen,rydqklai,abwtlgetyedoqtnbdl,zjdocqjy,oxpkttbyjkdvgcba,tafqpgatjn
13,nudchjeapsvxsprthl,hcnrwitezqorjfe,phbnwxalsy,naxnth,guzcgtoffjablr,eclfxwjcczye
15,dbdvjtyk,nbhzqrqkrnl,wvbnkuwrwciphbdgfrs,dspkyxfrmrilshh,bhzzbsp,tsspnjezus
16,dgrpmoffeuwwzuyfx,trlderffqdcnjqz,dzjsmkwnkpw,supdggbbjabzjgxbc,lgwnrywffiuxto,jmvnzgcoyccxkjfmv

This structure can be represented by the following objects for storage in a Dictionary and a List, respectively.

    //This class can be used to represent a single row
    sealed class HashObject
    {
        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public HashObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

    //This class can be used to represent a single row
    sealed class IndexObject
    {
        //Added int property to record the object's index location within the List
        public int index { get; set; }

        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public IndexObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

Clearly there are two different metrics to test. The first is the amount of memory used by each method; the second is the amount of time required to access a given number of objects. We’ll start with memory. .NET doesn’t give us many good ways to determine the amount of memory a given object or collection is using, so for our purposes I am simply going to use Task Manager to determine the amount of memory the process is using. I will be running all of my code in Visual Studio 2010 as a .NET Framework 4.0 command-line application in debug mode.
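As a side note, one programmatic alternative (my addition; the numbers below still come from Task Manager) is GC.GetTotalMemory, which reports the managed-heap size. It is coarse, but repeatable enough for before/after comparisons:

```csharp
using System;

// Measure the managed-heap growth around an allocation. Forcing a full
// collection first keeps dead objects from inflating the baseline.
long before = GC.GetTotalMemory(true);

var data = new int[1000000]; // roughly 4 MB of managed allocations

long after = GC.GetTotalMemory(true);
GC.KeepAlive(data);          // keep the array live across the second measurement

Console.WriteLine("Approx. bytes allocated: {0:#,0}", after - before);
```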

First, we’ll measure the memory usage of the hash table implementation. I’ve omitted the HashObject definition for brevity.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);
                }
            }            

            Console.ReadKey();
        }
    }

I repeated this process three times and took averages of both memory usage and the amount of time required to fully load the .csv into the hash table. The average amount of memory allocated to my application’s process was 327212 KB, and the average load time was 2763 ms. Now let’s repeat the process with the “index” version. Again, I will omit the IndexObject definition for brevity’s sake.

    class Program
    {
        static void Main(string[] args)
        {
            var index = new List<IndexObject>();

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }            

            Console.ReadKey();
        }
    }

Again the process was executed three times and the results averaged; memory usage was 310432 KB on average, and the time to load averaged 2673 ms. In both cases the index method seems to be more efficient, but are 16780 KB of memory and 90 ms of CPU time worth it? What if the hash table can more quickly satisfy item lookups? Let’s continue onward and find out.

Here is the code snippet I am using to get a distributed list of items to perform a lookup on. Changing the modulo factor (the ‘7’ in the "i % 7 == 0" check below) to a larger number will decrease the sample size, while decreasing it will increase the sample size.

    //For the lookup speed test we are going to need two collections, since the ID/Index #s do not match up directly
    var idx_lookups = new List<int>();
    var hash_lookups = new List<int>();

    //Get the list of Keys in the hash table
    var hashkeys = hash.Keys.ToList();

    for (int i = 0; i < index.Count; i++)
    {
        //This will give us an approximately 14% sample size
        if (i % 7 == 0)
        {
            idx_lookups.Add(i);

            //This will add the corresponding "ID" to the hash_lookups
            // collection, so both the index and hash method will be looking
            // up the same items
            hash_lookups.Add(hashkeys[i]);
        }
    }

Here is the full code of my program at this point. I’m excluding the definitions for HashObject and IndexObject to save space; they have not changed from how they were originally posted above. This code will load both the hash table and my index lookup list, then identify approximately 14% of the records and look them up both in the hash table and in the list by index. The amount of time in ticks (1 tick = 100 nanoseconds) will be printed to the screen for each method.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();
            var index = new List<IndexObject>();

            DateTime start;

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');
                    
                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);
                    
                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }


            //For the lookup speed test we are going to need two collections, since the ID/Index #s do not match up directly
            var idx_lookups = new List<int>();
            var hash_lookups = new List<int>();

            //Get the list of Keys in the hash table
            var hashkeys = hash.Keys.ToList();

            for (int i = 0; i < index.Count; i++)
            {
                //This will give us an approximately 14% sample size
                if (i % 7 == 0)
                {
                    idx_lookups.Add(i);

                    //This will add the corresponding "ID" to the hash_lookups
                    // collection, so both the index and hash method will be looking
                    // up the same items
                    hash_lookups.Add(hashkeys[i]);
                }
            }

            start = DateTime.Now;
            foreach (var i in hash_lookups)
            {
                var obj = hash[i];
            }
            Console.WriteLine("hashtable: {0}", (DateTime.Now - start).Ticks);


            start = DateTime.Now;
            foreach (var i in idx_lookups)
            {
                var obj = index[i];
            }
            Console.WriteLine("index: {0}", (DateTime.Now - start).Ticks);

            
            Console.ReadKey();
        }
    }

On my system, this is the output:

hashtable: 50000
index: 20000

These results seem to remain consistent across different modulo values, with the index lookup method always being between 2 and 2.5 times faster than the hash table lookup.
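One caveat on the timing code above: DateTime.Now has a fairly coarse resolution (typically on the order of 10–16 ms), so for short loops like these a Stopwatch gives more trustworthy numbers. Here is a sketch of one of the timing loops rewritten that way, with a small stand-in collection in place of the CSV data:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class StopwatchDemo
{
    static void Main()
    {
        // Stand-in collection; in the real test this comes from the CSV load
        var index = new List<int>();
        for (int i = 0; i < 100000; i++)
            index.Add(i);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < index.Count; i++)
        {
            var obj = index[i];
        }
        sw.Stop();

        // Note: Stopwatch.ElapsedTicks is in units of Stopwatch.Frequency,
        // but sw.Elapsed.Ticks is a TimeSpan value in 100 ns ticks, matching
        // the DateTime subtraction used in the code above
        Console.WriteLine("index: {0}", sw.Elapsed.Ticks);
    }
}
```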

So, in summary:

  • The index lookup method uses less memory
  • The index lookup method loads more quickly
  • The index lookup method is able to access items more quickly

The index lookup method does have one big disadvantage compared to the hash table, which I should probably mention in all fairness since I’ve been wailing on the hash table pretty hard so far. Removing items from the collection is a very dicey proposition, since it invalidates all of the index properties of items that came “after” the item you removed. It’s up to you to decide which method is most appropriate for your given development situation.
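To illustrate that removal problem, here is a small sketch using a trimmed-down item class of my own: after a RemoveAt, every item that sat after the removed one has its stored index off by one, and repairing them is itself an O(n) walk.

```csharp
using System;
using System.Collections.Generic;

class Item
{
    public int ID;
    public int index;
}

class RemoveDemo
{
    static void Main()
    {
        var list = new List<Item>();
        for (int i = 0; i < 10; i++)
            list.Add(new Item { ID = (i + 1) * 2, index = list.Count });

        // Remove the item at index 3...
        list.RemoveAt(3);

        // ...and every later item's stored index is now stale and must be
        // walked forward, one position at a time
        for (int i = 3; i < list.Count; i++)
            list[i].index = i;

        Console.WriteLine(list[3].index);  // prints 3 again after the repair
    }
}
```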

Questions? Comments? Just leave a post.

Until next time!
-Alan

CSV Importing in C#

Reading and writing flat files was something that I had considered my bread and butter for quite some time. Many years ago, when I was still in high school and considered myself a master of Visual Basic 6, I was involved in a startup, if you could call it that, whose goal was to set up an online review site for all things outdoors. Camping, hiking, rock climbing, etc., all included on one site. The idea was to allow people not only to get reviews of potential destinations, but to find out what was nearby at the same time. So, for example, upon selecting a potential campground, a site visitor might be able to see other activities within a certain distance. Nowadays I’m sure someone has pulled this idea off, but this isn’t a post about an outdoor activity review site; this is a post about reading CSV files. At that time, I was tasked with writing data-entry client software that other staff could use to input all of the information about a given location in discrete data fields, as well as a general description text area for free text. The application had to be usable offline, and then able to later synchronize its data with our database.

I hardly deserved to call myself a developer at the time, but confident in my knowledge I burst forth and wrote glorious VB6 code. I stored the data entered locally in a CSV file and all was well, until some nefarious (in my opinion) user dared to put a comma in their review. Of course this caused the database sync to fail miserably and all hope was lost. I then went about updating the client’s code to replace any commas with a substitute character, which I would then scan for during the sync operation. All was fixed and glorious once again.

I tell this story because up until recently I was perfectly happy with this method of addressing what, to me, was a limitation of the CSV format. It was simple, fast, and easy to support and maintain. There are, of course, many solutions to this problem when you have full control over the file that is being written and read. You could use a different delimiter, possibly even one that the user would have a very difficult time typing with a regular keyboard; or a fixed-width format, which works great as long as the client enforces the same length restrictions. But I’m wandering off topic again.

More recently, I finally encountered a project that required me to actually read CSV files that I did not create. That is, CSV files that someone might have exported from Excel, or Google Docs, or something similar. I no longer had control over the incoming format. Meaning I would have to load data that already had commas of its own within a field. But I’m getting ahead of myself.

Consider the following simple data set in Excel.
Simple excel dataset

When this data is saved as a CSV file, here is what the resulting file contents look like.

Col1,Col2
bland,cake
text,is
is,good
always,especially
easy,new
to,york
import,cheesecake

This sort of data can be easily imported using logic similar to the following.

    class Program
    {
        static void Main(string[] args)
        {
            var MyObjects = new List<MyObject>();

            using (var sr = new System.IO.StreamReader(@".\csv_test1.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();
                    var fields = line.Split(',');

                    var obj = new MyObject(fields[0], fields[1]);

                    MyObjects.Add(obj);
                }

            }
        }
    }

    class MyObject
    {
        private string m_Col1;
        private string m_Col2;
        public string Col1 { get { return m_Col1; } }
        public string Col2 { get { return m_Col2; } }

        public MyObject(string col1, string col2)
        {
            this.m_Col1 = col1;
            this.m_Col2 = col2;
        }
    }

This is fairly similar to how I would have handled reading a CSV file in the past. Of course, having the benefit of nice clean data certainly helps. Let’s look at another example.
A more complicated excel dataset

Saving this dataset as a CSV file and looking at the contents presents a somewhat different picture from before.

Col1,Col2
"2,500",cake
"$2,500.00 ",is
"""twenty-five hundred""",good
"I really like "" marks",especially
"I REALLY like """""" marks",new
"spreading "" them "" around "" here",york
"fake cell ending "", mid cell",cheesecake
"this cell
spans 2 lines",with strawberries

This demonstrates the finer details of what is acceptable in a CSV.

  • If the field contains no commas or quotation marks, nothing special happens; the text is simply placed directly in the file without modification (see line 1 above)
  • If the field contains a comma, the field is enclosed in quotation marks (see line 2 above)
  • If the field contains a quotation mark, the field is enclosed in quotation marks and the quotation marks within the field are doubled; this serves as a type of “escaping” (see line 5 above)
  • If the field contains a line feed, the field is enclosed in quotation marks (see lines 9 & 10 above)
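The same rules, read in reverse, tell you how to write a CSV field. Here is a minimal sketch of an escaping helper built directly from the list above (my own function, not part of the framework):

```csharp
using System;

class CsvEscapeDemo
{
    // Quote the field only when the rules above require it: it contains
    // a comma, a quotation mark, or a line break
    static string EscapeField(string field)
    {
        bool needsQuotes = field.IndexOf(',') >= 0
            || field.IndexOf('"') >= 0
            || field.IndexOf('\n') >= 0;

        if (!needsQuotes)
            return field;

        // Double any embedded quotation marks, then wrap the whole field
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    static void Main()
    {
        Console.WriteLine(EscapeField("bland"));                  // bland
        Console.WriteLine(EscapeField("2,500"));                  // "2,500"
        Console.WriteLine(EscapeField("I really like \" marks")); // "I really like "" marks"
    }
}
```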

Seems simple enough, but when I first looked into this I thought, “surely there’s a good code snippet out there for this,” so off to Google I went. I found some interesting articles on using an ODBC text driver; but since I was developing within the context of a Metro Windows 8 Style UI application, that idea was out. I found an assortment of pre-compiled libraries with various licensing strategies. I also found some suggestions to add a dependency on the Microsoft.VisualBasic namespace and use its TextFieldParser, which also did not appeal to me. So I sat down at the drawing board and came up with a specification.

  1. If a field does not begin with a quotation mark, the field ends once the next comma is located or the end of the line is reached
  2. If a field begins with a quotation mark, the field ends once a single (un-doubled) quotation mark is identified, since quotation marks within the field are doubled

My new code had to be more generic, capable of importing a CSV file regardless of the number of columns. It also had to work within the new Windows 8 Style-UI restrictions, which means I had to use Windows.Storage.FileIO rather than a StreamReader; however, at the time of this writing I only have access to Visual Studio 2010, so bear with me.

The following code provides a self-contained class that imports the CSV file into a collection of Row objects, which are themselves simply a List of string values representing each column. This is simply a proof of concept; deploying this code would likely involve a few tweaks, such as allowing for column headings, empty rows, a delimiter other than a comma, etc. However, I hope this code will be useful to someone who may need some help getting started with their own CSV importing logic without potentially cumbersome licensing requirements.

    class Program
    {
        static void Main(string[] args)
        {
            var MyFile = new CSVFile(@".\csv_test2.csv");

            for (int row = 0; row < MyFile.Rows.Count; row++)
            {
                Console.WriteLine("Row {0}", row.ToString());

                for (int col = 0; col < MyFile.Rows[row].Fields.Count; col++)
                {
                    Console.WriteLine("Field {0}  -  {1}", col.ToString(), MyFile.Rows[row].Fields[col]);
                }

                Console.WriteLine();
            }


            Console.ReadKey();
        }
    }

    public sealed class CSVFile
    {
        private int m_ColumnCount;
        private List<CSVRow> m_Rows;

        public int ColumnCount { get { return m_ColumnCount; } }
        public System.Collections.ObjectModel.ReadOnlyCollection<CSVRow> Rows { get { return m_Rows.AsReadOnly(); } }


        public CSVFile(string path)
        {
            m_Rows = new List<CSVRow>();

            int curCharVal = 0;
            char curChar;
            bool inQuotes = false;
            var curField = new StringBuilder();
            var curRow = new CSVRow();

            try
            {
                using (var sr = new System.IO.StreamReader(path))
                {
                    curCharVal = sr.Read();

                    while (curCharVal >= 0)
                    {
                        //We can't be sure if the file we've received uses Line Feed (char 10) by itself, or Carriage Return / Line Feed (chars 13 / char 10) to indicate a new line
                        //what we can be sure of, however, is that we really don't care if there's a 13!
                        while (curCharVal == 13)
                        {
                            curCharVal = sr.Read();

                            if (curCharVal == -1)
                                break;
                        }

                        //Sanity check, if we ended up with a -1 due to curCharVal == 13 loop...
                        //Should never happen, but god knows what some people's CSV files look like
                        if (curCharVal == -1)
                        {
                            curRow.Fields.Add(curField.ToString());
                            curField.Clear();
                            this.m_Rows.Add(curRow);

                            break;
                        }

                        curChar = (char)curCharVal;

                        if (inQuotes)
                        {
                            //If we're in a quote enclosed field, we need to identify
                            //  if these new quotes are escaped (doubled)
                            //  and if they are, only add a single set of quotes to our
                            //  current field.  If they are not escaped, then we are
                            //  no longer in a quote enclosed field
                            if (curChar == '"')
                            {
                                curCharVal = sr.Read();

                                if (curCharVal >= 0)
                                {
                                    curChar = (char)curCharVal;

                                    if (curChar != '"')
                                    {
                                        inQuotes = false;
                                        
                                        //The new character we just imported (presumably a comma)
                                        //  will be handled once we fall through into the next if block below
                                    }
                                    else
                                    {
                                        curField.Append(curChar);
                                    }
                                }
                            }
                            else
                            {
                                curField.Append(curChar);
                            }
                        }
                        
                        //This is a separate if statement, rather than an else clause
                        //  because within the if block above, the inQuotes value could be
                        //  set to false, in which case we want to evaluate the logic
                        //  within this code block
                        if(!inQuotes)
                        {
                            if (curField.Length == 0 && curChar == '"')
                            {
                                inQuotes = true;
                            }
                            else if (curChar == ',')
                            {
                                curRow.Fields.Add(curField.ToString());
                                curField.Clear();
                            }
                            else if (curCharVal == 10)
                            {
                                curRow.Fields.Add(curField.ToString());
                                curField.Clear();

                                //We're done with this row, add it to the list and set
                                //  ourselves up for a fresh row.
                                this.m_Rows.Add(curRow);
                                curRow = new CSVRow();
                            }
                            else
                            {
                                curField.Append(curChar);
                            }
                        }


                        curCharVal = sr.Read();

                        //We just reached the end of the file.
                        //  Add the current row to the list of rows before the loop ends
                        if (curCharVal == -1)
                        {
                            curRow.Fields.Add(curField.ToString());
                            curField.Clear();
                        }
                    }
                }
            }
            catch
            {
                m_Rows.Clear();
                m_ColumnCount = 0;
            }
        }


        public sealed class CSVRow
        {
            private List<string> m_Fields;

            public List<string> Fields { get { return m_Fields; } }

            public CSVRow()
            {
                m_Fields = new List<string>();
            }
        }
    }

Here’s the output of the above code:

Row 0
Field 0  -  Col1
Field 1  -  Col2

Row 1
Field 0  -  2,500
Field 1  -  cake

Row 2
Field 0  -  $2,500.00
Field 1  -  is

Row 3
Field 0  -  "twenty-five hundred"
Field 1  -  good

Row 4
Field 0  -  I really like " marks
Field 1  -  especially

Row 5
Field 0  -  I REALLY like """ marks
Field 1  -  new

Row 6
Field 0  -  spreading " them " around " here
Field 1  -  york

Row 7
Field 0  -  fake cell ending ", mid cell
Field 1  -  cheesecake

Row 8
Field 0  -  this cell
spans 2 lines
Field 1  -  with strawberries

I’ve used code very similar to this to import 150+ MB CSV files in a matter of seconds without issue.

As always, if there are any questions; feel free to drop me a line. If you have a more elegant solution, I’d love to see that too!

Until next time.

-Alan

Index Lookup

I’d like to start off by talking about a method of looking up items in a List or array that I recently started using whenever I can.  For lack of a better term, when referring to this while talking to a colleague, I have been calling it an “Index Lookup.”  If you know the proper term, please let me know in the comments for props.

Consider the following simple console application:

    class Program
    {
        private static List<SomeClass> MyList;

        class SomeClass
        {
            private static Random rand = new Random();

            public int ID { get; set; }
            public int Data { get; set; }

            public SomeClass()
            {
                this.Data = rand.Next(-2147483648, 2147483647);
            }

            public void DoSomeAction()
            {
                if (this.Data > -2147483648)
                    this.Data--;
                else
                    this.Data++;
            }
        }

        static void DoActionByID(int ID)
        {
            var FoundInstance = MyList.Single((item) => item.ID == ID);

            Console.WriteLine("Before: {0}", FoundInstance.Data.ToString());
            FoundInstance.DoSomeAction();
            Console.WriteLine("After: {0}", FoundInstance.Data.ToString());
        }


        static void Main(string[] args)
        {
            MyList = new List<SomeClass>();

            for (int i = 0; i < 30000000; i++)
            {
                //the ID is set to (i+1)*2 to simulate an arbitrary ID value from the DB
                //  or file data source
                var CurInstance = new SomeClass() { ID = (i + 1) * 2};

                MyList.Add(CurInstance);
            }


            //Let's assume the user takes some action and identifies item 37569360
            //  as the item they want to perform an action on
            DoActionByID(37569360);

            Console.ReadKey();
        }
    }

On my system with this sample program the item search takes just under 1 second (987 ms). Not horrible considering the collection size, but I think we can do better.

I recently started working on a project with a colleague of mine that required looking up n objects from a collection based on user selection. Since the design of our application separated the UI from the underlying objects, the client could not reference the underlying objects directly. Instead, the UI would pass back a list of IDs to be included or excluded and then wait for the results to be returned. Initially I was using an approach very similar to the one above, and performance was horrible. After beating my head against several solutions, I started to think about the fact that you can directly access members of a collection based on their index.

Gets or sets the element at the specified index

So as long as we know the index position of our items, we should be good to go. The good news is, knowing an item’s position is easy! As we all know, the .Add() method of a List adds the item to the end of the List; this means that, so long as you don’t sort the List or use the Insert() method to insert an item at a specific index, it’s a trivial matter to know the index position of every item you insert. Since Lists and arrays use a 0-based index, and the .Count and .Length properties respectively hold the total number of items, the index of the next inserted item will be equal to the current (pre-insertion) value of .Count or .Length.
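The pre-insertion Count trick can be shown in a few lines:

```csharp
using System;
using System.Collections.Generic;

class CountDemo
{
    static void Main()
    {
        var list = new List<string>();

        // Count is read *before* the Add, so it equals the index
        // at which the new item will land
        int idx = list.Count;          // 0
        list.Add("first");
        Console.WriteLine(list[idx]);  // prints: first

        idx = list.Count;              // 1
        list.Add("second");
        Console.WriteLine(list[idx]);  // prints: second
    }
}
```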

Consider the following modifications to the class, List insertion code, and new DoActionByIndex method.

    class Program
    {
        private static List<SomeClass> MyList;

        class SomeClass
        {
            private static Random rand = new Random();

            public int ID { get; set; }
            public int Data { get; set; }

            //A new value for tracking the index position of the item
            //  this value gets exposed to the UI layer for passback
            //  this causes the client to no longer use the ID field
            //  which is retained solely for the purpose of
            //  updating the record in the original data source
            public int index { get; set; }

            public SomeClass()
            {
                this.Data = rand.Next(-2147483648, 2147483647);
            }

            public void DoSomeAction()
            {
                if (this.Data > -2147483648)
                    this.Data--;
                else
                    this.Data++;
            }
        }

        //static void DoActionByID(int ID)
        //{
        //    var FoundInstance = MyList.Single((item) => item.ID == ID);

        //    Console.WriteLine("Before: {0}", FoundInstance.Data.ToString());
        //    FoundInstance.DoSomeAction();
        //    Console.WriteLine("After: {0}", FoundInstance.Data.ToString());
        //}

        static void DoActionByIndex(int index)
        {
            //Since we are now passing in the index position, we reference the item
            //  directly rather than doing an ID match comparison
            var FoundInstance = MyList[index];

            Console.WriteLine("Before: {0}", FoundInstance.Data.ToString());
            FoundInstance.DoSomeAction();
            Console.WriteLine("After: {0}", FoundInstance.Data.ToString());
        }


        static void Main(string[] args)
        {
            MyList = new List<SomeClass>();

            for (int i = 0; i < 30000000; i++)
            {
                //the ID is set to (i+1)*2 to simulate an arbitrary ID value from the DB
                //  or file data source
                var CurInstance = new SomeClass() { ID = (i + 1) * 2};

                //Since the index position is 0-based, and the Count property of a List
                //  is 1-based we know that the current Count value of the List will
                //  represent the new index of the next item we insert
                CurInstance.index = MyList.Count;


                MyList.Add(CurInstance);
            }


            //Let's assume the user takes some action and identifies item 18784679
            //  as the item they want to perform an action on
            //  this should be the same item as before, just using index rather than ID

            //  Bear in mind that the Data value is based on a random number, so the
            //  same messages will not be printed on the screen!
            DoActionByIndex(18784679);

            Console.ReadKey();
        }
    }

This runs in 1 ms or less, nearly 1000 times faster. Handling multiple index values passed back from the client works with a similar DoAction method.

    static void DoActionByIndexes(List<int> Indexes)
    {
        foreach (var index in Indexes)
        {
            var FoundInstance = MyList[index];

            Console.WriteLine(FoundInstance.ID.ToString());
            Console.WriteLine("Before: {0}", FoundInstance.Data.ToString());
            FoundInstance.DoSomeAction();
            Console.WriteLine("After: {0}", FoundInstance.Data.ToString());
            Console.WriteLine();
        }
    }

I’ve implemented this concept a couple times now, and have been pretty darn happy with the results. It seems so obvious, I’m a bit embarrassed that it took me so long to think of it. So I figured I’d post it just in case anyone else is struggling with a similar problem.

Questions? Comments? Feel free to drop me a line.

-Alan