Using the Managed Extensibility Framework (MEF) to implement a third-party plugin solution. Part 3 – Developing the plugins

Now that my application supports plugins, I’m ready to publish my plugin SDK for other developers to make use of.  If I take a look in the .\bin\Debug folder for the MyPluginApp project (yes, I’m staying in Debug mode for the purposes of this example), I see the following.

Debug Folder Contents

MyPluginApp.exe, of course, is my Windows Forms application executable, but in addition to that there is a file called MyPluginApp.Plugins.dll.  This .dll is the compiled version of the Class Library named MyPluginApp.Plugins that I added to my solution in Part 2.  This is the .dll that I will need to make available to 3rd party developers (henceforth I will refer to it as my “SDK”), along, of course, with some thorough documentation outlining exactly what I expect them to implement and what metadata I expect.

Now I am going to create a new solution from the perspective of a 3rd party developer who wants to develop a plugin for my application.  I start by creating a new “Class Library” project, because I want my code to compile to a .dll that I can place in the Plugins directory for my application.  I’m going to call my new solution MyPluginApp_StarterPlugins.

Once my new solution and project have been created, I take the MyPluginApp.Plugins.dll file that I downloaded as part of the application developer’s SDK *wink wink* and place it in my project directory.  I then add the file to my project, making sure to set the Build Action to ‘None’ and Copy to Output Directory to ‘Do Not Copy.’  The step of explicitly adding the .dll file to my solution is not required; however, I think it adds some clarity to the purpose of my project.  Now that I’ve physically gotten the SDK into the appropriate place, I need to add a reference to it within the project.  I do this by adding a reference to the actual .dll file that I just moved into my project’s directory.  Also, just to keep Visual Studio from copying the .dll to my bin\Debug and bin\Release folders, I will set the “Copy Local” property of the reference to false.  I also need to add a reference to System.Windows.Forms because my intention is to use MessageBox when my plugins perform their tasks.  Finally, I need to add a reference to System.ComponentModel.Composition in order to gain access to the necessary MEF namespaces.

Project Directory

Project Directory Contents

Solution Explorer Containing File

Solution Explorer View

Reference

Solution Explorer References

I’m going to develop two simple plugins within this project.  The first, called AdditionPlugin, will add the two numbers together.  The second, called MultiplicationPlugin, will multiply the two numbers together.  Both plugins will implement the IPlugin interface, which was defined in the MyPluginApp.Plugins project and is available to me as a 3rd party developer within the .dll that I have added to my project.

Now I’m ready to define my first plugin.  First I will create the AdditionPlugin.  I rename the Class1.cs default file to AdditionPlugin.cs and replace its contents with the following:

using System;

using System.ComponentModel.Composition; //MEF namespace for Exporting
using System.Windows.Forms; 

namespace MyPluginApp.Plugins
{
    [Export(typeof(IPlugin))] //**1
    [ExportMetadata("Name", "Addition Plugin")] //**2
    public class AdditionPlugin : IPlugin
    {
        public void HandleNumbers(decimal num1, decimal num2)
        {
            MessageBox.Show(string.Format("{0} + {1} = {2}", num1, num2, num1 + num2), "Addition Plugin");
        }
    }


    
    //**1 - This lets the CompositionContainer in MyPluginApp know that this class is an export of type IPlugin, which will be loaded
    //  when we call .GetExports<IPlugin, IPluginMetaData>() in the LoadPlugins() method.

    //**2 - This is how metadata is exported by the plugin.  The exported metadata needs to match the metadata interface we defined in MyPluginApp.
    //  In this case we have only 1 defined field in IPluginMetaData called "Name", so that is the only piece of metadata that we want to export.
}

Now if this plugin is enabled (remember the Checked List Box in MyPluginApp?), its HandleNumbers method will be added to the MyButtonClicked event, which is invoked every time the button on my form is clicked.
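Under the hood, this wiring is plain .NET multicast-delegate subscription. Here is a minimal, self-contained sketch of the same mechanism (my own illustration; the EventDemo class and its lambda handler are hypothetical and separate from the app’s actual code):

    using System;

    class EventDemo
    {
        private delegate void NumbersHandler(decimal num1, decimal num2);
        private event NumbersHandler NumbersEntered;

        static void Main()
        {
            var demo = new EventDemo();

            //"Enabling a plugin" amounts to subscribing a handler to the event
            demo.NumbersEntered += (a, b) => Console.WriteLine("{0} + {1} = {2}", a, b, a + b);

            //"Clicking the button" amounts to invoking the event; every subscriber runs
            if (demo.NumbersEntered != null)
                demo.NumbersEntered(3m, 5m);
        }
    }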

Similarly to the AdditionPlugin, I add a new .cs file to the project and name it MultiplicationPlugin.cs.  Here are the contents of that file:

using System;
using System.ComponentModel.Composition; //MEF namespace for Exporting
using System.Windows.Forms;

namespace MyPluginApp.Plugins
{
    [Export(typeof(IPlugin))]
    [ExportMetadata("Name", "Multiplication Plugin")]
    public class MultiplicationPlugin : IPlugin
    {
        public void HandleNumbers(decimal num1, decimal num2)
        {
            MessageBox.Show(string.Format("{0} * {1} = {2}", num1, num2, num1 * num2), "Multiplication Plugin");
        }
    }
}

Now I can build my solution.  After the build is complete, I find the following in the MyPluginApp_StarterPlugins bin\Debug folder:

MyPluginApp_StarterPlugins bin\Debug Folder Contents

MyPluginApp_StarterPlugins.dll needs only to be placed in the \Plugins directory of MyPluginApp prior to MyPluginApp being started up.  Here is what MyPluginApp looks like after I do just that.

MyPluginApp with loaded plugins

Clicking on MyButton at this point still yields the familiar output from before Plugin support was added.

MyPluginApp Output no plugins enabled

Clicking OK on the MessageBox dialog returns me to the application.  Now I change Number 1 to “8,” check both boxes, and then click MyButton.

MyPluginApp Enabled Plugins

MyPluginApp - App Output

MyPluginApp - Addition Plugin Output

MyPluginApp - Multiplication Plugin Output

Unchecking the box next to a plugin prior to clicking MyButton again causes that plugin to become unsubscribed from the event, so its MessageBox no longer pops up.

I’ve certainly only begun to scratch the surface of MEF in this series of posts.  But I hope that this simple tutorial showing a method of developing plugin .dll files completely independently of your original application’s code/solution will be of some value to someone.

I have posted a .zip file containing both complete solutions Here.

Using the Managed Extensibility Framework (MEF) to implement a third-party plugin solution. Part 2–Plugin Support in the Application

Now that I have a feature-complete application from my perspective, I am ready to add plugin support so that other developers can extend the functionality of my application.  I don’t want to make my application open source, so I want to expose just enough information about my application for them to be able to create plugins.

In order to give my application access to MEF, I need to add a reference to System.ComponentModel.Composition within my Windows Forms project.  Once I’ve done that, I need to add a new project to my solution of type “Class Library.”  This project will be my “Plugin API” and will compile to a .dll file that I can distribute to developers as a part of my SDK so that they can create plugins that adhere to my standard.  After creating my new project, I need to add a reference to that project from within my Windows Forms project.

At this point, my application solution looks like this:

Solution Explorer

Here are the contents of IPlugin.cs in the MyPluginApp.Plugins project:

using System;

namespace MyPluginApp.Plugins
{
    //This interface defines what we expect a plugin developer to implement in order for our application to make use of their plugin
    public interface IPlugin
    {
        //This method's signature matches the MyButtonClickedEventHandler in our application
        void HandleNumbers(decimal num1, decimal num2);
    }

    //MEF allows for metadata to be exported along with the object itself, so we need a separate interface to receive any metadata that we expect
    // plugin developers to include as well.
    public interface IPluginMetaData
    {
        string Name { get;}
    }
}

Next I need to update my UI so that it is able to display the list of available plugins and allow the user to enable/disable them at will.  I’ve opted to use a checked list box to achieve that goal.  The plugins will load initially disabled, and the user simply needs to check the box next to a plugin in order to enable it; similarly, unchecking the box will cause the plugin to be disabled.

Here is my New UI

MyPluginApp

And now the code for the form, including the new code to enable plugin support.

using System;
using System.Windows.Forms;

using System.Collections.Generic;
using MyPluginApp.Plugins;  //Our plugin namespace
using System.ComponentModel.Composition.Hosting; //MEF namespace

namespace MyPluginApp
{
    public partial class Form1 : Form
    {
        private delegate void MyButtonClickedEventHandler(decimal num1, decimal num2);
        private event MyButtonClickedEventHandler MyButtonClicked;

        private readonly Dictionary<string, IPlugin> _plugins = new Dictionary<string, IPlugin>();

        public Form1()
        {
            InitializeComponent();

            this.LoadPlugins();
        }

        //the Leave event handler for both text boxes
        private void textNum_Leave(object sender, EventArgs e)
        {
            decimal d;
            var textNum = (TextBox) sender;

            if (!decimal.TryParse(textNum.Text, out d))
            {
                textNum.Text = "5";
            }
        }

        //The KeyDown event handler for both text boxes
        private void txtNum_KeyDown(object sender, KeyEventArgs e)
        {
            if (e.KeyData == Keys.Enter || e.KeyData == Keys.Return)
            {
                btnMyButton.Focus(); //fires the Leave event for the active text box
                btnMyButton.PerformClick();
            }
        }

        //OnClick event handler for the button
        private void btnMyButton_Click(object sender, EventArgs e)
        {
            decimal num1;
            decimal num2;

            MessageBox.Show(string.Format("You entered {0} and {1}", txtNum1.Text, txtNum2.Text));

            if (decimal.TryParse(txtNum1.Text, out num1) && decimal.TryParse(txtNum2.Text, out num2))
            {
                //If our new event has any subscribers, invoke the event.
                if(this.MyButtonClicked != null)
                    this.MyButtonClicked.Invoke(num1, num2);
            }
        }

        private void LoadPlugins()
        {
            var pluginDir = System.IO.Path.GetDirectoryName(Application.ExecutablePath) + @"\Plugins\";

            //Create plugin directory if it doesn't exist
            System.IO.Directory.CreateDirectory(pluginDir);

            //DirectoryCatalog gets a list of all .dll files in our plugin directory
            var catalog = new DirectoryCatalog(pluginDir, "*.dll");

            //CompositionContainer parses all of the dlls in the catalog and identifies their exports
            var container = new CompositionContainer(catalog);

            //Iterate through each export that matches our IPlugin interface and add it to our plugin collection
            foreach (var plugin in container.GetExports<IPlugin, IPluginMetaData>())
            {
                this._plugins.Add(plugin.Metadata.Name, plugin.Value);
                this.clbPlugins.Items.Add(plugin.Metadata.Name, false);
            }

            //Best practice per MS is to dispose the CompositionContainer as soon as we are finished with it.
            container.Dispose();
            catalog.Dispose();
        }

        //ItemCheck event for clbPlugins (checked list box)
        private void clbPlugins_ItemCheck(object sender, ItemCheckEventArgs e)
        {
            var item = clbPlugins.Items[e.Index].ToString();


            if (e.NewValue == CheckState.Checked)
            {
                //If the user checked the box for the plugin, subscribe the plugin's handler to the event
                this.MyButtonClicked += this._plugins[item].HandleNumbers;
            }
            else
            {
                //If the user unchecked the box for the plugin, unsubscribe the plugin's handler from the event
                this.MyButtonClicked -= this._plugins[item].HandleNumbers;
            }
        }
    }
}

This concludes post #2.  In Post #3 I will create a few example plugins in their own solution without access to any of the code contained in Form1.cs, just as I would expect a 3rd party developer to be able to do.

Using the Managed Extensibility Framework (MEF) to implement a third-party plugin solution. Part 1–Introduction

I recently started working on a project that would benefit greatly from the ability to support plugins developed by someone other than myself or anyone else who necessarily had access to our source code. Having never really done anything like this myself, I set off and tried to research it. I found a lot of information about loading compiled .dll files, using reflection and Activator.CreateInstance, and so on. Then in the comments of those posts I’d see someone say, “Why don’t you use MEF?” and that’s it; no other information.

In a quick nutshell, here’s what Wikipedia has to say about MEF as of August 2014:

Managed Extensibility Framework (MEF) is a component of .NET Framework 4.0 aiming to create lightweight, extensible applications. It allows application developers to discover and use extensions with no configuration required. It also lets extension developers easily encapsulate code and avoid fragile hard dependencies. MEF also allows extensions to be reused across applications. MEF was introduced as a part of .NET 4.0 and Silverlight 4.

Additionally, here is the MSDN page for your reference.

In my research, I found a lot of great examples of using MEF to quickly load external dll files and instantiate the types that they had defined within them. The problem was, all of the examples I found had all of the projects (application and plugins) contained within a single solution. I wanted to know how to enable someone else to take my plugin specification and implement it, without me having to distribute my entire solution to them. I never did find a true tutorial on that aspect of MEF, but I did figure out a way that works and that is the way I am going to present in this series. If anyone happens upon this series and has a better method, I’d be happy to learn from you!

In this series of posts I am going to go through the process of creating a .NET 4.0 Windows Forms application that implements a simple plugin system.  The plugin system will load any available plugins on startup and allow the user to enable/disable them.  I will then create an entirely separate solution containing some example plugins that can be loaded and used by my original program.  This should simulate the workflow that I would expect a third-party developer to go through to develop plugins for my application.

The series will consist of 3 posts.  The first post, which you are now reading, will serve as the introduction and cover the initial application (without plugin support).  In the second post, I will add plugin handling capabilities to the application, and in the third post I will create a couple sample plugins within their own solution.

I’m going to use Visual Studio 2010 for this project.  I will include zipped archives for both solutions at the end of the series.

To start out, I have created a simple application that allows the user to enter two numbers, then reports back to them what they entered.

The Application

Result of Button Press

And the code for my application

using System;
using System.Windows.Forms;

namespace MyPluginApp
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        //the Leave event handler for both text boxes
        private void textNum_Leave(object sender, EventArgs e)
        {
            decimal d;
            var textNum = (TextBox) sender;

            if (!decimal.TryParse(textNum.Text, out d))
            {
                textNum.Text = "5";
            }
        }

        //The KeyDown event handler for both text boxes
        private void txtNum_KeyDown(object sender, KeyEventArgs e)
        {
            if (e.KeyData == Keys.Enter || e.KeyData == Keys.Return)
            {
                btnMyButton.Focus(); //fires the Leave event for the active text box
                btnMyButton.PerformClick();
            }
        }

        //OnClick event handler for the button
        private void btnMyButton_Click(object sender, EventArgs e)
        {
            MessageBox.Show(string.Format("You entered {0} and {1}", txtNum1.Text, txtNum2.Text));
        }
    }
}

This concludes the first post of the series.  In the next post I will add the capability for my application to load and utilize plugins.

Renaming poorly named constraints in MS SQL

I recently found myself in a situation where I was working on tuning SQL Server performance in a SQL Server 2008 R2 environment. What made this environment different from other environments that I had worked on in the past was that it had ~400 databases that were all (essentially) identical in structure. The company sells a remote-hosted solution, so whenever a new client comes on board they simply spin off a new, appropriately structured database for that client, which the client can then begin filling with their own data.

In the process of tuning the SQL performance, there were many instances where I needed to make changes to primary and foreign keys. Unfortunately, many of the keys did not have the same name in each database, because when the tables were originally created the constraints were not explicitly named and were instead named by SQL Server itself.

Consider these simple examples.

--Implicit
CREATE TABLE dbo.Users
(
	id_Users int IDENTITY(1,1) NOT NULL PRIMARY KEY
	, UserName nvarchar(20) NOT NULL
	, UserPassword nvarchar(20) NOT NULL
);

CREATE TABLE dbo.UserLoginHistory
(
	id_Users int NOT NULL FOREIGN KEY REFERENCES dbo.Users (id_Users)
	, LoginTime datetime2 NOT NULL
);
CREATE CLUSTERED INDEX cidx_dbo_UserLoginHistory ON dbo.UserLoginHistory (id_Users);

select name, 'pk' AS [type] FROM sys.key_constraints WHERE parent_object_id = OBJECT_ID('dbo.Users')
union all
select name, 'fk' AS [type] FROM sys.foreign_keys WHERE parent_object_id = OBJECT_ID('dbo.UserLoginHistory');

And here is an example of explicitly defined primary and foreign keys:

--Explicit
CREATE TABLE dbo.Users
(
	id_Users int IDENTITY(1,1) NOT NULL
	, UserName nvarchar(20) NOT NULL
	, UserPassword nvarchar(20) NOT NULL
	, CONSTRAINT pk_dbo_Users PRIMARY KEY CLUSTERED (id_Users)
);

CREATE TABLE dbo.UserLoginHistory
(
	id_Users int NOT NULL
	, LoginTime datetime2 NOT NULL
	, CONSTRAINT fk_dbo_UserLoginHistory_dbo_Users FOREIGN KEY (id_Users) REFERENCES dbo.Users (id_Users)
);
CREATE CLUSTERED INDEX cidx_dbo_UserLoginHistory ON dbo.UserLoginHistory (id_Users);

select name, 'pk' AS [type] FROM sys.key_constraints WHERE parent_object_id = OBJECT_ID('dbo.Users')
union all
select name, 'fk' AS [type] FROM sys.foreign_keys WHERE parent_object_id = OBJECT_ID('dbo.UserLoginHistory');

And the results from each:

Implicit
type	name
pk	PK__Users__B3F21DFE5BED93EA
fk	FK__UserLogin__id_Us__5ECA0095

Explicit
type	name
pk	pk_dbo_Users
fk	fk_dbo_UserLoginHistory_dbo_Users

When you allow SQL Server to name your constraints for you, it generates a name that is partially based on the object name, with a set of random hex characters appended to the end. In the system I was working on, this resulted in 400 different names for the same constraint, so any maintenance I wanted to perform against those constraints would first have to dynamically identify the name of the constraint before the logic could proceed; any future maintenance would require the same level of effort. I opted to create these two simple scripts, which rename all primary and foreign keys based on a defined naming convention, thus bringing all of the databases into sync and making future maintenance much easier, so long as future development does not use implicit constraint creation techniques.

-------------------------------------------
--Rename all system generated primary keys using the following format
-- pk_schema_tablename
-------------------------------------------
DECLARE @oldname nvarchar(776)
        , @newname sysname;

DECLARE key_curse CURSOR FAST_FORWARD FOR (SELECT
                                                '[' + SCHEMA_NAME(kc.schema_id) + '].[' + kc.name + ']' AS [oldname]
                                                , 'pk_' + SCHEMA_NAME(kc.schema_id) + '_' + OBJECT_NAME(kc.parent_object_id) AS [newname]
                                                FROM sys.key_constraints kc
                                                WHERE
                                                    kc.type = 'PK'
                                                    AND kc.name <> 'pk_' + SCHEMA_NAME(kc.schema_id) + '_' + OBJECT_NAME(kc.parent_object_id));

OPEN key_curse

    FETCH NEXT FROM key_curse INTO @oldname, @newname;

    WHILE(@@FETCH_STATUS = 0)
    BEGIN

        print substring(@oldname, charindex('.', @oldname) + 1, len(@oldname)) + '     ' + @newname;
        EXEC sp_rename @objname = @oldname, @newname = @newname, @objtype = 'OBJECT';

        FETCH NEXT FROM key_curse INTO @oldname, @newname;
    END

CLOSE key_curse;
DEALLOCATE key_curse;
GO

-------------------------------------------
--Rename all system generated foreign keys using the following format
-- fk_schema1_tablename1_schema2_tablename2
-- where schema/table 1 = the ReferencING table
-- and schema/table 2 = the ReferencED table
-------------------------------------------
DECLARE @oldname nvarchar(776)
		, @newname sysname;

DECLARE key_curse CURSOR FAST_FORWARD FOR (select
											'[' + SCHEMA_NAME(o_ing.schema_id) + '].[' + fk.name + ']' AS [oldname]
											, 'fk_' + SCHEMA_NAME(o_ing.schema_id) + '_' + o_ing.name + '_' + SCHEMA_NAME(o_ed.schema_id) + '_' + o_ed.name AS [newname]
											FROM sys.foreign_keys fk
											JOIN sys.objects o_ing ON o_ing.object_id = fk.parent_object_id
											JOIN sys.objects o_ed ON o_ed.object_id = fk.referenced_object_id
											WHERE
												fk.name <> 'fk_' + SCHEMA_NAME(o_ing.schema_id) + '_' + o_ing.name + '_' + SCHEMA_NAME(o_ed.schema_id) + '_' + o_ed.name); --The format of our standard key

OPEN key_curse

	FETCH NEXT FROM key_curse INTO @oldname, @newname;

	WHILE(@@FETCH_STATUS = 0)
	BEGIN

		print SUBSTRING(@oldname, charindex('.', @oldname) + 1, len(@oldname)) + '     ' + @newname;
		EXEC sp_rename @objname = @oldname, @newname = @newname, @objtype = 'OBJECT';

		FETCH NEXT FROM key_curse INTO @oldname, @newname;
	END

CLOSE key_curse;
DEALLOCATE key_curse;
GO

There was already a system in place to apply a SQL script to every client database, so there was no need to employ sp_MSForeachdb (just kidding!), or rather Aaron Bertrand’s sp_foreachdb.

I haven’t tested this script against any other versions of MSSQL, but I think it should work fine on anything >= SQL Server 2005.

In closing, to all you developers out there, please take the time to explicitly name your constraints!

Index lookup or hash table? Part 2

I ended my last blog post touting the performance advantages of index lookup versus a hash table for data loading as well as data retrieval; however, I added a caveat regarding the issue of removing items from the collection. With the hash table it’s pretty straightforward: Microsoft has provided a built-in method for removing an item based on its key that is relatively inexpensive to use. Removing an item from the index collection requires that we iterate through all of the items that came after the item we just removed and update their index values to reflect their new positions within the collection. If we have a separate client/server design, we must then also synchronize those index changes back to the client so that future calls from the client end up referencing the correct object.

Just how expensive is this maintenance? Could index lookup still be a viable solution for collections that see a moderate to heavy amount of removals? Let’s find out.

I’m using the same objects and junk CSV file as before for my data set. See the files page to acquire the JunkData file for your own testing. Recall that this junk data set is 1 million rows of comma-separated data, the first column of which serves as a numeric row identifier, while the subsequent 6 columns contain random string data. Here are the HashObject and IndexObject definitions again for reference.

    //This class can be used to represent a single row
    sealed class HashObject
    {
        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public HashObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

    //This class can be used to represent a single row
    sealed class IndexObject
    {
        //Added int property to record the object's index location within the List
        public int index { get; set; }

        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public IndexObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

And here is the code I am using to test performance of item removal for both the index lookup and hashtable methods.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();
            var index = new List<IndexObject>();

            DateTime start;
            TimeSpan duration;

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);

                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }


            //For the removal speed test we are going to need two collections, since the ID/Index #s do not match up directly
            var idx_removes = new List<int>();
            var hash_removes = new List<int>();

            //Get the list of Keys in the hash table
            var hashkeys = hash.Keys.ToList();


            //Identify set of items to remove

            //End identify set of items to remove

            start = DateTime.Now;
            foreach (var i in hash_removes)
            {
                RemoveHash(i, hash);
            }
            duration = DateTime.Now - start;
            Console.WriteLine("hashtable: {0}", duration.Ticks);


            start = DateTime.Now;
            foreach (var i in idx_removes)
            {
                RemoveIndex(i, index);
            }
            duration = DateTime.Now - start;
            Console.WriteLine("index: {0}", duration.Ticks);



            Console.ReadKey();
        }

        static void RemoveHash(int ID, Dictionary<int, HashObject> hash)
        {
            hash.Remove(ID);
        }

        static void RemoveIndex(int idx, List<IndexObject> index)
        {
            //First we remove the indicated item
            index.RemoveAt(idx);

            //Removing the item will "shift" all of the following items 1 position towards the
            //  "front" of the List.  Since the old "idx" value was removed, the value that now sits
            //  in the position of idx will also need to have its index value decremented.
            for (int i = idx; i < index.Count; i++)
            {
                index[i].index--;
            }   
        }
    }

Take special note of the “Identify set of items to remove” section near the middle of the code above. From here on out I will post several different code snippets to place in that section, rather than posting the code for the entire program over and over. These snippets will test the effects of several different scenarios.

  • Removing an item at or near the “end” of the collection
  • Removing an item at or near the “beginning” of the collection
  • Removing an item at or near the “middle” of the collection

Once again for all of my tests I will be working in Visual Studio 2010 with a .NET 4.0 Command-Line Application in debug mode.

We’ll start with two different tests that remove items near the end of the collection. Test 1 will actually remove the “last” item, and Test 2 will remove an item that is 20000 positions from the end. Here is the code snippet for Test 1.

    idx_removes.Add(index.Count - 1);
    hash_removes.Add(hashkeys[index.Count - 1]);

And here are the results

hashtable: 0
index: 0

Moving right along to test #2, here’s the code.

    idx_removes.Add(index.Count - 20001);
    hash_removes.Add(hashkeys[index.Count - 20001]);

And the results

hashtable: 10032
index: 10032

As long as we’re working with items near the end of the collection, the index collection and hash table seem to be neck and neck.

In the next scenario, we’ll test the reverse of what we just tested: we’ll remove the “first” item, as well as an item that is 20000 positions into the collection from the beginning. Here is the code snippet for the first test.

    idx_removes.Add(0);
    hash_removes.Add(hashkeys[0]);

And here are the results

hashtable: 10032
index: 1003200

Moving on to the second test of this scenario, here’s the code.

    idx_removes.Add(19999);
    hash_removes.Add(hashkeys[19999]);

And the results

hashtable: 10032
index: 1003200

It’s becoming clear that the Remove method built into .NET’s hash table implementation has a fairly low cost regardless of the item’s position within the collection, while the index collection suffers from a variable amount of overhead proportional to the number of items that follow the item being removed. For our final scenario, we’ll perform a removal from the middle of the collection.

    idx_removes.Add(499999);
    hash_removes.Add(hashkeys[499999]);

And the results

hashtable: 10012
index: 490588

The hash table is the clear winner when it comes to removing items. While index lookup does have an advantage when it comes to the initial load and lookup of items, if you will be doing any significant number of remove operations you may want to look towards the hash table for your needs. However, if item removals are rare, index lookup may still be the better choice because of its clear advantage in lookup speed.
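As an aside of my own (not something tested above): if the order of items in the List does not matter to your application, the re-indexing loop can be avoided entirely by swapping the last item into the vacated slot, which makes index removal O(1) as well. A sketch, reusing the IndexObject type from above:

    static void RemoveIndexSwap(int idx, List<IndexObject> index)
    {
        int last = index.Count - 1;

        //Move the last item into the vacated slot and fix up its index value
        //  (when idx == last this is a harmless self-assignment)
        index[idx] = index[last];
        index[idx].index = idx;

        //Removing the final element shifts nothing, so no re-indexing loop is needed
        index.RemoveAt(last);
    }

The trade-off, of course, is that the collection no longer preserves insertion order.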

As always, questions or comments are welcome.
-Alan

Index Lookup or hash table?

I recently shared out the link to my blog to a few friends and colleagues. After all, that’s how you get readers, right? Anyhow, my best friend, who happens to be a PowerShell guru, was talking to me this morning about my Index Lookup post, and his question was, “Why didn’t you just use a hashtable?”

We don’t, at least I don’t, think of the Hashtable very often anymore, ever since .NET 2.0 came along and swept us off our feet with the wonderful happiness that is generics. So when a PowerShell guy says “Hashtable,” what you will probably think of from a .NET perspective is a Dictionary<TKey, TValue>; this is simply the generic version of the Hashtable of old. In simplest terms, a hash table provides the ability to “map” a key to a specific value and then quickly retrieve the value when you supply the key. For an in-depth discussion of hash tables, see Wikipedia; a good discussion of the Dictionary<TKey, TValue> specifically can be found on MSDN here.
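If you haven’t worked with it before, here is a minimal sketch of that key-to-value mapping (my own illustration; the capitals example is hypothetical):

    using System;
    using System.Collections.Generic;

    class DictionaryDemo
    {
        static void Main()
        {
            //Map keys to values...
            var capitals = new Dictionary<string, string>();
            capitals.Add("France", "Paris");
            capitals["Japan"] = "Tokyo"; //the indexer adds or overwrites

            //...then quickly retrieve a value by supplying its key
            Console.WriteLine(capitals["Japan"]); //Tokyo

            //TryGetValue avoids an exception when the key might be absent
            string city;
            if (capitals.TryGetValue("Spain", out city))
                Console.WriteLine(city);
        }
    }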

I have to admit, my confidence in my Index Lookup post was somewhat shaken when he asked me about it so matter-of-factly. Had I made a mistake? Was a hash table a better solution? He and I collaborated for a little over an hour this morning testing various scenarios, which will be the subject of this post. First, we needed a data set, so I did something I should have done years ago: I generated a CSV file containing 1 million rows of junk data, the first column of which is a semi-sequential integer ID, followed by 6 columns of random string data. I’ll post a link to this dataset on the files page. The file looks like 1 million rows of this:

1,tfxnvwcyycrgth,ggqbrhvgd,ltpdhe,udygxoarwof,aiozgrspzejyyukjdm,ryelqidjspjvsqid
2,xtjloonptahderoju,zvwxqciaapjr,prpljdic,rvxiwfnyv,ycjhyzdgdoo,gbsrlpfffllpdaaigc
4,btwtfgjqlgx,nutpvpe,ysqgmqtexzvzabme,imlcytzifx,klfsbjoidqgpmfsmk,fmhofmdtrzzryhsytm
5,cxqndbtcxyxggokz,nvfsphsojlr,wolgsizoyenlkfvttsd,hglyganpyzuq,dbjla,fklnjiuijmyv
8,idrpdnzsujpzjx,gdlhtexdw,bpmrddaddhyllgya,lbvnzllbgbqclkauw,xnaq,pvfpkvs
10,jsjlbbf,ebqxc,bsxyqpfbnv,ugstwagnwhjcyytq,smksdetnsledzhfoqhv,iiajjgdsqya
11,yfiblcqbzgepckwzen,rydqklai,abwtlgetyedoqtnbdl,zjdocqjy,oxpkttbyjkdvgcba,tafqpgatjn
13,nudchjeapsvxsprthl,hcnrwitezqorjfe,phbnwxalsy,naxnth,guzcgtoffjablr,eclfxwjcczye
15,dbdvjtyk,nbhzqrqkrnl,wvbnkuwrwciphbdgfrs,dspkyxfrmrilshh,bhzzbsp,tsspnjezus
16,dgrpmoffeuwwzuyfx,trlderffqdcnjqz,dzjsmkwnkpw,supdggbbjabzjgxbc,lgwnrywffiuxto,jmvnzgcoyccxkjfmv

This structure can be represented by the following objects, for storage in a Dictionary and a List respectively:

    //This class can be used to represent a single row
    sealed class HashObject
    {
        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public HashObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

    //This class can be used to represent a single row
    sealed class IndexObject
    {
        //Added int property to record the object's index location within the List
        public int index { get; set; }

        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public IndexObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

Clearly there are two different metrics to test. The first is the amount of memory used by each method; the second is the amount of time required to access a given number of objects. We’ll start with testing memory. .NET doesn’t give us many good ways to determine the amount of memory a given object or collection is using, so for our purposes I am going to simply use Task Manager to determine the amount of memory that the process is using. I will be running all of my code with Visual Studio 2010 as a .NET Framework 4.0 command-line application in debug mode.
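As an aside of my own (the numbers below were still taken from Task Manager), a rough in-process alternative is GC.GetTotalMemory, which reports the size of the managed heap. Sampling it before and after the load gives a ballpark figure; the MemoryProbe helper here is hypothetical:

    using System;

    static class MemoryProbe
    {
        //Runs the supplied load action and returns the approximate
        //  managed-heap growth, in bytes, that it caused
        public static long Measure(Action loadAction)
        {
            long before = GC.GetTotalMemory(true); //force a full collection first
            loadAction();
            long after = GC.GetTotalMemory(true);  //collect again before sampling
            return after - before;
        }
    }

Note that whatever the action loads must remain referenced outside of it (e.g. a collection declared in an enclosing scope), or the second collection pass will simply reclaim it.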

First, we’ll measure the memory usage of the hash table implementation. I’ve omitted the HashObject definition for brevity.

    using System;
    using System.Collections.Generic;

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);
                }
            }            

            Console.ReadKey();
        }
    }

I repeated this process three times and took averages of both memory usage and the amount of time required to fully load the .csv into the hashtable. The average amount of memory allocated to my application’s process was 327212 KB, and the average amount of time to load was 2763 ms. Now let’s repeat the process with the “index” version. Again, I will omit the IndexObject definition for brevity’s sake.

    using System;
    using System.Collections.Generic;

    class Program
    {
        static void Main(string[] args)
        {
            var index = new List<IndexObject>();

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }            

            Console.ReadKey();
        }
    }

Again the process was executed three times and the results averaged; memory usage was 310432 KB on average, and time to load was an average of 2673 ms. In both cases the index method seems to be more efficient, but are 16780 KB of memory and 90 ms of CPU time worth it? What if the hash table can more quickly satisfy item lookups? Let’s continue onward and find out.

Here is the code snippet I am using to get a distributed list of items to perform a lookup on. Changing the modulo factor (the ‘7’ in the i % 7 test below) to a larger number will decrease the sample size, while decreasing it will increase the sample size.

    //For the lookup speed test we are going to need two collections, since the ID/Index #s do not match up directly
    var idx_lookups = new List<int>();
    var hash_lookups = new List<int>();

    //Get the list of Keys in the hash table
    var hashkeys = hash.Keys.ToList();

    for (int i = 0; i < index.Count; i++)
    {
        //This will give us an approximately 14% sample size
        if (i % 7 == 0)
        {
            idx_lookups.Add(i);

            //This will add the corresponding "ID" to the hash_lookups
            // collection, so both the index and hash method will be looking
            // up the same items
            hash_lookups.Add(hashkeys[i]);
        }
    }

Here is the full code of my program at this point. I’m excluding the definitions for HashObject and IndexObject to save space, as they have not changed from how they were originally posted above. This code will load both the hash table and my index lookup list, then identify approximately 14% of the records and look them up in both the hash table and the list by index. The amount of time in ticks (1 tick = 100 nanoseconds) will be printed to the screen for each method.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();
            var index = new List<IndexObject>();

            DateTime start;

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');
                    
                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);
                    
                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }


            //For the lookup speed test we are going to need two collections, since the ID/Index #s do not match up directly
            var idx_lookups = new List<int>();
            var hash_lookups = new List<int>();

            //Get the list of Keys in the hash table
            var hashkeys = hash.Keys.ToList();

            for (int i = 0; i < index.Count; i++)
            {
                //This will give us an approximately 14% sample size
                if (i % 7 == 0)
                {
                    idx_lookups.Add(i);

                    //This will add the corresponding "ID" to the hash_lookups
                    // collection, so both the index and hash method will be looking
                    // up the same items
                    hash_lookups.Add(hashkeys[i]);
                }
            }

            start = DateTime.Now;
            foreach (var i in hash_lookups)
            {
                var obj = hash[i];
            }
            Console.WriteLine("hashtable: {0}", (DateTime.Now - start).Ticks);


            start = DateTime.Now;
            foreach (var i in idx_lookups)
            {
                var obj = index[i];
            }
            Console.WriteLine("index: {0}", (DateTime.Now - start).Ticks);

            
            Console.ReadKey();
        }
    }

On my system, this is the output:

hashtable: 50000
index: 20000

These results seem to remain consistent across different modulo values, with the index lookup method always being between 2 and 2.5 times faster than the hash table lookup.
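One caveat of my own: DateTime.Now has a fairly coarse resolution, which is part of why the results land on such round tick counts. System.Diagnostics.Stopwatch is the usual higher-resolution tool for this kind of micro-benchmark; a drop-in replacement for the hash table timing above might look like this:

    var sw = System.Diagnostics.Stopwatch.StartNew();
    foreach (var i in hash_lookups)
    {
        var obj = hash[i];
    }
    sw.Stop();

    //Stopwatch.Elapsed.Ticks is in the same 100 ns units used above, but is
    //  driven by the high-resolution performance counter when one is available
    Console.WriteLine("hashtable: {0}", sw.Elapsed.Ticks);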

So, in summary:

  • The index lookup method uses less memory
  • The index lookup method loads more quickly
  • The index lookup method is able to access items more quickly

The index lookup method does have one big disadvantage compared to the hash table, which I should probably mention in all fairness since I’ve been wailing on the hash table pretty hard so far. Removing items from the collection is a very dicey proposition, since it would invalidate all of the index properties that came “after” the item you removed. It’s up to you to decide which method is most appropriate for your given development situation.

Questions? Comments? Just leave a post.

Until next time!
-Alan

CSV Importing in C#

Reading and writing flat files was something that I had considered my bread and butter for quite some time. Many years ago, when I was still in high school and considered myself a master of Visual Basic 6, I was involved in a startup, if you could call it that, whose goal was to set up an online review site of all things outdoors. Camping, hiking, rock climbing, etc., all included on one site. The idea was to allow people to not only get reviews of potential destinations, but to find out what was nearby at the same time. So, for example, upon selecting a potential campground, a site visitor might be able to see other activities within a certain distance. Nowadays I’m sure someone has pulled this idea off, but this isn’t a post about an outdoor activity review site; this is a post about reading CSV files. At that time, I was tasked with writing data-entry client software that could be used by other staff to input all of the information about a given location in discrete data fields, as well as a general description text area for free text. The application had to be usable offline, and then able to later synchronize its data with our database.

I hardly deserved to call myself a developer at the time, but, confident in my knowledge, I burst forth and wrote glorious VB6 code. I stored the entered data locally in a CSV file and all was well, until some nefarious (in my opinion) user dared to put a comma in their review. Of course this caused the database sync to fail miserably and all hope was lost. I then went about updating the client’s code to replace any commas with a placeholder character, which I would then scan for during the sync operation. All was fixed and glorious once again.

I tell this story because up until recently I was perfectly happy with this method of addressing what, to me, was a limitation of the CSV format. It was simple, fast, and easy to support and maintain. There are, of course, many solutions to this problem when you have full control over the file that is being written and read. You could use a different delimiter, possibly even one that the user would have a very difficult time typing with a regular keyboard; or a fixed-width format, which works great as long as the client enforces the same length restrictions.
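To make the different-delimiter idea concrete, here is a tiny sketch of my own (the field values are made up); a tab works well because a user can rarely type one into a single-line entry field:

    using System;

    class DelimiterDemo
    {
        static void Main()
        {
            //Writing: join the fields with a tab instead of a comma
            string[] fields = { "campground", "5 stars", "bring bug spray, seriously" };
            string line = string.Join("\t", fields);

            //Reading: split on the same delimiter; the embedded comma survives intact
            string[] parsed = line.Split('\t');
            Console.WriteLine(parsed[2]); //bring bug spray, seriously
        }
    }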

But I’m wandering off topic again. More recently, I finally encountered a project that required me to actually read CSV files that I did not create. That is, CSV files that someone might have exported from Excel, or Google Docs, or something similar. I no longer had control over the incoming format, meaning I would have to load data that already had commas of its own within a field. But I’m getting ahead of myself.

Consider the following simple data set in Excel.
Simple excel dataset

When this data is saved as a CSV file, here is what the resulting file contents look like.

Col1,Col2
bland,cake
text,is
is,good
always,especially
easy,new
to,york
import,cheesecake

This sort of data can be easily imported using logic similar to the following.

    using System.Collections.Generic;

    class Program
    {
        static void Main(string[] args)
        {
            var MyObjects = new List<MyObject>();

            using (var sr = new System.IO.StreamReader(@".\csv_test1.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();
                    var fields = line.Split(',');

                    var obj = new MyObject(fields[0], fields[1]);

                    MyObjects.Add(obj);
                }

                sr.Close();
            }
        }
    }

    class MyObject
    {
        private string m_Col1;
        private string m_Col2;
        public string Col1 { get { return m_Col1; } }
        public string Col2 { get { return m_Col2; } }

        public MyObject(string col1, string col2)
        {
            this.m_Col1 = col1;
            this.m_Col2 = col2;
        }
    }

This is fairly similar to how I would have handled reading a CSV file in the past. Of course, having the benefit of nice clean data certainly helps. Let’s look at another example.
A more complicated excel dataset

Saving this dataset as a CSV file and looking at the contents presents a somewhat different picture from before.

Col1,Col2
"2,500",cake
"$2,500.00 ",is
"""twenty-five hundred""",good
"I really like "" marks",especially
"I REALLY like """""" marks",new
"spreading "" them "" around "" here",york
"fake cell ending "", mid cell",cheesecake
"this cell
spans 2 lines",with strawberries

This demonstrates the finer details of what is acceptable in a CSV; a small sketch applying these rules from the writing side follows the list.

  • If the field contains no commas or quotation marks, nothing special happens; the text is simply placed directly in the file without modification (see line 1 above)
  • If the field contains a comma, the field is enclosed in quotation marks (see line 2 above)
  • If the field contains a quotation mark, the field is enclosed in quotation marks and the quotation marks within the field are doubled, which serves as a type of “escaping” (see line 5 above)
  • If the field contains a line feed, the field is enclosed in quotation marks (see lines 9 & 10 above)
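Here is a minimal writer-side sketch of those rules (my own illustration; this EscapeCsvField helper is hypothetical and not part of the importer developed below):

    static string EscapeCsvField(string field)
    {
        //Commas, quotation marks, and line breaks all force the field to be quoted
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0)
        {
            //Double any embedded quotation marks, then wrap the whole field in quotes
            return "\"" + field.Replace("\"", "\"\"") + "\"";
        }

        //Otherwise the text is placed in the file without modification
        return field;
    }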

Seems simple enough, but when I first looked into this I thought, “Surely there’s a good code snippet out there for this,” so off to Google I went. I found some interesting articles on using an ODBC text driver, but since I was developing within the context of a Metro (Windows 8 Style UI) application that idea was out. I found an assortment of pre-compiled libraries with various licensing strategies. I also found some suggestions to add a dependency on the Microsoft.VisualBasic namespace and use its TextFieldParser; this did not appeal to me either. So I sat down at the drawing board and came up with a specification.

  1. If a field does not begin with a quotation mark, the field ends once the next comma is located or the end of the line is reached
  2. If a field begins with a quotation mark, the field ends once a lone (un-doubled) quotation mark is identified, since quotation marks within the field are doubled

My new code had to be more generic, capable of importing a CSV file regardless of the number of columns. It also had to work within the new Windows 8 Style-UI restrictions, which means I would have to use Windows.Storage.FileIO rather than a StreamReader; however, at the time of this writing I only have access to Visual Studio 2010, so bear with me.

The following code provides a self-contained class that imports the CSV file into a collection of Row objects, which are themselves simply a List of string values representing each column. This is simply a proof of concept; deploying this code would likely involve a few tweaks, such as allowing for column headings, empty rows, a delimiter other than a comma, etc. However, I hope this code will be useful to someone who may need some help getting started with their own CSV importing logic without potentially cumbersome licensing requirements.

    using System;
    using System.Collections.Generic;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            var MyFile = new CSVFile(@".\csv_test2.csv");

            for (int row = 0; row < MyFile.Rows.Count; row++)
            {
                Console.WriteLine("Row {0}", row.ToString());

                for (int col = 0; col < MyFile.Rows[row].Fields.Count; col++)
                {
                    Console.WriteLine("Field {0}  -  {1}", col.ToString(), MyFile.Rows[row].Fields[col]);
                }

                Console.WriteLine();
            }


            Console.ReadKey();
        }
    }

    public sealed class CSVFile
    {
        private int m_ColumnCount;
        private List<CSVRow> m_Rows;

        public int ColumnCount { get { return m_ColumnCount; } }
        public System.Collections.ObjectModel.ReadOnlyCollection<CSVRow> Rows { get { return m_Rows.AsReadOnly(); } }


        public CSVFile(string path)
        {
            m_Rows = new List<CSVRow>();

            int curCharVal = 0;
            char curChar;
            bool inQuotes = false;
            var curField = new StringBuilder();
            var curRow = new CSVRow();

            try
            {
                using (var sr = new System.IO.StreamReader(path))
                {
                    curCharVal = sr.Read();

                    while (curCharVal >= 0)
                    {
                        //We can't be sure if the file we've received uses Line Feed (char 10) by itself, or Carriage Return / Line Feed (chars 13 / char 10) to indicate a new line
                        //what we can be sure of, however, is that we really don't care if there's a 13!
                        while (curCharVal == 13)
                        {
                            curCharVal = sr.Read();

                            if (curCharVal == -1)
                                break;
                        }

                        //Sanity check, if we ended up with a -1 due to curCharVal == 13 loop...
                        //Should never happen, but god knows what some people's CSV files look like
                        if (curCharVal == -1)
                        {
                            curRow.Fields.Add(curField.ToString());
                            curField.Clear();
                            this.m_Rows.Add(curRow);

                            break;
                        }

                        curChar = (char)curCharVal;

                        if (inQuotes)
                        {
                            //If we're in a quote enclosed field, we need to identify
                            //  if these new quotes are escaped (doubled)
                            //  and if they are, only add a single set of quotes to our
                            //  current field.  If they are not escaped, then we are
                            //  no longer in a quote enclosed field
                            if (curChar == '"')
                            {
                                curCharVal = sr.Read();

                                if (curCharVal >= 0)
                                {
                                    curChar = (char)curCharVal;

                                    if (curChar != '"')
                                    {
                                        inQuotes = false;
                                        
                                        //The new character we just imported (presumably a comma)
                                        //  will be handled once we fall through into the next if block below
                                    }
                                    else
                                    {
                                        curField.Append(curChar);
                                    }
                                }
                            }
                            else
                            {
                                curField.Append(curChar);
                            }
                        }
                        
                        //This is a separate if statement, rather than an else clause
                        //  because within the if block above, the inQuotes value could be
                        //  set to false, in which case we want to evaluate the logic
                        //  within this code block
                        if(!inQuotes)
                        {
                            if (curField.Length == 0 && curChar == '"')
                            {
                                inQuotes = true;
                            }
                            else if (curChar == ',')
                            {
                                curRow.Fields.Add(curField.ToString());
                                curField.Clear();
                            }
                            else if (curCharVal == 10)
                            {
                                curRow.Fields.Add(curField.ToString());
                                curField.Clear();

                                //We're done with this row, add it to the list and set
                                //  ourselves up for a fresh row.
                                this.m_Rows.Add(curRow);
                                curRow = new CSVRow();
                            }
                            else
                            {
                                curField.Append(curChar);
                            }
                        }


                        curCharVal = sr.Read();

                        //We just reached the end of the file.
                        //  If a row is still being accumulated (i.e. the file did not
                        //  end with a line feed), finish it and add it to the list of
                        //  rows before the loop ends
                        if (curCharVal == -1 && (curField.Length > 0 || curRow.Fields.Count > 0))
                        {
                            curRow.Fields.Add(curField.ToString());
                            curField.Clear();
                            this.m_Rows.Add(curRow);
                        }
                    }
                }
            }
            catch
            {
                m_Rows.Clear();
                m_ColumnCount = 0;
            }
        }


        public sealed class CSVRow
        {
            private List<string> m_Fields;

            public List<string> Fields { get { return m_Fields; } }

            public CSVRow()
            {
                m_Fields = new List<string>();
            }
        }
    }

Here’s the output of the above code:

Row 0
Field 0  -  Col1
Field 1  -  Col2

Row 1
Field 0  -  2,500
Field 1  -  cake

Row 2
Field 0  -  $2,500.00
Field 1  -  is

Row 3
Field 0  -  "twenty-five hundred"
Field 1  -  good

Row 4
Field 0  -  I really like " marks
Field 1  -  especially

Row 5
Field 0  -  I REALLY like """ marks
Field 1  -  new

Row 6
Field 0  -  spreading " them " around " here
Field 1  -  york

Row 7
Field 0  -  fake cell ending ", mid cell
Field 1  -  cheesecake

Row 8
Field 0  -  this cell
spans 2 lines
Field 1  -  with strawberries

I’ve used code very similar to this to import 150+ MB CSV files in a matter of seconds without issue.

As always, if there are any questions, feel free to drop me a line. If you have a more elegant solution, I’d love to see that too!

Until next time.

-Alan