Index lookup or hash table? Part 2

I ended my last blog post touting the performance advantages of index lookup versus a hash table for data loading as well as data retrieving, however I added a caveat regarding the issue of removing items from the collection. With the hash table it’s pretty straightforward, Microsoft has provided a built-in method for removing an item based on its key that is relatively inexpensive to use. Removing an item from the index collection requires that we iterate through all of the items that came after the item we just removed and update their index value to represent their new position within the collection. If we have a separate client / server design, we must then also synchronize those index changes back to the client so that future calls from the client end up referencing the correct object.

Just how expensive is this maintenance; could index lookup still be a viable solution for collections that have a moderate to heavy amount of removals? Lets find out.

I’m using the same objects and junk csv file as before for my data set. See the files page to acquire the JunkData file for your own testing. Recall that this junk data set is 1 million rows of comma-separated data, the first column of which serves as a numeric row identifier, while the subsequent 6 columns contain random string data. For reference, here are the HashObject and IndexObject definitions again for reference.

    //This class can be used to represent a single row
    sealed class HashObject
    {
        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public HashObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

    //This class can be used to represent a single row
    sealed class IndexObject
    {
        //Added int property to record the object's index location within the List
        public int index { get; set; }

        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public IndexObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

And here is the code I am using to test performance of item removal for both the index lookup and hashtable methods.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();
            var index = new List<IndexObject>();

            DateTime start;
            TimeSpan duration;

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);

                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }


            //For the removal speed test we are going to need two collections, since the ID/Index #s do not match up directly
            var idx_removes = new List<int>();
            var hash_removes = new List<int>();

            //Get the list of Keys in the hash table
            var hashkeys = hash.Keys.ToList();


            //Identify set of items to remove

            //End identify set of items to remove

            start = DateTime.Now;
            foreach (var i in hash_removes)
            {
                RemoveHash(i, hash);
            }
            duration = DateTime.Now - start;
            Console.WriteLine("hashtable: {0}", duration.Ticks);


            start = DateTime.Now;
            foreach (var i in idx_removes)
            {
                RemoveIndex(i, index);
            }
            duration = DateTime.Now - start;
            Console.WriteLine("index: {0}", duration.Ticks);



            Console.ReadKey();
        }

        static void RemoveHash(int ID, Dictionary<int, HashObject> hash)
        {
            hash.Remove(ID);
        }

        static void RemoveIndex(int idx, List<IndexObject> index)
        {
            //First we remove the indicated item
            index.RemoveAt(idx);

            //Removing the item will "shift" all  of the following items 1 position towards the
            //  "front" of the List.  Since the old "idx" value was removed, the value that now sits
            //  in the position of idx will also need to have its index value decrimented.
            for (int i = idx; i < index.Count; i++)
            {
                index[i].index--;
            }   
        }
    }

Take special note of lines 37-39 in the code above. From here on out will post several different code snippets to place in that section, rather than posting the code for the entire program over and over. These snippets will test the effects of several different scenarios.

  • Removing an item at or near the “end” of the collection
  • Removing an item at or near the “beginning” of the collection
  • Removing an item at or near the “middle” of the collection

Once again for all of my tests I will be working in Visual Studio 2010 with a .NET 4.0 Command-Line Application in debug mode.

We’ll start with two different tests that remove items near the end of the collection. Test 1 will actually remove the “last” item, and Test 2 will remove an item that is 20000 positions from the end. Here is the code snippet for Test 1.

    idx_removes.Add(index.Count - 1);
    hash_removes.Add(hashkeys[index.Count - 1]);

And here are the results

hashtable: 0
index: 0

Moving right along to test #2, here’s the code.

    idx_removes.Add(index.Count - 20001);
    hash_removes.Add(hashkeys[index.Count - 20001]);

And the results

hashtable: 10032
index: 10032

As long as we’re working with items near the end of the collection, the index collection and hash table seem to be neck and neck.

In the next scenario, we’ll test the reverse situation from what we just tested. We’ll remove the “first” item, as well as an item that is 20000 positions into the collection from the beginning. Here is the code snipped for the first test.

    idx_removes.Add(0);
    hash_removes.Add(hashkeys[0]);

And here are the results

hashtable: 10032
index: 1003200

Moving right on to the second test of this scenario. Here’s the code.

    idx_removes.Add(19999);
    hash_removes.Add(hashkeys[19999]);

And the results

hashtable: 10032
index: 1003200

It’s beginning to become clear that the Remove method built into .NET’s hash table implementation has a fairly low cost regardless of the items position within the collection, while the index collection suffers from a variable amount of overhead that depends on how near to the end of the collection the item being removed is. For our final scenario, we’ll perform a removal from the middle of the collection.

    idx_removes.Add(499999);
    hash_removes.Add(hashkeys[499999]);

And the results

hashtable: 10012
index: 490588

The hash table is the clear winner when it comes to removing items. While index lookup does have an advantage when it comes to the initial load and lookup of items, if you will be doing any significant number of remove operations you may want to look towards the hash table for your needs. However if item removals are rare, index lookup may still be the better choice because of its clear advantage in lookup speed.

As always, questions or comments are welcome.
-Alan

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s