Tag Archives: hash table

Index lookup or hash table? Part 2

I ended my last blog post touting the performance advantages of index lookup over a hash table for both data loading and data retrieval, but I added a caveat regarding the issue of removing items from the collection. With the hash table it's pretty straightforward: Microsoft provides a built-in method for removing an item by its key, and it is relatively inexpensive to use. Removing an item from the index collection, however, requires that we iterate through every item that came after the removed item and update its index value to reflect its new position within the collection. If we have a separate client / server design, we must then also synchronize those index changes back to the client so that future calls from the client reference the correct objects.

Just how expensive is this maintenance? Could index lookup still be a viable solution for collections with a moderate to heavy amount of removals? Let's find out.

I'm using the same objects and junk CSV file as before for my data set. See the files page to acquire the JunkData file for your own testing. Recall that this junk data set is 1 million rows of comma-separated data; the first column serves as a numeric row identifier, while the subsequent 6 columns contain random string data. Here are the HashObject and IndexObject definitions again for reference.

    //This class can be used to represent a single row
    sealed class HashObject
    {
        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public HashObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

    //This class can be used to represent a single row
    sealed class IndexObject
    {
        //Added int property to record the object's index location within the List
        public int index { get; set; }

        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public IndexObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

And here is the code I am using to test the performance of item removal for both the index lookup and hash table methods.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();
            var index = new List<IndexObject>();

            DateTime start;
            TimeSpan duration;

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);

                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }


            //For the removal speed test we are going to need two collections, since the ID/Index #s do not match up directly
            var idx_removes = new List<int>();
            var hash_removes = new List<int>();

            //Get the list of Keys in the hash table
            var hashkeys = hash.Keys.ToList();


            //Identify set of items to remove

            //End identify set of items to remove

            start = DateTime.Now;
            foreach (var i in hash_removes)
            {
                RemoveHash(i, hash);
            }
            duration = DateTime.Now - start;
            Console.WriteLine("hashtable: {0}", duration.Ticks);


            start = DateTime.Now;
            foreach (var i in idx_removes)
            {
                RemoveIndex(i, index);
            }
            duration = DateTime.Now - start;
            Console.WriteLine("index: {0}", duration.Ticks);



            Console.ReadKey();
        }

        static void RemoveHash(int ID, Dictionary<int, HashObject> hash)
        {
            hash.Remove(ID);
        }

        static void RemoveIndex(int idx, List<IndexObject> index)
        {
            //First we remove the indicated item
            index.RemoveAt(idx);

            //Removing the item "shifts" all of the following items 1 position towards the
            //  "front" of the List.  Since the old "idx" value was removed, the value that now
            //  sits in the position of idx must also have its index value decremented.
            for (int i = idx; i < index.Count; i++)
            {
                index[i].index--;
            }   
        }
    }

Take special note of the "Identify set of items to remove" section in the code above. From here on out I will post several different code snippets to place in that section, rather than posting the code for the entire program over and over. These snippets will test the effects of several different scenarios.

  • Removing an item at or near the “end” of the collection
  • Removing an item at or near the “beginning” of the collection
  • Removing an item at or near the “middle” of the collection

Once again for all of my tests I will be working in Visual Studio 2010 with a .NET 4.0 Command-Line Application in debug mode.

We’ll start with two different tests that remove items near the end of the collection. Test 1 will actually remove the “last” item, and Test 2 will remove an item that is 20000 positions from the end. Here is the code snippet for Test 1.

    idx_removes.Add(index.Count - 1);
    hash_removes.Add(hashkeys[index.Count - 1]);

And here are the results

hashtable: 0
index: 0

Moving right along to test #2, here’s the code.

    idx_removes.Add(index.Count - 20001);
    hash_removes.Add(hashkeys[index.Count - 20001]);

And the results

hashtable: 10032
index: 10032

As long as we’re working with items near the end of the collection, the index collection and hash table seem to be neck and neck.

In the next scenario, we'll test the reverse of what we just tested. We'll remove the "first" item, as well as an item that is 20000 positions into the collection from the beginning. Here is the code snippet for the first test.

    idx_removes.Add(0);
    hash_removes.Add(hashkeys[0]);

And here are the results

hashtable: 10032
index: 1003200

Moving right on to the second test of this scenario. Here’s the code.

    idx_removes.Add(19999);
    hash_removes.Add(hashkeys[19999]);

And the results

hashtable: 10032
index: 1003200

It's becoming clear that the Remove method built into .NET's hash table implementation has a fairly low cost regardless of the item's position within the collection, while the index collection suffers a variable amount of overhead proportional to the number of items that follow the removed item: the closer to the beginning of the collection the removal happens, the more index values must be updated. For our final scenario, we'll perform a removal from the middle of the collection.

    idx_removes.Add(499999);
    hash_removes.Add(hashkeys[499999]);

And the results

hashtable: 10012
index: 490588

The hash table is the clear winner when it comes to removing items. While index lookup does have an advantage in the initial load and lookup of items, if you will be doing any significant number of remove operations you may want to look towards the hash table for your needs. However, if item removals are rare, index lookup may still be the better choice because of its clear advantage in lookup speed.
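One footnote worth adding: if the order of the items within the List doesn't matter to your application, a well-known trick (not part of the tests above) sidesteps most of the index maintenance. Instead of RemoveAt plus a fix-up loop, overwrite the removed slot with the last item, fix that single item's index, and remove the tail. Here is a minimal sketch, with IndexObject trimmed down to just the properties the trick touches:

```csharp
using System;
using System.Collections.Generic;

sealed class IndexObject
{
    public int index { get; set; }
    public int ID { get; set; }
    public IndexObject(int id) { this.ID = id; }
}

static class SwapRemoveDemo
{
    //Remove the item at position idx in O(1) by moving the last item
    //  into its slot; only that one moved item's index needs fixing.
    //Note: this does NOT preserve the order of the List.
    public static void SwapRemove(List<IndexObject> index, int idx)
    {
        int last = index.Count - 1;
        if (idx < last)
        {
            index[idx] = index[last];
            index[idx].index = idx;   //fix the single moved item
        }
        index.RemoveAt(last);         //removing the tail shifts nothing
    }

    static void Main()
    {
        var list = new List<IndexObject>();
        for (int i = 0; i < 5; i++)
            list.Add(new IndexObject(i * 10) { index = i });

        SwapRemove(list, 1);               //removes ID 10; ID 40 moves into slot 1

        Console.WriteLine(list.Count);     //4
        Console.WriteLine(list[1].ID);     //40
        Console.WriteLine(list[1].index);  //1
    }
}
```

The trade-off is that the List no longer reflects insertion order, so this only applies when the index is purely a lookup handle rather than a meaningful position.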

As always, questions or comments are welcome.
-Alan

Index Lookup or hash table?

I recently shared out the link to my blog to a few friends and colleagues. After all, that’s how you get readers, right? Anyhow, my best friend, who happens to be a PowerShell guru, was talking to me this morning about my Index Lookup post; and his question was, “why didn’t you just use a hashtable?”

We don't (at least I don't) think of the Hashtable very often anymore, ever since .NET 2.0 came along and swept us off our feet with the wonderful happiness that is generics. So when a PowerShell guy says "hashtable," what you will probably think of from a .NET perspective is a Dictionary<TKey, TValue>, which is simply the generic version of the Hashtable of old. In simplest terms, a hash table provides the ability to "map" a key to a specific value, and then quickly retrieve the value when you supply the key. For an in-depth discussion of hash tables, see Wikipedia; a good discussion of Dictionary<TKey, TValue> specifically can be found on MSDN.
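For anyone who hasn't touched Dictionary<TKey, TValue> in a while, the key-to-value mapping behavior looks like this (a trivial sketch, separate from the benchmark code that follows):

```csharp
using System;
using System.Collections.Generic;

static class DictionaryDemo
{
    static void Main()
    {
        //Key -> value mapping; lookups hash the key to find its bucket,
        //  so retrieval cost does not grow with the size of the collection.
        var ages = new Dictionary<string, int>();
        ages.Add("alice", 31);
        ages["bob"] = 45;                  //the indexer adds or overwrites

        Console.WriteLine(ages["alice"]);  //31

        //TryGetValue avoids a KeyNotFoundException for a missing key
        int value;
        if (!ages.TryGetValue("carol", out value))
            Console.WriteLine("carol not found");
    }
}
```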

I have to admit, my confidence in my Index Lookup post was somewhat shaken when he asked me about it so matter-of-factly. Had I made a mistake? Was a hash table a better solution? We spent a little over an hour this morning testing various scenarios, which are the subject of this post. First, we needed a data set, so I did something I should have done years ago: I generated a CSV file containing 1 million rows of junk data, the first column of which is a semi-sequential integer ID, followed by 6 columns of random string data. I'll post a link to this dataset on the files page. The file looks like 1 million rows of this:

1,tfxnvwcyycrgth,ggqbrhvgd,ltpdhe,udygxoarwof,aiozgrspzejyyukjdm,ryelqidjspjvsqid
2,xtjloonptahderoju,zvwxqciaapjr,prpljdic,rvxiwfnyv,ycjhyzdgdoo,gbsrlpfffllpdaaigc
4,btwtfgjqlgx,nutpvpe,ysqgmqtexzvzabme,imlcytzifx,klfsbjoidqgpmfsmk,fmhofmdtrzzryhsytm
5,cxqndbtcxyxggokz,nvfsphsojlr,wolgsizoyenlkfvttsd,hglyganpyzuq,dbjla,fklnjiuijmyv
8,idrpdnzsujpzjx,gdlhtexdw,bpmrddaddhyllgya,lbvnzllbgbqclkauw,xnaq,pvfpkvs
10,jsjlbbf,ebqxc,bsxyqpfbnv,ugstwagnwhjcyytq,smksdetnsledzhfoqhv,iiajjgdsqya
11,yfiblcqbzgepckwzen,rydqklai,abwtlgetyedoqtnbdl,zjdocqjy,oxpkttbyjkdvgcba,tafqpgatjn
13,nudchjeapsvxsprthl,hcnrwitezqorjfe,phbnwxalsy,naxnth,guzcgtoffjablr,eclfxwjcczye
15,dbdvjtyk,nbhzqrqkrnl,wvbnkuwrwciphbdgfrs,dspkyxfrmrilshh,bhzzbsp,tsspnjezus
16,dgrpmoffeuwwzuyfx,trlderffqdcnjqz,dzjsmkwnkpw,supdggbbjabzjgxbc,lgwnrywffiuxto,jmvnzgcoyccxkjfmv

This structure can be represented by the following objects, for storage in a Dictionary and a List respectively.

    //This class can be used to represent a single row
    sealed class HashObject
    {
        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public HashObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

    //This class can be used to represent a single row
    sealed class IndexObject
    {
        //Added int property to record the object's index location within the List
        public int index { get; set; }

        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public IndexObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

Clearly there are two different metrics to test: the first is the amount of memory used by each method, and the second is the amount of time required to access a given number of objects. We'll start with testing memory. .NET doesn't give us many good ways to determine the amount of memory a given object or collection is using, so for our purposes I am going to simply use Task Manager to determine the amount of memory the process is using. I will be running all of my code with Visual Studio 2010 as a .NET Framework 4.0 command-line application in debug mode.
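If you would rather get a number from inside the program itself, one rough alternative is GC.GetTotalMemory. Note that it reports bytes allocated on the managed heap, so it will not match Task Manager's working-set figure (which also counts the CLR itself); it is only useful for before/after comparisons like this sketch:

```csharp
using System;
using System.Collections.Generic;

static class MemoryDemo
{
    static void Main()
    {
        //Passing true forces a collection first, so the baseline is stable
        long before = GC.GetTotalMemory(true);

        var list = new List<string>();
        for (int i = 0; i < 100000; i++)
            list.Add(new string('x', 20));

        long after = GC.GetTotalMemory(true);
        Console.WriteLine("approx managed bytes used: {0}", after - before);

        //Keep the list reachable until after the measurement
        GC.KeepAlive(list);
    }
}
```

The delta is approximate (the GC may move or reserve memory as it pleases), but it is repeatable enough to compare two loading strategies against each other.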

First, we’ll measure the memory usage of the hash table implementation. I’ve omitted the HashObject definition for brevity.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);
                }
            }            

            Console.ReadKey();
        }
    }

I repeated this process three times and averaged both the memory usage and the amount of time required to fully load the .csv into the hash table. The average amount of memory allocated to my application's process was 327212 KB, and the average load time was 2763 ms. Now let's repeat the process with the "index" version. Again, I will omit the IndexObject definition for brevity's sake.

    class Program
    {
        static void Main(string[] args)
        {
            var index = new List<IndexObject>();

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }            

            Console.ReadKey();
        }
    }

Again the process was executed three times and the results averaged; memory usage was 310432 KB on average, and time to load averaged 2673 ms. In both cases the index method seems to be more efficient, but are 16780 KB of memory and 90 ms of load time worth it? What if the hash table can satisfy item lookups more quickly? Let's continue onward and find out.

Here is the code snippet I am using to get a distributed list of items to perform lookups on. Changing the modulo factor (the "7" in the i % 7 test below) to a larger number will decrease the sample size, while decreasing it will increase the sample size.

    //For the lookup speed test we are going to need two collections, since the ID/Index #s do not match up directly
    var idx_lookups = new List<int>();
    var hash_lookups = new List<int>();

    //Get the list of Keys in the hash table
    var hashkeys = hash.Keys.ToList();

    for (int i = 0; i < index.Count; i++)
    {
        //This will give us an approximately 14% sample size
        if (i % 7 == 0)
        {
            idx_lookups.Add(i);

            //This will add the corresponding "ID" to the hash_lookups
            // collection, so both the index and hash method will be looking
            // up the same items
            hash_lookups.Add(hashkeys[i]);
        }
    }

Here is the full code of my program at this point. I'm excluding the definitions for HashObject and IndexObject to save space; they have not changed from how they were originally posted above. This code will load both the hash table and my index lookup list, then identify approximately 14% of the records and look them up both in the hash table and in the list by index. The amount of time in ticks (1 tick = 100 nanoseconds) will be printed to the screen for each method.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();
            var index = new List<IndexObject>();

            DateTime start;

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');
                    
                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);
                    
                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }


            //For the lookup speed test we are going to need two collections, since the ID/Index #s do not match up directly
            var idx_lookups = new List<int>();
            var hash_lookups = new List<int>();

            //Get the list of Keys in the hash table
            var hashkeys = hash.Keys.ToList();

            for (int i = 0; i < index.Count; i++)
            {
                //This will give us an approximately 14% sample size
                if (i % 7 == 0)
                {
                    idx_lookups.Add(i);

                    //This will add the corresponding "ID" to the hash_lookups
                    // collection, so both the index and hash method will be looking
                    // up the same items
                    hash_lookups.Add(hashkeys[i]);
                }
            }

            start = DateTime.Now;
            foreach (var i in hash_lookups)
            {
                var obj = hash[i];
            }
            Console.WriteLine("hashtable: {0}", (DateTime.Now - start).Ticks);


            start = DateTime.Now;
            foreach (var i in idx_lookups)
            {
                var obj = index[i];
            }
            Console.WriteLine("index: {0}", (DateTime.Now - start).Ticks);

            
            Console.ReadKey();
        }
    }

On my system, this is the output:

hashtable: 50000
index: 20000

These results seem to remain consistent across different modulo values, with the index lookup method always being between 2 and 2.5 times faster than the hash table lookup.
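A note on the suspiciously round numbers: DateTime.Now has a coarse resolution (on the order of 1-15 ms depending on the system timer), which is why the measurements land on multiples of the same tick count. For finer-grained timing, System.Diagnostics.Stopwatch is the usual alternative; here is a sketch of the same timing pattern applied to a stand-in workload:

```csharp
using System;
using System.Diagnostics;

static class TimingDemo
{
    static void Main()
    {
        //Stopwatch uses the high-resolution performance counter when
        //  available, so even very short operations register.
        var sw = Stopwatch.StartNew();

        long sum = 0;
        for (int i = 0; i < 1000000; i++)
            sum += i;

        sw.Stop();
        Console.WriteLine("elapsed ticks: {0}", sw.ElapsedTicks);
        Console.WriteLine("elapsed ms:    {0}", sw.ElapsedMilliseconds);
        Console.WriteLine(sum);  //499999500000
    }
}
```

The relative ranking of the two lookup methods doesn't change with Stopwatch; the individual numbers just stop snapping to the system timer's granularity.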

So in summary

  • The index lookup method uses less memory
  • The index lookup method loads more quickly
  • The index lookup method is able to access items more quickly

The index lookup method does have one big disadvantage compared to the hash table, which I should mention in all fairness since I've been wailing on the hash table pretty hard so far: removing items from the collection is a very dicey proposition, since it invalidates the index properties of every item that came "after" the removed item. It's up to you to decide which method is most appropriate for your given development situation.

Questions? Comments? Just leave a post.

Until next time!
-Alan