Index Lookup or hash table?

I recently shared out the link to my blog to a few friends and colleagues. After all, that’s how you get readers, right? Anyhow, my best friend, who happens to be a PowerShell guru, was talking to me this morning about my Index Lookup post; and his question was, “why didn’t you just use a hashtable?”

We don’t, at least I don’t, think of the Hashtable very often anymore, ever since .NET 2.0 came along and swept us off our feet with the wonderful happiness that is generics. So when a PowerShell guy says “Hashtable,” what you probably will think of from a .NET perspective is a Dictionary<TKey, TValue>; this is simply the generic version of the Hashtable of old. In simplest terms, a Hashtable provides the ability to “map” a key to a specific value; and then quickly retrieve the value when you supply the key. For an in-depth discussion of hash tables, see Wikipedia, a good discussion of the Dictionary<TKey, TValue> specifically can be found on MSDN here.

I have to admit, my confidence in my Index Lookup post was somewhat shaken when he asked me about it so matter-of-factly. Had I made a mistake, was a hash table a better solution? He and I began collaborating for a little over an hour this morning testing various scenarios, which will be the subject of this post. First, we needed a data set, so I did something I should have done years ago; I generated a CSV file containing 1 million rows of junk data, the first column of which is a semi-sequential integer ID, and then 6 columns of random string data. I’ll post a link to this dataset on the files page. The file looks like 1 million rows of this:

1,tfxnvwcyycrgth,ggqbrhvgd,ltpdhe,udygxoarwof,aiozgrspzejyyukjdm,ryelqidjspjvsqid
2,xtjloonptahderoju,zvwxqciaapjr,prpljdic,rvxiwfnyv,ycjhyzdgdoo,gbsrlpfffllpdaaigc
4,btwtfgjqlgx,nutpvpe,ysqgmqtexzvzabme,imlcytzifx,klfsbjoidqgpmfsmk,fmhofmdtrzzryhsytm
5,cxqndbtcxyxggokz,nvfsphsojlr,wolgsizoyenlkfvttsd,hglyganpyzuq,dbjla,fklnjiuijmyv
8,idrpdnzsujpzjx,gdlhtexdw,bpmrddaddhyllgya,lbvnzllbgbqclkauw,xnaq,pvfpkvs
10,jsjlbbf,ebqxc,bsxyqpfbnv,ugstwagnwhjcyytq,smksdetnsledzhfoqhv,iiajjgdsqya
11,yfiblcqbzgepckwzen,rydqklai,abwtlgetyedoqtnbdl,zjdocqjy,oxpkttbyjkdvgcba,tafqpgatjn
13,nudchjeapsvxsprthl,hcnrwitezqorjfe,phbnwxalsy,naxnth,guzcgtoffjablr,eclfxwjcczye
15,dbdvjtyk,nbhzqrqkrnl,wvbnkuwrwciphbdgfrs,dspkyxfrmrilshh,bhzzbsp,tsspnjezus
16,dgrpmoffeuwwzuyfx,trlderffqdcnjqz,dzjsmkwnkpw,supdggbbjabzjgxbc,lgwnrywffiuxto,jmvnzgcoyccxkjfmv

This structure can be represented in the following objects for a storage in a Dictionary and List respectively

    //This class can be used to represent a single row
    sealed class HashObject
    {
        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public HashObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

    //This class can be used to represent a single row
    sealed class IndexObject
    {
        //Added int property to record the object's index location within the List
        public int index { get; set; }

        public int ID { get; set; }
        public string Col1 { get; set; }
        public string Col2 { get; set; }
        public string Col3 { get; set; }
        public string Col4 { get; set; }
        public string Col5 { get; set; }
        public string Col6 { get; set; }

        public IndexObject(int id, string col1, string col2, string col3, string col4, string col5, string col6)
        {
            this.ID = id;
            this.Col1 = col1;
            this.Col2 = col2;
            this.Col3 = col3;
            this.Col4 = col4;
            this.Col5 = col5;
            this.Col6 = col6;
        }
    }

Clearly there are two different metrics to test. The first is the amount of memory used by each method, the second is the amount of time required to access a given number of objects. We’ll start with testing memory. .NET doesn’t give us many good ways to determine the amount of memory a given object or collection is using, so for our purposes I am going to simply use task manager to determine the amount of memory that the process is using. I will be running all of my code with Visual Studio 2010 as a .NET Framework 4.0 command-line application in debug mode.

First, we’ll measure the memory usage of the hash table implementation. I’ve omitted the HashObject definition for brevity.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<string, HashObject>();

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);
                }
            }            

            Console.ReadKey();
        }
    }

I repeated this process three times and took averages of both memory usage, and the amount of time required to fully load the .csv into the hashtable. The average amount of memory allocated to my application’s process was 327212 KB, and the average amount of time to load was 2763 MS. Now let’s repeat the process with the “index” version. Again, I will omit the IndexObject definition for brevity sake.

    class Program
    {
        static void Main(string[] args)
        {
            var index = new List<IndexObject>();

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');

                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }            

            Console.ReadKey();
        }
    }

Again the process was executed three times and the results averaged; memory usage was 310432 KB on average and time to load was an average of 2673 MS. In both cases the index method seems to be more effecient, but is 16780 KB of memory and 90 MS of cpu time worth it? What if the hash table can more quickly satisfy item lookups? Let’s continue onward and find out.

Here is the code snipped I am using to get a distributed list of items to perform a lookup on. Changing the modulo factor (the ‘7’ on line 11 below) to a larger number will decrease the sample size, while decreasing it will increase the sample size.

    //For the lookup speed test we are going to need two collections, since the ID/Index #s do not match up directly
    var idx_lookups = new List<int>();
    var hash_lookups = new List<int>();

    //Get the list of Keys in the hash table
    var hashkeys = hash.Keys.ToList();

    for (int i = 0; i < index.Count; i++)
    {
        //This will give us an approximately 14% sample size
        if (i % 7 == 0)
        {
            idx_lookups.Add(i);

            //This will add the corresponding "ID" to the hash_lookups
            // collection, so both the index and hash method will be looking
            // up the same items
            hash_lookups.Add(hashkeys[i]);
        }
    }

Here is the full code of my program at this point, I’m excluding the definitions for HashObject and IndexObject to save space, and they have not changed from how they were originally posted above. This code will load both the hash table, and my index lookup list; then identify approximately 14% of the records and look them up in both the hash table and the list by index. The amount of time in Ticks (1 Tick = 100 nanoseconds) will be printed to the screen for each method.

    class Program
    {
        static void Main(string[] args)
        {
            var hash = new Dictionary<int, HashObject>();
            var index = new List<IndexObject>();

            DateTime start;

            using (var sr = new System.IO.StreamReader(@".\junkdata.csv"))
            {
                while (sr.Peek() != -1)
                {
                    var line = sr.ReadLine();

                    var fields = line.Split(',');
                    
                    var hashobject = new HashObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    hash.Add(hashobject.ID, hashobject);
                    
                    var indexobject = new IndexObject(Convert.ToInt32(fields[0]), fields[1], fields[2], fields[3], fields[4], fields[5], fields[6]);
                    indexobject.index = index.Count;
                    index.Add(indexobject);
                }
            }


            //For the lookup speed test we are going to need two collections, since the ID/Index #s do not match up directly
            var idx_lookups = new List<int>();
            var hash_lookups = new List<int>();

            //Get the list of Keys in the hash table
            var hashkeys = hash.Keys.ToList();

            for (int i = 0; i < index.Count; i++)
            {
                //This will give us an approximately 14% sample size
                if (i % 7 == 0)
                {
                    idx_lookups.Add(i);

                    //This will add the corresponding "ID" to the hash_lookups
                    // collection, so both the index and hash method will be looking
                    // up the same items
                    hash_lookups.Add(hashkeys[i]);
                }
            }

            start = DateTime.Now;
            foreach (var i in hash_lookups)
            {
                var obj = hash[i];
            }
            Console.WriteLine("hashtable: {0}", (DateTime.Now - start).Ticks);


            start = DateTime.Now;
            foreach (var i in idx_lookups)
            {
                var obj = index[i];
            }
            Console.WriteLine("index: {0}", (DateTime.Now - start).Ticks);

            
            Console.ReadKey();
        }
    }

On my system, this is the output:

hashtable: 50000
index: 20000

These results seem to remain consistent across different modulo values, with the index lookup method always being between 2 and 2.5 times faster than the hash table lookup.

So in summary

  • The index lookup method uses less memory
  • The index lookup method loads more quickly
  • The index lookup method is able to access items more quickly

The index lookup method does have one big disadvantage to the hash table, which I should probably mention in all fairness since I’ve been wailing on the hash table pretty hard so far. Removing items from the collection is a very dicey proposition, since it would invalidate all of your index properties that came “after” the item you removed. It’s up to you to decide which method is most appropriate for your given development situation.

Questions? Comments? Just leave a post.

Until next time!
-Alan

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s