Thinking in Java 11: Collections of Objects

Hashing and hash codes

In Statistics.java, a standard library class (Integer) was used as a key for the HashMap. It worked because it has all the necessary wiring to make it behave correctly as a key. But a common pitfall occurs with HashMaps when you create your own classes to be used as keys. For example, consider a weather-predicting system that matches Groundhog objects to Prediction objects. It seems fairly straightforward: you create the two classes, and use Groundhog as the key and Prediction as the value:

//: c11:Groundhog.java
// Looks plausible, but doesn't work as a HashMap key.

public class Groundhog {
  protected int number;
  public Groundhog(int n) { number = n; }
  public String toString() {
    return "Groundhog #" + number;
  }
} ///:~

//: c11:Prediction.java
// Predicting the weather with groundhogs.

public class Prediction {
  private boolean shadow = Math.random() > 0.5;
  public String toString() {
    if(shadow)
      return "Six more weeks of Winter!";
    else
      return "Early Spring!";
  }
} ///:~

//: c11:SpringDetector.java
// What will the weather be?
import com.bruceeckel.simpletest.*;
import java.util.*;
import java.lang.reflect.*;

public class SpringDetector {
  private static Test monitor = new Test();
  // Uses a Groundhog or class derived from Groundhog:
  public static void
  detectSpring(Class groundHogClass) throws Exception {
    Constructor ghog = groundHogClass.getConstructor(
      new Class[] {int.class});
    Map map = new HashMap();
    for(int i = 0; i < 10; i++)
      map.put(ghog.newInstance(
        new Object[]{ new Integer(i) }), new Prediction());
    System.out.println("map = " + map + "\n");
    Groundhog gh = (Groundhog)
      ghog.newInstance(new Object[]{ new Integer(3) });
    System.out.println("Looking up prediction for " + gh);
    if(map.containsKey(gh))
      System.out.println((Prediction)map.get(gh));
    else
      System.out.println("Key not found: " + gh);
  }
  public static void main(String[] args) throws Exception {
    detectSpring(Groundhog.class);
    monitor.expect(new String[] {
      "%% map = \\{(Groundhog #\\d=" +
      "(Early Spring!|Six more weeks of Winter!)" +
      "(, )?){10}\\}",
      "",
      "Looking up prediction for Groundhog #3",
      "Key not found: Groundhog #3"
    });
  }
} ///:~

Each Groundhog is given an identity number, so you can look up a Prediction in the HashMap by saying, “Give me the Prediction associated with Groundhog #3.” The Prediction class contains a boolean that is initialized using Math.random( ) and a toString( ) that interprets the result for you. The detectSpring( ) method uses reflection to instantiate and use the class Groundhog or any class derived from it. This will come in handy when we inherit a new type of Groundhog to solve the problem demonstrated here. A HashMap is filled with Groundhogs and their associated Predictions. The HashMap is printed so that you can see it has been filled. Then a new Groundhog with an identity number of 3 is used as a key to look up the prediction for Groundhog #3 (which you can see must be in the Map).

It seems simple enough, but it doesn’t work. The problem is that Groundhog is inherited from the common root class Object (which is what happens if you don’t specify a base class; thus all classes are ultimately inherited from Object). It is Object’s hashCode( ) method that is used to generate the hash code for each object, and by default it is based on the identity of the object (typically derived from its address) rather than on its contents. Thus, the first instance of Groundhog(3) does not produce a hash code equal to the hash code for the second instance of Groundhog(3) that we tried to use as a lookup.
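
The effect of the inherited, identity-based behavior can be seen in isolation. The following is a minimal sketch rather than one of the book's listings: PlainKey is a hypothetical stand-in for Groundhog, and the code uses generics, which the listings in this chapter predate.

```java
import java.util.HashMap;
import java.util.Map;

class DefaultIdentityDemo {
  // Hypothetical stand-in for Groundhog: it overrides
  // nothing that it inherits from Object.
  static class PlainKey {
    final int number;
    PlainKey(int n) { number = n; }
  }
  // Looks up with a second object that has the same number.
  static boolean foundWithFreshKey(int n) {
    Map<PlainKey, String> map = new HashMap<>();
    map.put(new PlainKey(n), "prediction");
    return map.containsKey(new PlainKey(n));
  }
  // Looks up with the very object that was stored.
  static boolean foundWithSameKey(int n) {
    Map<PlainKey, String> map = new HashMap<>();
    PlainKey key = new PlainKey(n);
    map.put(key, "prediction");
    return map.containsKey(key);
  }
  public static void main(String[] args) {
    System.out.println(foundWithFreshKey(3)); // false
    System.out.println(foundWithSameKey(3));  // true
  }
}
```

Only the lookup that reuses the stored object succeeds, which is the behavior SpringDetector reports as "Key not found."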

You might think that all you need to do is write an appropriate override for hashCode( ). But it still won’t work until you’ve done one more thing: override the equals( ) that is also part of Object. equals( ) is used by the HashMap when trying to determine if your key is equal to any of the keys in the table.

A proper equals( ) must satisfy the following five conditions:

Reflexive: For any x, x.equals(x) should return true.
Symmetric: For any x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
Transitive: For any x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
Consistent: For any x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the object is modified.
For any non-null x, x.equals(null) should return false.
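
These conditions can be spot-checked for particular objects. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, and passing such checks for a few objects does not prove that an equals( ) is correct for all of them. It is exercised here with String, whose equals( ) compares contents.

```java
class EqualsContractCheck {
  static boolean reflexive(Object x) {
    return x.equals(x);
  }
  static boolean symmetric(Object x, Object y) {
    return x.equals(y) == y.equals(x);
  }
  static boolean transitive(Object x, Object y, Object z) {
    // A violation only if the first two hold and the third does not.
    return !(x.equals(y) && y.equals(z)) || x.equals(z);
  }
  static boolean consistent(Object x, Object y) {
    boolean first = x.equals(y);
    for(int i = 0; i < 10; i++)
      if(x.equals(y) != first) return false;
    return true;
  }
  static boolean rejectsNull(Object x) {
    return !x.equals(null);
  }
  public static void main(String[] args) {
    // Three distinct String objects with equal contents.
    String a = new String("groundhog");
    String b = new String("groundhog");
    String c = new String("groundhog");
    System.out.println(reflexive(a));        // true
    System.out.println(symmetric(a, b));     // true
    System.out.println(transitive(a, b, c)); // true
    System.out.println(consistent(a, b));    // true
    System.out.println(rejectsNull(a));      // true
  }
}
```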

Again, the default Object.equals( ) simply compares object addresses, so one Groundhog(3) is not equal to another Groundhog(3). Thus, to use your own classes as keys in a HashMap, you must override both hashCode( ) and equals( ), as shown in the following solution to the groundhog problem:

//: c11:Groundhog2.java
// A class that's used as a key in a HashMap
// must override hashCode() and equals().

public class Groundhog2 extends Groundhog {
  public Groundhog2(int n) { super(n); }
  public int hashCode() { return number; }
  public boolean equals(Object o) {
    return (o instanceof Groundhog2)
      && (number == ((Groundhog2)o).number);
  }
} ///:~

//: c11:SpringDetector2.java
// A working key.
import com.bruceeckel.simpletest.*;
import java.util.*;

public class SpringDetector2 {
  private static Test monitor = new Test();
  public static void main(String[] args) throws Exception {
    SpringDetector.detectSpring(Groundhog2.class);
    monitor.expect(new String[] {
      "%% map = \\{(Groundhog #\\d=" +
      "(Early Spring!|Six more weeks of Winter!)" +
      "(, )?){10}\\}",
      "",
      "Looking up prediction for Groundhog #3",
      "%% Early Spring!|Six more weeks of Winter!"
    });
  }
} ///:~

Groundhog2.hashCode( ) returns the groundhog number as a hash value. In this example, the programmer is responsible for ensuring that no two groundhogs exist with the same ID number. The hashCode( ) is not required to return a unique identifier (something you’ll understand better later in this chapter), but the equals( ) method must be able to strictly determine whether two objects are equivalent. Here, equals( ) is based on the groundhog number, so if two Groundhog2 objects exist as keys in the HashMap with the same groundhog number, it will fail.

Even though it appears that the equals( ) method is only checking to see whether the argument is an instance of Groundhog2 (using the instanceof keyword, which was explained in Chapter 10), the instanceof actually quietly does a second sanity check to see if the object is null, since instanceof produces false if the left-hand argument is null. Assuming it’s the correct type and not null, the comparison is based on the actual number fields. You can see from the output that the behavior is now correct.
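
The null-handling behavior of instanceof is easy to confirm on its own. This is a small sketch with a hypothetical helper, isString( ), not code from the book:

```java
class InstanceofNullDemo {
  static boolean isString(Object o) {
    // False for null as well as for objects of other types.
    return o instanceof String;
  }
  public static void main(String[] args) {
    System.out.println(isString("shadow"));           // true
    System.out.println(isString(null));               // false
    System.out.println(isString(Integer.valueOf(3))); // false
  }
}
```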

When creating your own class to use in a HashSet, you must pay attention to the same issues as when it is used as a key in a HashMap.
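
As a rough illustration of the HashSet case, the sketch below adds two objects with the same id to a set. Plain and Valued are hypothetical classes introduced only for this comparison, and the code uses generics, which the book's listings predate.

```java
import java.util.HashSet;
import java.util.Set;

class HashSetKeyDemo {
  // Hypothetical element type that overrides nothing.
  static class Plain {
    final int id;
    Plain(int id) { this.id = id; }
  }
  // Hypothetical element type with value-based
  // hashCode() and equals().
  static class Valued {
    final int id;
    Valued(int id) { this.id = id; }
    @Override public int hashCode() { return id; }
    @Override public boolean equals(Object o) {
      return (o instanceof Valued) && id == ((Valued)o).id;
    }
  }
  static int plainCount() {
    Set<Plain> set = new HashSet<>();
    set.add(new Plain(3));
    set.add(new Plain(3));
    return set.size();
  }
  static int valuedCount() {
    Set<Valued> set = new HashSet<>();
    set.add(new Valued(3));
    set.add(new Valued(3));
    return set.size();
  }
  public static void main(String[] args) {
    System.out.println(plainCount());  // 2
    System.out.println(valuedCount()); // 1
  }
}
```

Without the overrides, the set treats the two objects as different elements and keeps both.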

Understanding hashCode( )

The preceding example is only a start toward solving the problem correctly. It shows that if you do not override hashCode( ) and equals( ) for your key, the hashed data structure (HashSet, HashMap, LinkedHashSet, or LinkedHashMap) will not be able to deal with your key properly. However, to get a good solution for the problem you need to understand what’s going on inside the hashed data structure.

First, consider the motivation behind hashing: you want to look up an object using another object. But you can accomplish this with a TreeSet or TreeMap, too. It’s also possible to implement your own Map. To do so, the Map.entrySet( ) method must be supplied to produce a set of Map.Entry objects. MPair will be defined as the new type of Map.Entry. In order for it to be placed in a TreeSet, it must implement equals( ) and be Comparable:

//: c11:MPair.java
// A new type of Map.Entry.
import java.util.*;

public class MPair implements Map.Entry, Comparable {
  private Object key, value;
  public MPair(Object k, Object v) {
    key = k;
    value = v;
  }
  public Object getKey() { return key; }
  public Object getValue() { return value; }
  public Object setValue(Object v) {
    Object result = value;
    value = v;
    return result;
  }
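  // Only the key takes part in equality, so hashCode()
  // is based on the key as well.
  public boolean equals(Object o) {
    return (o instanceof MPair)
      && key.equals(((MPair)o).key);
  }
  public int hashCode() { return key.hashCode(); }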
  public int compareTo(Object rv) {
    return ((Comparable)key).compareTo(((MPair)rv).key);
  }
} ///:~

Notice that the comparisons are only interested in the keys, so duplicate values are perfectly acceptable.

The following example implements a Map using a pair of ArrayLists:

//: c11:SlowMap.java
// A Map implemented with ArrayLists.
import com.bruceeckel.simpletest.*;
import java.util.*;
import com.bruceeckel.util.*;

public class SlowMap extends AbstractMap {
  private static Test monitor = new Test();
  private List
    keys = new ArrayList(),
    values = new ArrayList();
  public Object put(Object key, Object value) {
    Object result = get(key);
    if(!keys.contains(key)) {
      keys.add(key);
      values.add(value);
    } else
      values.set(keys.indexOf(key), value);
    return result;
  }
  public Object get(Object key) {
    if(!keys.contains(key))
      return null;
    return values.get(keys.indexOf(key));
  }
  public Set entrySet() {
    Set entries = new HashSet();
    Iterator
      ki = keys.iterator(),
      vi = values.iterator();
    while(ki.hasNext())
      entries.add(new MPair(ki.next(), vi.next()));
    return entries;
  }
  public String toString() {
    StringBuffer s = new StringBuffer("{");
    Iterator
      ki = keys.iterator(),
      vi = values.iterator();
    while(ki.hasNext()) {
      s.append(ki.next() + "=" + vi.next());
      if(ki.hasNext()) s.append(", ");
    }
    s.append("}");
    return s.toString();
  }
  public static void main(String[] args) {
    SlowMap m = new SlowMap();
    Collections2.fill(m, Collections2.geography, 15);
    System.out.println(m);
    monitor.expect(new String[] {
      "{ALGERIA=Algiers, ANGOLA=Luanda, BENIN=Porto-Novo,"+
      " BOTSWANA=Gaberone, BURKINA FASO=Ouagadougou, " +
      "BURUNDI=Bujumbura, CAMEROON=Yaounde, " +
      "CAPE VERDE=Praia, CENTRAL AFRICAN REPUBLIC=Bangui,"+
      " CHAD=N'djamena, COMOROS=Moroni, " +
      "CONGO=Brazzaville, DJIBOUTI=Dijibouti, " +
      "EGYPT=Cairo, EQUATORIAL GUINEA=Malabo}"
    });
  }
} ///:~

The put( ) method simply places the keys and values in corresponding ArrayLists. In main( ), a SlowMap is loaded and then printed to show that it works.

The whole point of hashing is speed: Hashing allows the lookup to happen quickly. Since the bottleneck is in the speed of the key lookup, one of the solutions to the problem could be to keep the keys sorted and then use Collections.binarySearch( ) to perform the lookup (an exercise at the end of this chapter will walk you through this process).
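
One possible shape for the sorted-keys approach is sketched below. It is not the solution to the exercise, only an illustration under assumptions: the class name SortedKeyLookup is hypothetical, keys and values are restricted to String for brevity, and generics are used even though the book's listings predate them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedKeyLookup {
  // keys is kept sorted; values.get(i) belongs to keys.get(i).
  private final List<String> keys = new ArrayList<>();
  private final List<String> values = new ArrayList<>();

  // Returns the old value, or null if the key is new.
  public String put(String key, String value) {
    int i = Collections.binarySearch(keys, key);
    if(i >= 0)
      return values.set(i, value);
    // A negative result encodes the insertion point
    // as -(insertion point) - 1.
    int insertAt = -(i + 1);
    keys.add(insertAt, key);
    values.add(insertAt, value);
    return null;
  }
  public String get(String key) {
    int i = Collections.binarySearch(keys, key);
    return i >= 0 ? values.get(i) : null;
  }
  public static void main(String[] args) {
    SortedKeyLookup m = new SortedKeyLookup();
    m.put("CHAD", "N'djamena");
    m.put("ALGERIA", "Algiers");
    m.put("EGYPT", "Cairo");
    System.out.println(m.get("ALGERIA")); // Algiers
    System.out.println(m.get("BENIN"));   // null
  }
}
```

The search itself takes a logarithmic number of comparisons, although inserting into the middle of an ArrayList still has to shift elements.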

Hashing goes further by saying that all you want to do is to store the key somewhere so that it can be quickly found. As you’ve seen in this chapter, the fastest structure in which to store a group of elements is an array, so that will be used for representing the key information (note carefully that I said “key information,” and not the key itself). Also seen in this chapter was the fact that an array, once allocated, cannot be resized, so we have a problem: We want to be able to store any number of values in the Map, but if the number of keys is fixed by the array size, how can this be?

The answer is that the array will not hold the keys. From the key object, a number will be derived that will index into the array. This number is the hash code, produced by the hashCode( ) method (in computer science parlance, this is the hash function) defined in Object and presumably overridden by your class. To solve the problem of the fixed-size array, more than one key may produce the same index. That is, there may be collisions. Because of this, it doesn’t matter how big the array is; each key object will land somewhere in that array.

So the process of looking up a value starts by computing the hash code and using it to index into the array. If you could guarantee that there were no collisions (which could be possible if you have a fixed number of values) then you’d have a perfect hashing function, but that’s a special case. In all other cases, collisions are handled by external chaining: The array points not directly to a value, but instead to a list of values. These values are searched in a linear fashion using the equals( ) method. Of course, this aspect of the search is much slower, but if the hash function is good, there will only be a few values in each slot. So instead of searching through the entire list, you quickly jump to a slot where you have to compare a few entries to find the value. This is much faster, which is why the HashMap is so quick.
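
A collision is easy to produce with the standard String class: "Aa" and "BB" are different strings with the same hash code. The sketch below (a hypothetical demo class, using generics that the book's listings predate) shows that a HashMap still keeps the two keys apart, because equals( ) settles what the hash code cannot.

```java
import java.util.HashMap;
import java.util.Map;

class CollisionDemo {
  static Map<String, String> build() {
    Map<String, String> map = new HashMap<>();
    map.put("Aa", "first");
    map.put("BB", "second");
    return map;
  }
  public static void main(String[] args) {
    System.out.println("Aa".hashCode());   // 2112
    System.out.println("BB".hashCode());   // 2112
    System.out.println("Aa".equals("BB")); // false
    Map<String, String> map = build();
    System.out.println(map.get("Aa"));     // first
    System.out.println(map.get("BB"));     // second
    System.out.println(map.size());        // 2
  }
}
```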

Knowing the basics of hashing, it’s possible to implement a simple hashed Map:

//: c11:SimpleHashMap.java
// A demonstration hashed Map.
import java.util.*;
import com.bruceeckel.util.*;

public class SimpleHashMap extends AbstractMap {
  // Choose a prime number for the hash table
  // size, to achieve a uniform distribution:
  private static final int SZ = 997;
  private LinkedList[] bucket = new LinkedList[SZ];
  public Object put(Object key, Object value) {
    Object result = null;
    int index = key.hashCode() % SZ;
    if(index < 0) index = -index;
    if(bucket[index] == null)
      bucket[index] = new LinkedList();
    LinkedList pairs = bucket[index];
    MPair pair = new MPair(key, value);
    ListIterator it = pairs.listIterator();
    boolean found = false;
    while(it.hasNext()) {
      Object iPair = it.next();
      if(iPair.equals(pair)) {
        result = ((MPair)iPair).getValue();
        it.set(pair); // Replace old with new
        found = true;
        break;
      }
    }
    if(!found)
      bucket[index].add(pair);
    return result;
  }
  public Object get(Object key) {
    int index = key.hashCode() % SZ;
    if(index < 0) index = -index;
    if(bucket[index] == null) return null;
    LinkedList pairs = bucket[index];
    MPair match = new MPair(key, null);
    ListIterator it = pairs.listIterator();
    while(it.hasNext()) {
      Object iPair = it.next();
      if(iPair.equals(match))
        return ((MPair)iPair).getValue();
    }
    return null;
  }
  public Set entrySet() {
    Set entries = new HashSet();
    for(int i = 0; i < bucket.length; i++) {
      if(bucket[i] == null) continue;
      Iterator it = bucket[i].iterator();
      while(it.hasNext())
        entries.add(it.next());
    }
    return entries;
  }
  public static void main(String[] args) {
    SimpleHashMap m = new SimpleHashMap();
    Collections2.fill(m, Collections2.geography, 25);
    System.out.println(m);
  }
} ///:~

Because the “slots” in a hash table are often referred to as buckets, the array that represents the actual table is called bucket. To promote even distribution, the number of buckets is typically a prime number.^[59] Notice that it is an array of LinkedList, which automatically provides for collisions; each new item is simply added to the end of the list.

The return value of put( ) is null or, if the key was already in the list, the old value associated with that key. The return value is result, which is initialized to null, but if a key is discovered in the list, then result is assigned the old value that was associated with that key.

For both put( ) and get( ), the first thing that happens is that the hashCode( ) is called for the key, and the result is forced to fit into the bucket array using the modulus operator and the size of the array. Because the remainder of a negative hash code is itself negative, the index is then forced to a positive number. In put( ), if that location is null, it means there are no elements that hash to that location, so a new LinkedList is created to hold the object that just did. However, the normal process is to look through the list to see if there are duplicates, and if there are, the old value is put into result and the new value replaces the old. The found flag keeps track of whether an old key-value pair was found and, if not, the new pair is appended to the end of the list.
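
The order of the two steps in the index calculation matters. The sketch below compares the remainder-then-negate calculation used in SimpleHashMap with two alternatives. The method names are hypothetical, and Math.floorMod( ) is only available in Java 8 and later, so it is shown as an option rather than as what the listing does.

```java
class BucketIndexDemo {
  // As in SimpleHashMap: take the remainder, then drop the sign.
  static int modThenNegate(int hash, int size) {
    int index = hash % size;
    if(index < 0) index = -index;
    return index;
  }
  // Alternative: floorMod never yields a negative
  // result when size is positive.
  static int floorModIndex(int hash, int size) {
    return Math.floorMod(hash, size);
  }
  // Pitfall: Math.abs(Integer.MIN_VALUE) is still negative,
  // so taking the absolute value first can give a bad index.
  static int absThenMod(int hash, int size) {
    return Math.abs(hash) % size;
  }
  public static void main(String[] args) {
    int size = 997;
    System.out.println(modThenNegate(-1, size)); // 1
    System.out.println(floorModIndex(-1, size)); // 996
    int worst = Integer.MIN_VALUE;
    System.out.println(modThenNegate(worst, size) >= 0); // true
    System.out.println(floorModIndex(worst, size) >= 0); // true
    System.out.println(absThenMod(worst, size) >= 0);    // false
  }
}
```

The first two methods can map the same negative hash code to different buckets, which is harmless as long as put( ) and get( ) use the same calculation.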

In get( ), you’ll see code very similar to that contained in put( ), but simpler. The index is calculated into the bucket array, and if a LinkedList exists, it is searched for a match.

entrySet( ) must find and traverse all the lists, adding their entries to the result Set. Once this method has been created, the Map can be tested by filling it with values and then printing them.

HashMap performance factors

To understand the issues, some terminology is necessary:

Capacity: The number of buckets in the table.

Initial capacity: The number of buckets when the table is created. HashMap and HashSet have constructors that allow you to specify the initial capacity.

Size: The number of entries currently in the table.

Load factor: size/capacity. A load factor of 0 is an empty table, 0.5 is a half-full table, etc. A lightly loaded table will have few collisions and so is optimal for insertions and lookups (but will slow down the process of traversing with an iterator). HashMap and HashSet have constructors that allow you to specify the load factor, which means that when this load factor is reached, the container will automatically increase the capacity (the number of buckets) by roughly doubling it and will redistribute the existing objects into the new set of buckets (this is called rehashing).

The default load factor used by HashMap is 0.75 (it doesn’t rehash until the table is ¾ full). This seems to be a good trade-off between time and space costs. A higher load factor decreases the space required by the table but increases the lookup cost, which is important because lookup is what you do most of the time (including both get( ) and put( )).

If you know that you’ll be storing many entries in a HashMap, creating it with an appropriately large initial capacity will prevent the overhead of automatic rehashing.^[60]
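
The arithmetic behind choosing an initial capacity can be sketched as follows. The helper names loadFactor( ) and capacityFor( ) are hypothetical, and the code uses generics, which the book's listings predate. The two-argument HashMap constructor is part of the standard library, but the number of buckets an implementation actually allocates may differ from the value requested (it may be rounded up, for example).

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMapDemo {
  // Load factor as defined above: size / capacity.
  static double loadFactor(int size, int capacity) {
    return (double)size / capacity;
  }
  // Smallest capacity that holds the expected number of
  // entries without exceeding the given load factor.
  static int capacityFor(int expectedEntries, float maxLoadFactor) {
    return (int)Math.ceil(expectedEntries / (double)maxLoadFactor);
  }
  public static void main(String[] args) {
    int expected = 10000;
    float limit = 0.75f;
    int capacity = capacityFor(expected, limit);
    // Request enough buckets up front so that filling
    // the map should not trigger rehashing.
    Map<Integer, String> map = new HashMap<>(capacity, limit);
    for(int i = 0; i < expected; i++)
      map.put(i, "entry " + i);
    System.out.println(capacity);           // 13334
    System.out.println(map.size());         // 10000
    System.out.println(loadFactor(12, 16)); // 0.75
  }
}
```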
