{
"cells": [
{
"cell_type": "markdown",
"id": "fe2c4669",
"metadata": {},
"source": [
"# Hash Tables\n",
"\n",
"The *hash table* is a widely used data structure whose performance relies upon\n",
"mutability. The implementation of a hash table is quite involved compared to\n",
"other data structures we've implemented so far. We'll build it up slowly, so\n",
"that the need for and use of each piece can be appreciated.\n",
"\n",
"## Maps\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"hr8SmQK8ld8\")}}\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"I5E_BPkE_fE\")}}\n",
"\n",
"Hash tables implement the *map* data abstraction. A map binds *keys* to\n",
"*values*. This abstraction is so useful that it goes by many other names, among\n",
"them *associative array*, *dictionary*, and *symbol table*. We'll write maps\n",
"abstractly (i.e, mathematically; not actually OCaml syntax) as { $k_1 : v_1,\n",
"k_2: v_2, \\ldots, k_n : v_n$ }. Each $k : v$ is a *binding* of key $k$ to value\n",
"$v$. Here are a couple of examples:\n",
"\n",
"* A map binding a course number to something about it: {3110 : \"Fun\", 2110 :\n",
" \"OO\"}.\n",
"\n",
"* A map binding a university name to the year it was chartered: {\"Harvard\" :\n",
" 1636, \"Princeton\" : 1746, \"Penn\": 1740, \"Cornell\" : 1865}.\n",
"\n",
"The order in which the bindings are abstractly written does not matter, so the\n",
"first example might also be written {2110 : \"OO\", 3110 : \"Fun\"}. That's why we\n",
"use set braces—they suggest that the bindings are a set, with no ordering\n",
"implied.\n",
"\n",
"```{note}\n",
"As that notation suggests, maps and sets are very similar. Data structures that\n",
"can implement a set can also implement a map, and vice-versa:\n",
"\n",
"* Given a map data structure, we can treat the keys as elements of a set, and\n",
" simply ignore the values which the keys are bound to. This admittedly wastes a\n",
" little space, because we never need the values.\n",
"\n",
"* Given a set data structure, we can store key–value pairs as the\n",
" elements. Searching for elements (hence insertion and removal) might become\n",
" more expensive, because the set abstraction is unlikely to support searching\n",
" for keys by themselves.\n",
"```\n",
"\n",
"Here is an interface for maps:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "dcc7fb7a",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"module type Map =\n",
" sig\n",
" type ('k, 'v) t\n",
" val insert : 'k -> 'v -> ('k, 'v) t -> ('k, 'v) t\n",
" val find : 'k -> ('k, 'v) t -> 'v option\n",
" val remove : 'k -> ('k, 'v) t -> ('k, 'v) t\n",
" val empty : ('k, 'v) t\n",
" val of_list : ('k * 'v) list -> ('k, 'v) t\n",
" val bindings : ('k, 'v) t -> ('k * 'v) list\n",
" end\n"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"module type Map = sig\n",
"\n",
" (** [('k, 'v) t] is the type of maps that bind keys of type\n",
" ['k] to values of type ['v]. *)\n",
" type ('k, 'v) t\n",
"\n",
" (** [insert k v m] is the same map as [m], but with an additional\n",
" binding from [k] to [v]. If [k] was already bound in [m],\n",
" that binding is replaced by the binding to [v] in the new map. *)\n",
" val insert : 'k -> 'v -> ('k, 'v) t -> ('k, 'v) t\n",
"\n",
" (** [find k m] is [Some v] if [k] is bound to [v] in [m],\n",
" and [None] if not. *)\n",
" val find : 'k -> ('k, 'v) t -> 'v option\n",
"\n",
" (** [remove k m] is the same map as [m], but without any binding of [k].\n",
" If [k] was not bound in [m], then the map is unchanged. *)\n",
" val remove : 'k -> ('k, 'v) t -> ('k, 'v) t\n",
"\n",
" (** [empty] is the empty map. *)\n",
" val empty : ('k, 'v) t\n",
"\n",
" (** [of_list lst] is a map containing the same bindings as\n",
" association list [lst].\n",
" Requires: [lst] does not contain any duplicate keys. *)\n",
" val of_list : ('k * 'v) list -> ('k, 'v) t\n",
"\n",
" (** [bindings m] is an association list containing the same\n",
" bindings as [m]. There are no duplicates in the list. *)\n",
" val bindings : ('k, 'v) t -> ('k * 'v) list\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "feb988a8",
"metadata": {},
"source": [
"Next, we're going to examine three implementations of maps based on\n",
"\n",
"- association lists,\n",
"\n",
"- arrays, and\n",
"\n",
"- a combination of the above known as a *hash table with chaining*.\n",
"\n",
"Each implementation will need a slightly different interface, because of\n",
"constraints resulting from the underlying representation type. In each case\n",
"we'll pay close attention to the AF, RI, and efficiency of the operations.\n",
"\n",
"## Maps as Association Lists\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"6JUcwUgAHl8\")}}\n",
"\n",
"The simplest implementation of a map in OCaml is as an association list. We've\n",
"seen that representation twice so far [[1]][assoc-list] [[2]][map-module]. Here\n",
"is an implementation of `Map` using it:\n",
"\n",
"[assoc-list]: ../data/assoc_list\n",
"[map-module]: ../modules/functional_data_structures"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1bdf1dc5",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"module ListMap : Map\n"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"module ListMap : Map = struct\n",
" (** AF: [[(k1, v1); (k2, v2); ...; (kn, vn)]] is the map {k1 : v1, k2 : v2,\n",
" ..., kn : vn}. If a key appears more than once in the list, then in the\n",
" map it is bound to the left-most occurrence in the list. For example,\n",
" [[(k, v1); (k, v2)]] represents {k : v1}. The empty list represents\n",
" the empty map.\n",
" RI: none. *)\n",
" type ('k, 'v) t = ('k * 'v) list\n",
"\n",
" (** Efficiency: O(1). *)\n",
" let insert k v m = (k, v) :: m\n",
"\n",
" (** Efficiency: O(n). *)\n",
" let find = List.assoc_opt\n",
"\n",
" (** Efficiency: O(n). *)\n",
" let remove k lst = List.filter (fun (k', _) -> k <> k') lst\n",
"\n",
" (** Efficiency: O(1). *)\n",
" let empty = []\n",
"\n",
" (** Efficiency: O(1). *)\n",
" let of_list lst = lst\n",
"\n",
" (** [keys m] is a list of the keys in [m], without\n",
" any duplicates.\n",
" Efficiency: O(n log n). *)\n",
" let keys m = m |> List.map fst |> List.sort_uniq Stdlib.compare\n",
"\n",
" (** [binding m k] is [(k, v)], where [v] is the value that [k]\n",
" binds in [m].\n",
" Requires: [k] is a key in [m].\n",
" Efficiency: O(n). *)\n",
" let binding m k = (k, List.assoc k m)\n",
"\n",
" (** Efficiency: O(n log n) + O(n) * O(n), which is O(n^2). *)\n",
" let bindings m = List.map (binding m) (keys m)\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "ef8d7bb4",
"metadata": {},
"source": [
"{{ video_embed | replace(\"%%VID%%\", \"yZkQhcIM0OA\")}}\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"5aZNbVTXmtE\")}}\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"bKFfD3oHKTE\")}}\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"ek2Obhfx064\")}}\n",
"\n",
"## Maps as Arrays\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"cUEN8sFVkS4\")}}\n",
"\n",
"*Mutable maps* are maps whose bindings may be mutated. The interface for a\n",
"mutable map therefore differs from a immutable map. Insertion and removal\n",
"operations for a mutable map therefore return `unit`, because they do not\n",
"produce a new map but instead mutate an existing map.\n",
"\n",
"An array can be used to represent a mutable map whose keys are integers. A\n",
"binding from a key to a value is stored by using the key as an index into the\n",
"array, and storing the binding at that index. For example, we could use an array\n",
"to map office numbers to their occupants:\n",
"\n",
"|Office|Occupant|\n",
"|-|-|\n",
"|459|Fan|\n",
"|460|Gries|\n",
"|461|Clarkson|\n",
"|462|Muhlberger|\n",
"|463|*does not exist*|\n",
"\n",
"This kind of map is called a *direct address table*. Since arrays have a fixed\n",
"size, the implementer now needs to know the client's desire for the *capacity*\n",
"of the table (i.e., the number of bindings that can be stored in it) whenever an\n",
"empty table is created. That leads to the following interface:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "019d8d23",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"module type DirectAddressMap =\n",
" sig\n",
" type 'v t\n",
" val insert : int -> 'v -> 'v t -> unit\n",
" val find : int -> 'v t -> 'v option\n",
" val remove : int -> 'v t -> unit\n",
" val create : int -> 'v t\n",
" val of_list : int -> (int * 'v) list -> 'v t\n",
" val bindings : 'v t -> (int * 'v) list\n",
" end\n"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"module type DirectAddressMap = sig\n",
" (** [t] is the type of maps that bind keys of type int to values of\n",
" type ['v]. *)\n",
" type 'v t\n",
"\n",
" (** [insert k v m] mutates map [m] to bind [k] to [v]. If [k] was\n",
" already bound in [m], that binding is replaced by the binding to\n",
" [v] in the new map. Requires: [k] is in bounds for [m]. *)\n",
" val insert : int -> 'v -> 'v t -> unit\n",
"\n",
" (** [find k m] is [Some v] if [k] is bound to [v] in [m], and [None]\n",
" if not. Requires: [k] is in bounds for [m]. *)\n",
" val find : int -> 'v t -> 'v option\n",
"\n",
" (** [remove k m] mutates [m] to remove any binding of [k]. If [k] was\n",
" not bound in [m], then the map is unchanged. Requires: [k] is in\n",
" bounds for [m]. *)\n",
" val remove : int -> 'v t -> unit\n",
"\n",
" (** [create c] creates a map with capacity [c]. Keys [0] through [c-1]\n",
" are _in bounds_ for the map. *)\n",
" val create : int -> 'v t\n",
"\n",
" (** [of_list c lst] is a map containing the same bindings as\n",
" association list [lst] and with capacity [c]. Requires: [lst] does\n",
" not contain any duplicate keys, and every key in [lst] is in\n",
" bounds for capacity [c]. *)\n",
" val of_list : int -> (int * 'v) list -> 'v t\n",
"\n",
" (** [bindings m] is an association list containing the same bindings\n",
" as [m]. There are no duplicate keys in the list. *)\n",
" val bindings : 'v t -> (int * 'v) list\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "d04e480b",
"metadata": {},
"source": [
"{{ video_embed | replace(\"%%VID%%\", \"eDd9i-imDYo\")}}\n",
"\n",
"Here is an implementation of that interface:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ee887655",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"module ArrayMap : DirectAddressMap\n"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"module ArrayMap : DirectAddressMap = struct\n",
" (** AF: [[|Some v0; Some v1; ... |]] represents {0 : v0, 1 : v1, ...}.\n",
" If element [i] of the array is instead [None], then [i] is not\n",
" bound in the map.\n",
" RI: None. *)\n",
" type 'v t = 'v option array\n",
"\n",
" (** Efficiency: O(1) *)\n",
" let insert k v a = a.(k) <- Some v\n",
"\n",
" (** Efficiency: O(1) *)\n",
" let find k a = a.(k)\n",
"\n",
" (** Efficiency: O(1) *)\n",
" let remove k a = a.(k) <- None\n",
"\n",
" (** Efficiency: O(c) *)\n",
" let create c = Array.make c None\n",
"\n",
" (** Efficiency: O(c) *)\n",
" let of_list c lst =\n",
" (* O(c) *)\n",
" let a = create c in\n",
" (* O(c) * O(1) = O(c) *)\n",
" List.iter (fun (k, v) -> insert k v a) lst;\n",
" a\n",
"\n",
" (** Efficiency: O(c) *)\n",
" let bindings a =\n",
" let bs = ref [] in\n",
" (* O(1) *)\n",
" let add_binding k v =\n",
" match v with None -> () | Some v -> bs := (k, v) :: !bs\n",
" in\n",
" (* O(c) *)\n",
" Array.iteri add_binding a;\n",
" !bs\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "a7802c78",
"metadata": {},
"source": [
"Its efficiency is great! The `insert`, `find`, and `remove` operations are\n",
"constant time. But that comes at the expense of forcing keys to be integers.\n",
"Moreover, they need to be small integers (or at least integers from a small\n",
"range), otherwise the arrays we use will need to be huge.\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"mrpti_Guevs\")}}\n",
"\n",
"## Maps as Hash Tables\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"NyZ07rpq7tk\")}}\n",
"\n",
"Arrays offer constant time performance, but come with severe restrictions on\n",
"keys. Association lists don't place those restrictions on keys, but they also\n",
"don't offer constant time performance. Is there a way to get the best of both\n",
"worlds? Yes (more or less)! *Hash tables* are the solution.\n",
"\n",
"The key idea is that we assume the existence of a *hash function* `hash : 'a ->\n",
"int` that can convert any key to a non-negative integer. Then we can use that\n",
"function to index into an array, as we did with direct address tables. Of\n",
"course, we want the hash function itself to run in constant time, otherwise the\n",
"operations that use it would not be efficient.\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"8IJpySZ5iLM\")}}\n",
"\n",
"That leads to the following interface, in which the client of the hash table has\n",
"to pass in a hash function when a table is created:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c8ec5022",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"module type TableMap =\n",
" sig\n",
" type ('k, 'v) t\n",
" val insert : 'k -> 'v -> ('k, 'v) t -> unit\n",
" val find : 'k -> ('k, 'v) t -> 'v option\n",
" val remove : 'k -> ('k, 'v) t -> unit\n",
" val create : ('k -> int) -> int -> ('k, 'v) t\n",
" val bindings : ('k, 'v) t -> ('k * 'v) list\n",
" val of_list : ('k -> int) -> ('k * 'v) list -> ('k, 'v) t\n",
" end\n"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"module type TableMap = sig\n",
" (** [('k, 'v) t] is the type of mutable table-based maps that bind\n",
" keys of type ['k] to values of type ['v]. *)\n",
" type ('k, 'v) t\n",
"\n",
" (** [insert k v m] mutates map [m] to bind [k] to [v]. If [k] was\n",
" already bound in [m], that binding is replaced by the binding to\n",
" [v]. *)\n",
" val insert : 'k -> 'v -> ('k, 'v) t -> unit\n",
"\n",
" (** [find k m] is [Some v] if [m] binds [k] to [v], and [None] if [m]\n",
" does not bind [k]. *)\n",
" val find : 'k -> ('k, 'v) t -> 'v option\n",
"\n",
" (** [remove k m] mutates [m] to remove any binding of [k]. If [k] was\n",
" not bound in [m], the map is unchanged. *)\n",
" val remove : 'k -> ('k, 'v) t -> unit\n",
"\n",
" (** [create hash c] creates a new table map with capacity [c] that\n",
" will use [hash] as the function to convert keys to integers.\n",
" Requires: The output of [hash] is always non-negative, and [hash]\n",
" runs in constant time. *)\n",
" val create : ('k -> int) -> int -> ('k, 'v) t\n",
"\n",
" (** [bindings m] is an association list containing the same bindings\n",
" as [m]. *)\n",
" val bindings : ('k, 'v) t -> ('k * 'v) list\n",
"\n",
" (** [of_list hash lst] creates a map with the same bindings as [lst],\n",
" using [hash] as the hash function. Requires: [lst] does not\n",
" contain any duplicate keys. *)\n",
" val of_list : ('k -> int) -> ('k * 'v) list -> ('k, 'v) t\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "42916420",
"metadata": {},
"source": [
"One immediate problem with this idea is what to do if the output of the hash is\n",
"not within the bounds of the array. It's easy to solve this: if `a` is the\n",
"length of the array then computing `(hash k) mod a` will return an index that is\n",
"within bounds.\n",
"\n",
"Another problem is what to do if the hash function is not *injective*, meaning\n",
"that it is not one-to-one. Then multiple keys could *collide* and need to be\n",
"stored at the same index in the array. That's okay! We deliberately allow that.\n",
"But it does mean we need a strategy for what to do when keys collide.\n",
"\n",
"There are two well-known strategies for dealing with collisions. One is to store\n",
"multiple bindings at each array index. The array elements are called *buckets*.\n",
"Typically, the bucket is implemented as a linked list. This strategy is known by\n",
"many names, including *chaining*, *closed addressing*, and *open hashing*. We'll\n",
"use **chaining** as the name. To check whether an element is in the hash table,\n",
"the key is first hashed to find the correct bucket to look in. Then, the linked\n",
"list is scanned to see if the desired element is present. If the linked list is\n",
"short, this scan is very quick. An element is added or removed by hashing it to\n",
"find the correct bucket. Then, the bucket is checked to see if the element is\n",
"there, and finally the element is added or removed appropriately from the bucket\n",
"in the usual way for linked lists.\n",
"\n",
"The other strategy is to store bindings at places other than their proper\n",
"location according to the hash. When adding a new binding to the hash table\n",
"would create a collision, the insert operation instead finds an empty location\n",
"in the array to put the binding. This strategy is (confusingly) known as\n",
"*probing*, *open addressing*, and *closed hashing*. We'll use **probing** as the\n",
"name. A simple way to find an empty location is to search ahead through the\n",
"array indices with a fixed stride (often 1), looking for an unused entry; this\n",
"*linear probing* strategy tends to produce a lot of clustering of elements in\n",
"the table, leading to bad performance. A better strategy is to use a second hash\n",
"function to compute the probing interval; this strategy is called *double\n",
"hashing*. Regardless of how probing is implemented, however, the time required\n",
"to search for or add an element grows rapidly as the hash table fills up.\n",
"\n",
"Chaining has often been preferred over probing in software implementations,\n",
"because it's easy to implement the linked lists in software. Hardware\n",
"implementations have often used probing, when the size of the table is fixed by\n",
"circuitry. But some modern software implementations are re-examining the\n",
"performance benefits of probing.\n",
"\n",
"### Chaining Representation\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"LOWAC3WOl6Q\")}}\n",
"\n",
"Here is a representation type for a hash table that uses chaining:\n",
"\n",
"```ocaml\n",
"type ('k, 'v) t = {\n",
" hash : 'k -> int;\n",
" mutable size : int;\n",
" mutable buckets : ('k * 'v) list array\n",
"}\n",
"```\n",
"\n",
"The `buckets` array has elements that are association lists, which store the\n",
"bindings. The `hash` function is used to determine which bucket a key goes into.\n",
"The `size` is used to keep track of the number of bindings currently in the\n",
"table, since that would be expensive to compute by iterating over `buckets`.\n",
"\n",
"Here are the AF and RI:\n",
"```\n",
" (** AF: If [buckets] is\n",
" [| [(k11,v11); (k12,v12); ...];\n",
" [(k21,v21); (k22,v22); ...];\n",
" ... |]\n",
" that represents the map\n",
" {k11:v11, k12:v12, ...,\n",
" k21:v21, k22:v22, ..., ...}.\n",
" RI: No key appears more than once in array (so, no\n",
" duplicate keys in association lists). All keys are\n",
" in the right buckets: if [k] is in [buckets] at index\n",
" [b] then [hash(k) = b]. The output of [hash] must always\n",
" be non-negative. [hash] must run in constant time.*)\n",
"```\n",
"\n",
"What would the efficiency of `insert`, `find`, and `remove` be for this rep\n",
"type? All require\n",
"\n",
"- hashing the key (constant time),\n",
"- indexing into the appropriate bucket (constant time), and\n",
"- finding out whether the key is already in the association list (linear in the\n",
" number of elements in that list).\n",
"\n",
"So the efficiency of the hash table depends on the number of elements in each\n",
"bucket. That, in turn, is determined by how well the hash function distributes\n",
"keys across all the buckets.\n",
"\n",
"A terrible hash function, such as the constant function `fun k -> 42`, would put\n",
"all keys into same bucket. Then every operation would be linear in the number\n",
"$n$ of bindings in the map—that is, $O(n)$. We definitely don't want\n",
"that.\n",
"\n",
"Instead, we want hash functions that distribute keys more or less randomly\n",
"across the buckets. Then the expected length of every bucket will be about the\n",
"same. If we could arrange that, on average, the bucket length were a constant\n",
"$L$, then `insert`, `find`, and `remove` would all in expectation run in time\n",
"$O(L)$.\n",
"\n",
"### Resizing\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"mn2pDfusFyY\")}}\n",
"\n",
"How could we arrange buckets to have expected constant length? To answer\n",
"that, let's think about the number of bindings and buckets in the table. Define\n",
"the *load factor* of the table to be\n",
"\n",
"$$\n",
"\\frac{\\mbox{number of bindings}}{\\mbox{number of buckets}}\n",
"$$\n",
"\n",
"So a table with 20 bindings and 10 buckets has a load factor of 2, and a table\n",
"with 10 bindings and 20 buckets has a load factor of 0.5. The load factor is\n",
"therefore the average number of bindings in a bucket. So if we could keep the\n",
"load factor constant, we could keep $L$ constant, thereby keeping the\n",
"performance to (expected) constant time.\n",
"\n",
"Toward that end, note that the number of bindings is not under the control of\n",
"the hash table implementer—but the number of buckets is. So by changing\n",
"the number of buckets, the implementer can change the load factor. A common\n",
"strategy is to keep the load factor from approximately 1/2 to 2. Then each\n",
"bucket contains only a couple bindings, and expected constant-time performance\n",
"is guaranteed.\n",
"\n",
"There's no way for the implementer to know in advance, though, exactly how many\n",
"buckets will be needed. So instead, the implementer will have to *resize* the\n",
"bucket array whenever the load factor gets too high. Typically the newly\n",
"allocated bucket will be of a size to restore the load factor to about 1.\n",
"\n",
"Putting those two ideas together, if the load factor reaches 2, then there are\n",
"twice as many bindings as buckets in the table. So by doubling the size of the\n",
"array, we can restore the load factor to 1. Similarly, if the load factor\n",
"reaches 1/2, then there are twice as many buckets as bindings, and halving the\n",
"size of the array will restore the load factor to 1.\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"BzusuFH1tNw\")}}\n",
"\n",
"Resizing the bucket array to become larger is an essential technique for hash\n",
"tables. Resizing it to become smaller, though, is not essential. As long as the\n",
"load factor is bounded by a constant from above, we can achieve expected\n",
"constant bucket length. So not all implementations will reduce the size of the\n",
"array. Although doing so would recover some space, it might not be worth the\n",
"effort. That's especially true if the size of the hash table cycles over time:\n",
"although sometimes it becomes smaller, eventually it becomes bigger again.\n",
"\n",
"Unfortunately, resizing would seem to ruin our expected constant-time\n",
"performance though. Insertion of a binding might cause the load factor to go\n",
"over 2, thus causing a resize. When the resize occurs, all the existing bindings\n",
"must be rehashed and added to the new bucket array. Thus, insertion has become a\n",
"worst-case linear time operation! The same is true for removal, if we resize the\n",
"array to become smaller when the load factor is too low.\n",
"\n",
"### Implementation\n",
"\n",
"The implementation of a hash table, below, puts together all the pieces we\n",
"discussed above."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e9674279",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"module HashMap : TableMap\n"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"module HashMap : TableMap = struct\n",
"\n",
" (** AF and RI: above *)\n",
" type ('k, 'v) t = {\n",
" hash : 'k -> int;\n",
" mutable size : int;\n",
" mutable buckets : ('k * 'v) list array\n",
" }\n",
"\n",
" (** [capacity tab] is the number of buckets in [tab].\n",
" Efficiency: O(1) *)\n",
" let capacity {buckets} =\n",
" Array.length buckets\n",
"\n",
" (** [load_factor tab] is the load factor of [tab], i.e., the number of\n",
" bindings divided by the number of buckets. *)\n",
" let load_factor tab =\n",
" float_of_int tab.size /. float_of_int (capacity tab)\n",
"\n",
" (** Efficiency: O(n) *)\n",
" let create hash n =\n",
" {hash; size = 0; buckets = Array.make n []}\n",
"\n",
" (** [index k tab] is the index at which key [k] should be stored in the\n",
" buckets of [tab].\n",
" Efficiency: O(1) *)\n",
" let index k tab =\n",
" (tab.hash k) mod (capacity tab)\n",
"\n",
" (** [insert_no_resize k v tab] inserts a binding from [k] to [v] in [tab]\n",
" and does not resize the table, regardless of what happens to the\n",
" load factor.\n",
" Efficiency: expected O(L) *)\n",
" let insert_no_resize k v tab =\n",
" let b = index k tab in (* O(1) *)\n",
" let old_bucket = tab.buckets.(b) in\n",
" tab.buckets.(b) <- (k,v) :: List.remove_assoc k old_bucket; (* O(L) *)\n",
" if not (List.mem_assoc k old_bucket) then\n",
" tab.size <- tab.size + 1;\n",
" ()\n",
"\n",
" (** [rehash tab new_capacity] replaces the buckets array of [tab] with a new\n",
" array of size [new_capacity], and re-inserts all the bindings of [tab]\n",
" into the new array. The keys are re-hashed, so the bindings will\n",
" likely land in different buckets.\n",
" Efficiency: O(n), where n is the number of bindings. *)\n",
" let rehash tab new_capacity =\n",
" (* insert [(k, v)] into [tab] *)\n",
" let rehash_binding (k, v) =\n",
" insert_no_resize k v tab\n",
" in\n",
" (* insert all bindings of bucket into [tab] *)\n",
" let rehash_bucket bucket =\n",
" List.iter rehash_binding bucket\n",
" in\n",
" let old_buckets = tab.buckets in\n",
" tab.buckets <- Array.make new_capacity []; (* O(n) *)\n",
" tab.size <- 0;\n",
" (* [rehash_binding] is called by [rehash_bucket] once for every binding *)\n",
" Array.iter rehash_bucket old_buckets (* expected O(n) *)\n",
"\n",
" (* [resize_if_needed tab] resizes and rehashes [tab] if the load factor\n",
" is too big or too small. Load factors are allowed to range from\n",
" 1/2 to 2. *)\n",
" let resize_if_needed tab =\n",
" let lf = load_factor tab in\n",
" if lf > 2.0 then\n",
" rehash tab (capacity tab * 2)\n",
" else if lf < 0.5 then\n",
" rehash tab (capacity tab / 2)\n",
" else ()\n",
"\n",
" (** Efficiency: O(n) *)\n",
" let insert k v tab =\n",
" insert_no_resize k v tab; (* O(L) *)\n",
" resize_if_needed tab (* O(n) *)\n",
"\n",
" (** Efficiency: expected O(L) *)\n",
" let find k tab =\n",
" List.assoc_opt k tab.buckets.(index k tab)\n",
"\n",
" (** [remove_no_resize k tab] removes [k] from [tab] and does not trigger\n",
" a resize, regardless of what happens to the load factor.\n",
" Efficiency: expected O(L) *)\n",
" let remove_no_resize k tab =\n",
" let b = index k tab in\n",
" let old_bucket = tab.buckets.(b) in\n",
" tab.buckets.(b) <- List.remove_assoc k tab.buckets.(b);\n",
" if List.mem_assoc k old_bucket then\n",
" tab.size <- tab.size - 1;\n",
" ()\n",
"\n",
" (** Efficiency: O(n) *)\n",
" let remove k tab =\n",
" remove_no_resize k tab; (* O(L) *)\n",
" resize_if_needed tab (* O(n) *)\n",
"\n",
" (** Efficiency: O(n) *)\n",
" let bindings tab =\n",
" Array.fold_left\n",
" (fun acc bucket ->\n",
" List.fold_left\n",
" (* 1 cons for every binding, which is O(n) *)\n",
" (fun acc (k,v) -> (k,v) :: acc)\n",
" acc bucket)\n",
" [] tab.buckets\n",
"\n",
" (** Efficiency: O(n^2) *)\n",
" let of_list hash lst =\n",
" let m = create hash (List.length lst) in (* O(n) *)\n",
" List.iter (fun (k, v) -> insert k v m) lst; (* n * O(n) is O(n^2) *)\n",
" m\n",
"end"
]
},
{
"cell_type": "markdown",
"id": "247bb379",
"metadata": {},
"source": [
"{{ video_embed | replace(\"%%VID%%\", \"FN-YyNaSkz8\")}}\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"Du4SxDJzS6g\")}}\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"GKtcy5AfPgc\")}}\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"YQUHqv-RXI8\")}}\n",
"\n",
"An optimization of `rehash` is possible. When it calls `insert_no_resize` to\n",
"re-insert a binding, extra work is being done: there's no need for that\n",
"insertion to call `remove_assoc` or `mem_assoc`, because we are guaranteed the\n",
"binding does not contain a duplicate key. We could omit that work. If the hash\n",
"function is good, it's only a constant amount of work that we save. But if the\n",
"hash function is bad and doesn't distribute keys uniformly, that could be an\n",
"important optimization.\n",
"\n",
"## Hash Functions\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"tGktnJWmCy0\")}}\n",
"\n",
"Hash tables are one of the most useful data structures ever invented.\n",
"Unfortunately, they are also one of the most misused. Code built using hash\n",
"tables often falls far short of achievable performance. There are two reasons\n",
"for this:\n",
"\n",
"- Clients choose poor hash functions that do not distribute keys randomly over\n",
" buckets.\n",
"\n",
"- Hash table abstractions do not adequately specify what is required of the hash\n",
" function, or make it difficult to provide a good hash function.\n",
"\n",
"Clearly, a bad hash function can destroy our attempts at a constant running\n",
"time. A lot of obvious hash function choices are bad. For example, if we're\n",
"mapping names to phone numbers, then hashing each name to its length would be a\n",
"very poor function, as would a hash function that used only the first name, or\n",
"only the last name. We want our hash function to use all of the information in\n",
"the key. This is a bit of an art. While hash tables are extremely effective when\n",
"used well, all too often poor hash functions are used that sabotage performance.\n",
"\n",
"Hash tables work well when the hash function looks random. If it is to look\n",
"random, this means that any change to a key, even a small one, should change the\n",
"bucket index in an apparently random way. If we imagine writing the bucket index\n",
"as a binary number, a small change to the key should randomly flip the bits in\n",
"the bucket index. This is called information *diffusion*. For example, a one-bit\n",
"change to the key should cause every bit in the index to flip with 1/2\n",
"probability.\n",
"\n",
"**Client vs. implementer.** As we've described it, the hash function is a single\n",
"function that maps from the key type to a bucket index. In practice, the hash\n",
"function is the composition of *two* functions, one provided by the client and\n",
"one by the implementer. This is because the implementer doesn't understand the\n",
"element type, the client doesn't know how many buckets there are, and the\n",
"implementer probably doesn't trust the client to achieve diffusion.\n",
"\n",
"The client function `hash_c` first converts the key into an integer hash code,\n",
"and the implementation function `hash_i` converts the hash code into a bucket\n",
"index. The actual hash function is the composition of these two functions. As a\n",
"hash table designer, you need to figure out which of the client hash function\n",
"and the implementation hash function is going to provide diffusion. If clients\n",
"are sufficiently savvy, it makes sense to push the diffusion onto them, leaving\n",
"the hash table implementation as simple and fast as possible. The easy way to\n",
"accomplish this is to break the computation of the bucket index into three\n",
"steps.\n",
"\n",
"1. Serialization: Transform the key into a stream of bytes that contains all of\n",
" the information in the original key. Two equal keys must result in the same\n",
" byte stream. Two byte streams should be equal only if the keys are actually\n",
" equal. How to do this depends on the form of the key. If the key is a\n",
" string, then the stream of bytes would simply be the characters of the\n",
" string.\n",
"\n",
"2. Diffusion: Map the stream of bytes into a large integer *x* in a way that\n",
" causes every change in the stream to affect the bits of *x* apparently\n",
" randomly. There is a tradeoff in performance versus randomness (and\n",
" security) here.\n",
"\n",
"3. Compression: Reduce that large integer to be within the range of the\n",
" buckets. For example, compute the hash bucket index as *x* mod *m*. This\n",
" is particularly cheap if *m* is a power of two.\n",
"\n",
"Unfortunately, hash table implementations are rarely forthcoming about what they\n",
"assume of client hash functions. So it can be hard to know, as a client, how to\n",
"get good performance from a table. The more information the implementation can\n",
"provide to a client about how well distributed keys are in buckets, the better.\n",
"\n",
"## Standard Library `Hashtbl`\n",
"\n",
"Although it's great to know how to implement a hash table, and to see how\n",
"mutability is used in doing so, it's also great *not* to have to implement a\n",
"data structure yourself in your own projects. Fortunately the OCaml standard\n",
"library does provide a module `Hashtbl` [sic] that implements hash tables. You\n",
"can think of this module as the imperative equivalent of the functional `Map`\n",
"module.\n",
"\n",
"**Hash function.** The function `Hashtbl.hash : 'a -> int` takes responsibility\n",
"for serialization and diffusion. It is capable of hashing any type of value.\n",
"That includes not just integers but strings, lists, trees, and so forth. So how\n",
"does it run in constant time, if the length of a tree or size of a tree can be\n",
"arbitrarily large? It looks only at a predetermined number of *meaningful nodes*\n",
"of the structure it is hashing. By default, that number is 10. A meaningful node\n",
"is an integer, floating-point number, string, character, booleans or constant\n",
"constructor. You can see that as we hash these lists:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e9cc2eb4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"- : int = 635296333\n"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"- : int = 822221246\n"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"- : int = 822221246\n"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"- : int = 822221246\n"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Hashtbl.hash [1; 2; 3; 4; 5; 6; 7; 8; 9];;\n",
"Hashtbl.hash [1; 2; 3; 4; 5; 6; 7; 8; 9; 10];;\n",
"Hashtbl.hash [1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11];;\n",
"Hashtbl.hash [1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12];;"
]
},
{
"cell_type": "markdown",
"id": "8de3472e",
"metadata": {},
"source": [
"The hash values stop changing after the list goes beyond 10 elements. That has\n",
"implications for how we use this built-in hash function: it will not necessarily\n",
"provide good diffusion for large data structures, which means performance could\n",
"degrade as collisions become common. To support clients who want to hash such\n",
"structures, `Hashtbl` provides another function `hash_param` which can be\n",
"configured to examine more nodes.\n",
"\n",
"**Hash table.** Here's an abstract of the hash table interface:\n",
"\n",
"```ocaml\n",
"module type Hashtbl = struct\n",
" type ('a, 'b) t\n",
" val create : int -> ('a, 'b) t\n",
" val add : ('a, 'b) t -> 'a -> 'b -> unit\n",
" val find : ('a, 'b) t -> 'a -> 'b\n",
" val remove : ('a, 'b) t -> 'a -> unit\n",
" ...\n",
"end\n",
"```\n",
"\n",
"The representation type `('a, 'b) Hashtbl.t` maps keys of type `'a` to values of\n",
"type `'b`. The `create` function initializes a hash table to have a given\n",
"capacity, as our implementation above did. But rather than requiring the client\n",
"to provide a hash function, the module uses `Hashtbl.hash`.\n",
"\n",
"Resizing occurs when the load factor exceeds 2. Let's see that happen. First,\n",
"we'll create a table and fill it up:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d6b0056b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"val t : ('_weak1, '_weak2) Hashtbl.t = \n"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"- : unit = ()\n"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open Hashtbl;;\n",
"let t = create 16;;\n",
"for i = 1 to 16 do\n",
" add t i (string_of_int i)\n",
"done;;"
]
},
{
"cell_type": "markdown",
"id": "dfa9edb1",
"metadata": {},
"source": [
"We can query the hash table to find out how the bindings are distributed\n",
"over buckets with `Hashtbl.stats`:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8db1d7e0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"- : Hashtbl.statistics =\n",
"{num_bindings = 16; num_buckets = 16; max_bucket_length = 3;\n",
" bucket_histogram = [|6; 5; 4; 1|]}\n"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stats t"
]
},
{
"cell_type": "markdown",
"id": "fdfb8eed",
"metadata": {},
"source": [
"The number of bindings and number of buckets are equal, so the load factor is 1.\n",
"The bucket histogram is an array `a` in which `a.(i)` is the number of buckets whose size is `i`.\n",
"\n",
"Let's pump up the load factor to 2:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "8eefb98a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"- : unit = ()\n"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"- : Hashtbl.statistics =\n",
"{num_bindings = 32; num_buckets = 16; max_bucket_length = 4;\n",
" bucket_histogram = [|3; 3; 3; 5; 2|]}\n"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"for i = 17 to 32 do\n",
" add t i (string_of_int i)\n",
"done;;\n",
"stats t;;"
]
},
{
"cell_type": "markdown",
"id": "a482c26f",
"metadata": {},
"source": [
"Now adding one more binding will trigger a resize, which doubles the number of\n",
"buckets:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fc8220a8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"- : unit = ()\n"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"- : Hashtbl.statistics =\n",
"{num_bindings = 33; num_buckets = 32; max_bucket_length = 3;\n",
" bucket_histogram = [|11; 11; 8; 2|]}\n"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"add t 33 \"33\";;\n",
"stats t;;"
]
},
{
"cell_type": "markdown",
"id": "d05202fb",
"metadata": {},
"source": [
"But `Hashtbl` does not implement resize on removal:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "65c605e3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"- : unit = ()\n"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"- : Hashtbl.statistics =\n",
"{num_bindings = 0; num_buckets = 32; max_bucket_length = 0;\n",
" bucket_histogram = [|32|]}\n"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"for i = 1 to 33 do\n",
" remove t i\n",
"done;;\n",
"stats t;;"
]
},
{
"cell_type": "markdown",
"id": "61b5873b",
"metadata": {},
"source": [
"The number of buckets is still 32, even though all bindings have been removed.\n",
"\n",
"```{note}\n",
"Java's `HashMap` has a default constructor `HashMap()` that creates an empty\n",
"hash table with a capacity of 16 that resizes when the load factor exceeds 0.75\n",
"rather than 2. So Java hash tables would tend to have a shorter bucket length\n",
"than OCaml hash tables, but also would tend to take more space to store because\n",
"of empty buckets.\n",
"```\n",
"\n",
"**Client-provided hash functions.** What if a client of `Hashtbl` found that the\n",
"default hash function was leading to collisions, hence poor performance? Then\n",
"it would make sense to change to a different hash function. To support that,\n",
"`Hashtbl` provides a functorial interface similar to `Map`. The functor\n",
"is `Hashtbl.Make`, and it requires an input of the following module type:\n",
"\n",
"```ocaml\n",
"module type HashedType = sig\n",
" type t\n",
" val equal : t -> t -> bool\n",
" val hash : t -> int\n",
"end\n",
"```\n",
"\n",
"Type `t` is the key type for the table, and the two functions `equal` and `hash`\n",
"say how to compare keys for equality and how to hash them. If two keys are equal\n",
"according to `equal`, they must have the same hash value according to `hash`. If\n",
"that requirement were violated, the hash table would no longer operate\n",
"correctly. For example, suppose that `equal k1 k2` holds but\n",
"`hash k1 <> hash k2`. Then `k1` and `k2` would be stored in different buckets.\n",
"So if a client added a binding of `k1` to `v`, then looked up `k2`, they would\n",
"not get `v` back.\n",
"\n",
"```{note}\n",
"That final requirement might sound familiar from Java. There, if you override\n",
"`Object.equals()` and `Object.hashCode()` you must ensure the same\n",
"correspondence.\n",
"```"
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"formats": "md:myst",
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.10.3"
}
},
"kernelspec": {
"display_name": "OCaml",
"language": "OCaml",
"name": "ocaml-jupyter"
},
"language_info": {
"codemirror_mode": "text/x-ocaml",
"file_extension": ".ml",
"mimetype": "text/x-ocaml",
"name": "OCaml",
"nbconverter_exporter": null,
"pygments_lexer": "OCaml",
"version": "4.14.0"
},
"source_map": [
14,
63,
96,
121,
161,
198,
233,
239,
279,
308,
342,
501,
616,
721,
726,
756,
762,
767,
769,
776,
781,
786,
789,
793,
798
]
},
"nbformat": 4,
"nbformat_minor": 5
}