{
"cells": [
{
"cell_type": "markdown",
"id": "39ce98ac",
"metadata": {},
"source": [
"# Red-Black Trees\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"uOKHuYloCsI\")}}\n",
"\n",
"As we've now seen, hash tables are an efficient data structure for implementing\n",
"a map ADT. They offer amortized, expected constant-time performance—which\n",
"is a subtle guarantee because of those \"amortized\" and \"expected\" qualifiers we\n",
"have to add. Hash tables also require mutability to implement. As functional\n",
"programmers, we prefer to avoid mutability when possible.\n",
"\n",
"So, let's investigate how to implement functional maps. One of the best data\n",
"structures for that is the *red-black tree*, which is a kind of balanced binary\n",
"search tree that offers worst-case logarithmic performance. So on one hand the\n",
"performance is somewhat worse than hash tables (logarithmic vs. constant), but\n",
"on the other hand we don't have to qualify the performance with words like\n",
"\"amortized\" and \"expected\". Logarithmic is actually still plenty efficient for\n",
"even very large workloads. And, we get to avoid mutability!\n",
"\n",
"## Binary Search Trees\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"Xx9V5vrkjJA\")}}\n",
"\n",
"A **binary search tree** (BST) is a binary tree with the following\n",
"representation invariant:\n",
"\n",
"> For any node *n*, every node in the left subtree of *n* has a value less than\n",
"> *n*'s value, and every node in the right subtree of *n* has a value greater\n",
"> than *n*'s value.\n",
"\n",
"We call that the *BST invariant*.\n",
"\n",
"Here is code that implements a couple of operations on a BST:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9a01281e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"type 'a tree = Node of 'a * 'a tree * 'a tree | Leaf\n"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"val mem : 'a -> 'a tree -> bool = \n"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"val insert : 'a -> 'a tree -> 'a tree = \n"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type 'a tree = Node of 'a * 'a tree * 'a tree | Leaf\n",
"\n",
"(** [mem x t] is [true] iff [x] is a member of [t]. *)\n",
"let rec mem x = function\n",
" | Leaf -> false\n",
" | Node (y, l, r) ->\n",
" if x < y then mem x l\n",
" else if x > y then mem x r\n",
" else true\n",
"\n",
"(** [insert x t] is [t] . *)\n",
"let rec insert x = function\n",
" | Leaf -> Node (x, Leaf, Leaf)\n",
" | Node (y, l, r) as t ->\n",
" if x < y then Node (y, insert x l, r)\n",
" else if x > y then Node (y, l, insert x r)\n",
" else t"
]
},
{
"cell_type": "markdown",
"id": "6d3bb56c",
"metadata": {},
"source": [
"What is the running time of those operations? Since `insert` is just a `mem`\n",
"with an extra constant-time node creation, we focus on the `mem` operation.\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"pFUJgaYUH6o\")}}\n",
"\n",
"The running time of `mem` is $O(h)$, where $h$ is the height of the tree,\n",
"because every recursive call descends one level in the tree. What's the\n",
"worst-case height of a tree? It occurs with a tree of $n$ nodes all in a single\n",
"long branch—imagine adding the numbers 1,2,3,4,5,6,7 in order into the\n",
"tree. So the worst-case running time of `mem` is still $O(n)$, where $n$ is the\n",
"number of nodes in the tree.\n",
"\n",
"What is a good shape for a tree that would allow for fast lookup? A *perfect\n",
"binary tree* has the largest number of nodes $n$ for a given height $h$, which\n",
"is $n = 2^{h+1} - 1$. Therefore $h = \\log(n+1) - 1$, which is $O(\\log n)$.\n",
"\n",
"If a tree with $n$ nodes is kept balanced, its height is $O(\\log n)$, which\n",
"leads to a lookup operation running in time $O(\\log n)$.\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"BGkipOJdH3U\")}}\n",
"\n",
"How can we keep a tree balanced? It can become unbalanced during element\n",
"insertion or deletion. Most balanced tree schemes involve adding or deleting an\n",
"element just like in a normal binary search tree, followed by some kind of *tree\n",
"surgery* to rebalance the tree. Some examples of balanced binary search tree\n",
"data structures include:\n",
"\n",
"- AVL trees (1962)\n",
"- 2-3 trees (1970's)\n",
"- Red-black trees (1970's)\n",
"\n",
"Each of these ensures $O(\\log n)$ running time by enforcing a stronger invariant\n",
"on the data structure than just the binary search tree invariant.\n",
"\n",
"## Red-Black Trees\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"JzhG0jDxGqg\")}}\n",
"\n",
"Red-black trees are relatively simple balanced binary tree data structure. The\n",
"idea is to strengthen the representation invariant so a tree has height\n",
"logarithmic in the number of nodes $n$. To help enforce the invariant, we color\n",
"each node of the tree either *red* or *black*. Where it matters, we consider the\n",
"color of an empty tree to be black."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f7a4fd71",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"type color = Red | Black\n"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"type 'a rbtree = Leaf | Node of color * 'a * 'a rbtree * 'a rbtree\n"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type color = Red | Black\n",
"type 'a rbtree = Leaf | Node of color * 'a * 'a rbtree * 'a rbtree"
]
},
{
"cell_type": "markdown",
"id": "a3165445",
"metadata": {},
"source": [
"Here are the new conditions we add to the binary search tree representation\n",
"invariant:\n",
"\n",
"1. **Local Invariant:** There are no two adjacent red nodes along any path.\n",
"\n",
"2. **Global Invariant:** Every path from the root to a leaf has the same number\n",
" of black nodes. This number is called the *black height* (BH) of the tree.\n",
"\n",
"If a tree satisfies these two conditions, it must also be the case that every\n",
"subtree of the tree also satisfies the conditions. If a subtree violated either\n",
"of the conditions, the whole tree would also.\n",
"\n",
"Additionally, by convention the root of the tree is colored black. This does not\n",
"violate the invariants, but it also is not required by them.\n",
"\n",
"With these invariants, the longest possible path from the root to an empty node\n",
"would alternately contain red and black nodes; therefore it is at most twice as\n",
"long as the shortest possible path, which only contains black nodes. The longest\n",
"path cannot have a length greater than twice the length of the paths in a\n",
"perfect binary tree, which is $O(\\log n)$. Therefore, the tree has height\n",
"$O(\\log n)$ and the operations are all asymptotically logarithmic in the number\n",
"of nodes.\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"gDTCRj2-bCU\")}}\n",
"\n",
"How do we check for membership in red-black trees? Exactly the same way as for\n",
"general binary trees."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5f2c4a81",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"val mem : 'a -> 'a rbtree -> bool = \n"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"let rec mem x = function\n",
" | Leaf -> false\n",
" | Node (_, y, l, r) ->\n",
" if x < y then mem x l\n",
" else if x > y then mem x r\n",
" else true"
]
},
{
"cell_type": "markdown",
"id": "cc1a4632",
"metadata": {},
"source": [
"**Okasaki's Algorithm.** More interesting is the `insert` operation. As with\n",
"standard binary trees, we add a node by replacing the leaf found by the search\n",
"procedure. But what can we color that node?\n",
"\n",
"- Coloring it black could increase the black height of that path, violating the\n",
" Global Invariant.\n",
"\n",
"- Coloring it red could make it adjacent to another red node, violating the\n",
" Local Invariant.\n",
"\n",
"So neither choice is safe in general. Chris Okasaki (*Purely Functional Data\n",
"Structures*, 1999) gives an elegant algorithm that solves the problem by opting\n",
"to violate the Local Invariant, then walk up the tree to repair the violation.\n",
"Here's how it works.\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"dCBAhbIEoYM\")}}\n",
"\n",
"We always color the new node red to ensure that the Global Invariant is\n",
"preserved. However, this may destroy the Local Invariant by producing two\n",
"adjacent red nodes. In order to restore the invariant, we consider not only the\n",
"new red node and its red parent, but also its (black) grandparent.\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"igUOhpGICgA\")}}\n",
"\n",
"The next figure shows the four possible cases that can arise. In it, `a`-`d` are\n",
"possibly empty subtrees, and `x`-`z` are values stored at a node. The nodes\n",
"colors are indicated with `R` and `B`.\n",
"\n",
"```text\n",
" 1 2 3 4\n",
"\n",
" Bz Bz Bx Bx\n",
" / \\ / \\ / \\ / \\\n",
" Ry d Rx d a Rz a Ry\n",
" / \\ / \\ / \\ / \\\n",
" Rx c a Ry Ry d b Rz\n",
" / \\ / \\ / \\ / \\\n",
" a b b c b c c d\n",
"```\n",
"\n",
"Notice that in each of these trees, we've carefully labeled the values and nodes\n",
"such that the binary search tree invariant ensures the following ordering:\n",
"\n",
"```\n",
"all nodes in a\n",
" <\n",
" x\n",
" <\n",
" all nodes in b\n",
" <\n",
" y\n",
" <\n",
" all nodes in c\n",
" <\n",
" z\n",
" <\n",
" all nodes in d\n",
"```\n",
"\n",
"Therefore, we can transform the tree to restore the invariant locally by\n",
"replacing any of the above four cases with:\n",
"\n",
"```text\n",
" Ry\n",
" / \\\n",
" Bx Bz\n",
" / \\ / \\\n",
" a b c d\n",
"```\n",
"\n",
"```{tip}\n",
"To really understand Okasaki's algorithm, ensure that the last three diagrams\n",
"make sense. The choice of which labels are placed where in the first diagram is\n",
"crucial. That's what guarantees the ordering holds, hence that the final tree is\n",
"the same in all four cases.\n",
"```\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"D4FJMJUIMSw\")}}\n",
"\n",
"This balance function can be written simply and concisely using pattern\n",
"matching, where each of the four input cases is mapped to the same output case.\n",
"In addition, there is the case where the tree is left unchanged locally."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e95e535d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"val balance : color * 'a * 'a rbtree * 'a rbtree -> 'a rbtree = \n"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"let balance = function\n",
" | Black, z, Node (Red, y, Node (Red, x, a, b), c), d\n",
" | Black, z, Node (Red, x, a, Node (Red, y, b, c)), d\n",
" | Black, x, a, Node (Red, z, Node (Red, y, b, c), d)\n",
" | Black, x, a, Node (Red, y, b, Node (Red, z, c, d)) ->\n",
" Node (Red, y, Node (Black, x, a, b), Node (Black, z, c, d))\n",
" | a, b, c, d -> Node (a, b, c, d)"
]
},
{
"cell_type": "markdown",
"id": "121a1f1c",
"metadata": {},
"source": [
"This balancing transformation possibly breaks the Local Invariant one level up\n",
"in the tree, but it can be restored again at that level in the same way, and so\n",
"on up the tree. In the worst case, the process cascades all the way up to the\n",
"root, resulting in two adjacent red nodes, one of them the root. But if this\n",
"happens, we can just recolor the root black, which increases the black height by\n",
"one. The amount of work is $O(\\log n)$. The `insert` code using `balance` is as\n",
"follows:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "07126cc1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"val insert : 'a -> 'a rbtree -> 'a rbtree = \n"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"let insert x s =\n",
" let rec ins = function\n",
" | Leaf -> Node (Red, x, Leaf, Leaf)\n",
" | Node (color, y, a, b) as s ->\n",
" if x < y then balance (color, y, ins a, b)\n",
" else if x > y then balance (color, y, a, ins b)\n",
" else s\n",
" in\n",
" match ins s with\n",
" | Node (_, y, a, b) -> Node (Black, y, a, b)\n",
" | Leaf -> (* guaranteed to be nonempty *)\n",
" failwith \"RBT insert failed with ins returning leaf\""
]
},
{
"cell_type": "markdown",
"id": "48093059",
"metadata": {},
"source": [
"{{ video_embed | replace(\"%%VID%%\", \"giSzhfuTMMA\")}}\n",
"\n",
"**The remove operation.** Removing an element from a red-black tree works\n",
"analogously. We start with a BST element removal and then do rebalancing. When\n",
"an interior (nonleaf) node is removed, we simply splice it out if it has fewer\n",
"than two nonleaf children; if it has two nonleaf children, we find the next\n",
"value in the tree, which must be found inside its right child.\n",
"\n",
"But, balancing the trees during removal from red-black tree requires considering\n",
"more cases. Deleting a black element from the tree creates the possibility that\n",
"some path in the tree has too few black nodes, breaking the Global Invariant.\n",
"\n",
"Germane and Might invented an elegant algorithm to handle that rebalancing Their\n",
"solution is to create \"doubly-black\" nodes that count twice in determining the\n",
"black height. For more, read their paper: [*Deletion: The Curse of the Red-Black\n",
"Tree* *Journal of Functional Programming*], volume 24, issue 4, July 2014.\n",
"\n",
"[gm]: https://doi.org/10.1017/S0956796814000227\n",
"\n",
"## Maps and Sets from BSTs\n",
"\n",
"{{ video_embed | replace(\"%%VID%%\", \"B_e8Qr4nl4A\")}}\n",
"\n",
"It's easy to use a BST to implement either a map or a set ADT:\n",
"\n",
"- For a map, just store a binding at each node. The nodes are ordered by the\n",
" keys. The values are irrelevant to the ordering.\n",
"\n",
"- For a set, just store an element at each node. The nodes are ordered by the\n",
" elements.\n",
"\n",
"The OCaml standard library does this for the `Map` and `Set` modules. It uses a\n",
"balanced BST that is a variant of an AVL tree. AVL trees are balanced BSTs in\n",
"which the height of paths is allowed to vary by at most 1. The OCaml standard\n",
"library modifies that to allow the height to vary by at most 2. Like red-black\n",
"trees, they achieve worst-case logarithmic performance.\n",
"\n",
"Now that we have a functional map data structure, how does it compare to our\n",
"imperative version, the hash table?\n",
"\n",
"- **Persistence:** Our red-black trees are persistent, but hash tables are\n",
" ephemeral.\n",
"\n",
"- **Performance:** We get guaranteed worst-case logarithmic performance with\n",
" red-black trees, but amortized, expected constant-time with hash tables.\n",
" That's somewhat hard to compare given all the modifiers involved. It's also an\n",
" example of a general phenomenon that persistent data structures often have to\n",
" pay an extra logarithmic cost over the equivalent ephemeral data structures.\n",
"\n",
"- **Convenience:** We have to provide an ordering function for balanced binary\n",
" trees, and a hash function for hash tables. Most libraries provide a default\n",
" hash function for convenience. But the performance of the hash table does\n",
" depend on that hash function truly distributing keys randomly over buckets. If\n",
" it doesn't, the \"expected\" part of the performance guarantee for hash tables\n",
" is violated. So the convenience is a double-edged sword.\n",
"\n",
"There isn't a clear winner here. Since the OCaml library provides both `Map` and\n",
"`Hashtbl`, you get to choose."
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"formats": "md:myst",
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.10.3"
}
},
"kernelspec": {
"display_name": "OCaml",
"language": "OCaml",
"name": "ocaml-jupyter"
},
"language_info": {
"codemirror_mode": "text/x-ocaml",
"file_extension": ".ml",
"mimetype": "text/x-ocaml",
"name": "OCaml",
"nbconverter_exporter": null,
"pygments_lexer": "OCaml",
"version": "4.14.0"
},
"source_map": [
14,
49,
67,
113,
116,
146,
153,
238,
246,
256,
269
]
},
"nbformat": 4,
"nbformat_minor": 5
}