5.6. Functional Data Structures#
A functional data structure is one that does not make use of mutability.
It’s possible to build functional data structures both in functional languages
and in imperative languages. For example, you could build
a Java equivalent to OCaml’s list
type by creating a Node
class whose fields are immutable by virtue of using
the const
keyword.
Functional data structures have the property of being persistent: updating the data structure with one of its operations does not change the existing version of the data structure but instead produces a new version. Both exist and both can still be accessed. A good language implementation will ensure that any parts of the data structure that are not changed by an operation will be shared between the old version and the new version. Any parts that do change will be copied so that the old version may persist. The opposite of a persistent data structure is an ephemeral data structure: changes are destructive, so that only one version exists at any time. Both persistent and ephemeral data structures can be built in both functional and imperative languages.
5.6.1. Lists#
The built-in singly-linked list
data structure in OCaml is functional. We know
that, because we’ve seen how to implement it with algebraic data types. It’s
also persistent, which we can demonstrate:
let lst = [1; 2];;
let lst' = List.tl lst;;
lst;;
val lst : int list = [1; 2]
val lst' : int list = [2]
- : int list = [1; 2]
Taking the tail of lst
does not change the list. Both lst
and lst'
coexist without affecting one another.
5.6.2. Stacks#
We implemented stacks earlier in this chapter. Here’s a terse variant of one of
those implementations, in which we add a to_list
operation to make it easier to
view the contents of the stack in examples:
module type Stack = sig
type 'a t
exception Empty
val empty : 'a t
val is_empty : 'a t -> bool
val push : 'a -> 'a t -> 'a t
val peek : 'a t -> 'a
val pop : 'a t -> 'a t
val size : 'a t -> int
val to_list : 'a t -> 'a list
end
module ListStack : Stack = struct
type 'a t = 'a list
exception Empty
let empty = []
let is_empty = function [] -> true | _ -> false
let push = List.cons
let peek = function [] -> raise Empty | x :: _ -> x
let pop = function [] -> raise Empty | _ :: s -> s
let size = List.length
let to_list = Fun.id
end
Show code cell output
module type Stack =
sig
type 'a t
exception Empty
val empty : 'a t
val is_empty : 'a t -> bool
val push : 'a -> 'a t -> 'a t
val peek : 'a t -> 'a
val pop : 'a t -> 'a t
val size : 'a t -> int
val to_list : 'a t -> 'a list
end
module ListStack : Stack
That implementation is functional, as can be seen above, and also persistent:
open ListStack;;
let s = empty |> push 1 |> push 2;;
let s' = pop s;;
to_list s;;
to_list s';;
val s : int ListStack.t = <abstr>
val s' : int ListStack.t = <abstr>
- : int list = [2; 1]
- : int list = [1]
The value s
is unchanged by the pop
operation that creates s'
. Both
versions of the stack coexist.
The Stack
module type gives us a strong hint that the data structure is
persistent in the types it provides for push
and pop
:
val push : 'a -> 'a t -> 'a t
val pop : 'a t -> 'a t
Both of those take a stack as an argument and return a new stack as a result. An
ephemeral data structure usually would not bother to return a stack. In Java,
for example, similar methods might have a void
return type; the equivalent in
OCaml would be returning unit
.
5.6.3. Options vs Exceptions#
All of our stack implementations so far have raised an exception whenever peek
or pop
is applied to the empty stack. Another possibility would be to use an
option
for the return value. If the input stack is empty, then peek
and
pop
return None
; otherwise, they return Some
.
module type Stack = sig
type 'a t
val empty : 'a t
val is_empty : 'a t -> bool
val push : 'a -> 'a t -> 'a t
val peek : 'a t -> 'a option
val pop : 'a t -> 'a t option
val size : 'a t -> int
val to_list : 'a t -> 'a list
end
module ListStack : Stack = struct
type 'a t = 'a list
exception Empty
let empty = []
let is_empty = function [] -> true | _ -> false
let push = List.cons
let peek = function [] -> None | x :: _ -> Some x
let pop = function [] -> None | _ :: s -> Some s
let size = List.length
let to_list = Fun.id
end
Show code cell output
module type Stack =
sig
type 'a t
val empty : 'a t
val is_empty : 'a t -> bool
val push : 'a -> 'a t -> 'a t
val peek : 'a t -> 'a option
val pop : 'a t -> 'a t option
val size : 'a t -> int
val to_list : 'a t -> 'a list
end
module ListStack : Stack
But that makes it harder to pipeline:
ListStack.(empty |> push 1 |> pop |> peek)
File "[5]", line 1, characters 11-33:
1 | ListStack.(empty |> push 1 |> pop |> peek)
^^^^^^^^^^^^^^^^^^^^^^
Error: This expression has type int ListStack.t option
but an expression was expected of type 'a ListStack.t
The types break down for the pipeline right after the pop
, because that
now returns an 'a t option
, but peek
expects an input that is merely
an 'a t
.
It is possible to define some additional operators to help restore the ability
to pipeline. In fact, these functions are already defined in the Option
module
in the standard library, though not as infix operators:
(* Option.map aka fmap *)
let ( >>| ) opt f =
match opt with
| None -> None
| Some x -> Some (f x)
(* Option.bind *)
let ( >>= ) opt f =
match opt with
| None -> None
| Some x -> f x
val ( >>| ) : 'a option -> ('a -> 'b) -> 'b option = <fun>
val ( >>= ) : 'a option -> ('a -> 'b option) -> 'b option = <fun>
We can use those as needed for pipelining:
ListStack.(empty |> push 1 |> pop >>| push 2 >>= pop >>| push 3 >>| to_list)
- : int list option = Some [3]
But it’s not so pleasant to figure out which of the three operators to use where.
There is therefore a tradeoff in the interface design:
Using options ensures that surprising exceptions regarding empty stacks never occur at run-time. The program is therefore more robust. But the convenient pipeline operator is lost.
Using exceptions means that programmers don’t have to write as much code. If they are sure that an exception can’t occur, they can omit the code for handling it. The program is less robust, but writing it is more convenient.
There is thus a tradeoff between writing more code early (with options) or doing
more debugging later (with exceptions). The OCaml standard library has recently
begun providing both versions of the interface in a data structure, so that the
client can make the choice of how they want to use it. For example, we could
provide both peek
and peek_opt
, and the same for pop
, for clients of
our stack module:
module type Stack = sig
type 'a t
exception Empty
val empty : 'a t
val is_empty : 'a t -> bool
val push : 'a -> 'a t -> 'a t
val peek : 'a t -> 'a
val peek_opt : 'a t -> 'a option
val pop : 'a t -> 'a t
val pop_opt : 'a t -> 'a t option
val size : 'a t -> int
val to_list : 'a t -> 'a list
end
module ListStack : Stack = struct
type 'a t = 'a list
exception Empty
let empty = []
let is_empty = function [] -> true | _ -> false
let push = List.cons
let peek = function [] -> raise Empty | x :: _ -> x
let peek_opt = function [] -> None | x :: _ -> Some x
let pop = function [] -> raise Empty | _ :: s -> s
let pop_opt = function [] -> None | _ :: s -> Some s
let size = List.length
let to_list = Fun.id
end
Show code cell output
module type Stack =
sig
type 'a t
exception Empty
val empty : 'a t
val is_empty : 'a t -> bool
val push : 'a -> 'a t -> 'a t
val peek : 'a t -> 'a
val peek_opt : 'a t -> 'a option
val pop : 'a t -> 'a t
val pop_opt : 'a t -> 'a t option
val size : 'a t -> int
val to_list : 'a t -> 'a list
end
module ListStack : Stack
One nice thing about this implementation is that it is efficient. All the
operations except for size
are constant time. We saw earlier in the chapter
that size
could be made constant time as well, at the cost of some extra space
— though just a constant factor more — by caching the size of the
stack at each node in the list.
5.6.4. Queues#
Queues and stacks are fairly similar interfaces. We’ll stick with exceptions instead of options for now.
module type Queue = sig
(** An ['a t] is a queue whose elements have type ['a]. *)
type 'a t
(** Raised if [front] or [dequeue] is applied to the empty queue. *)
exception Empty
(** [empty] is the empty queue. *)
val empty : 'a t
(** [is_empty q] is whether [q] is empty. *)
val is_empty : 'a t -> bool
(** [enqueue x q] is the queue [q] with [x] added to the end. *)
val enqueue : 'a -> 'a t -> 'a t
(** [front q] is the element at the front of the queue. Raises [Empty]
if [q] is empty. *)
val front : 'a t -> 'a
(** [dequeue q] is the queue containing all the elements of [q] except the
front of [q]. Raises [Empty] is [q] is empty. *)
val dequeue : 'a t -> 'a t
(** [size q] is the number of elements in [q]. *)
val size : 'a t -> int
(** [to_list q] is a list containing the elements of [q] in order from
front to back. *)
val to_list : 'a t -> 'a list
end
Show code cell output
module type Queue =
sig
type 'a t
exception Empty
val empty : 'a t
val is_empty : 'a t -> bool
val enqueue : 'a -> 'a t -> 'a t
val front : 'a t -> 'a
val dequeue : 'a t -> 'a t
val size : 'a t -> int
val to_list : 'a t -> 'a list
end
Important
Similarly to peek
and pop
, note how front
and dequeue
divide the
responsibility of getting the first element vs. getting all the rest of the
elements.
It’s easy to implement queues with lists, just as it was for implementing stacks:
module ListQueue : Queue = struct
(** The list [x1; x2; ...; xn] represents the queue with [x1] at its front,
followed by [x2], ..., followed by [xn]. *)
type 'a t = 'a list
exception Empty
let empty = []
let is_empty = function [] -> true | _ -> false
let enqueue x q = q @ [x]
let front = function [] -> raise Empty | x :: _ -> x
let dequeue = function [] -> raise Empty | _ :: q -> q
let size = List.length
let to_list = Fun.id
end
module ListQueue : Queue
But despite being as easy, this implementation is not as efficient as our
list-based stacks. Dequeueing is a constant-time operation with this
representation, but enqueueing is a linear-time operation. That’s because
dequeue
does a single pattern match, whereas enqueue
must traverse the
entire list to append the new element at the end.
There’s a very clever way to do better on efficiency. We can use two lists to
represent a single queue. This representation was invented by Robert Melville as
part of his PhD dissertation at Cornell (Asymptotic Complexity of Iterative
Computations, Jan 1981), which was advised by Prof. David Gries. Chris Okasaki
(Purely Functional Data Structures, Cambridge University Press, 1988) calls
these batched queues. Sometimes you will see this same implementation referred
to as “implementing a queue with two stacks”. That’s because stacks and lists
are so similar (as we’ve already seen) that you could rewrite pop
as
List.tl
, and so forth.
The core idea has a Part A and a Part B. Part A is: we use the two lists to
split the queue into two pieces, the inbox and outbox. When new elements are
enqueued, we put them in the inbox. Eventually (we’ll soon come to how) elements
are transferred from the inbox to the outbox. When a dequeue is requested, that
element is removed from the outbox; or when the front element is requested, we
check the outbox for it. For example, if the inbox currently had [3; 4; 5]
and
the outbox had [1; 2]
, then the front element would be 1
, which is the head
of the outbox. Dequeuing would remove that element and leave the outbox with
just [2]
. Likewise, enqueuing 6
would make the inbox become [3; 4; 5; 6]
.
The efficiency of front
and dequeue
is very good so far. We just have to
take the head or tail of the outbox, respectively, assuming it is non-empty.
Those are constant-time operations. But the efficiency of enqueue
is still
bad. It’s linear time, because we have to append the new element to the end of
the list. It’s too bad we have to use the append operator, which is inherently
linear time. It would be much better if we could use cons, which is constant
time.
So here’s Part B of the core idea: let’s keep the inbox in reverse order. For
example, if we enqueued 3
then 4
then 5
, the inbox would actually
be [5; 4; 3]
, not [3; 4; 5]
. Then if 6
were enqueued next, we could
cons it onto the beginning of the inbox, which becomes [6; 5; 4; 3]
. The
queue represented by inbox i
and outbox o
is therefore o @ List.rev i
.
So enqueue
can now always be a constant-time operation.
But what about dequeue
(and front
)? They’re constant time too, as long as
the outbox is not empty. If it’s empty, we have a problem. We need to transfer
whatever is in the inbox to the outbox at that point. For example, if the
outbox is empty, and the inbox is [6; 5; 4; 3]
, then we need to switch them
around, making the outbox be [3; 4; 5; 6]
and the inbox be empty. That’s
actually easy: we just have to reverse the list.
Unfortunately, we just re-introduced a linear-time operation. But with one
crucial difference: we don’t have to do that linear-time reverse on every
dequeue
, whereas with ListQueue
above we had to do the linear-time append on
every enqueue
. Instead, we only have to do the reverse on those rare occasions
when the outbox becomes empty.
So even though in the worst case dequeue
(and front
) will be linear time,
most of the time they will not be. In fact, later in this book when we study
amortized analysis we will show that in the long run they can be understood as
constant-time operations. For now, here’s a piece of intuition to support that
claim: every individual element enters the inbox once (with a cons), moves to
the outbox once (with a pattern match then cons), and leaves the outbox once (with
a pattern match). Each of those is constant time. So each element only ever
experiences constant-time operations from its own perspective.
For now, let’s move on to implementing these ideas. In the implementation, we’ll
add one more idea: the outbox always has to have an element in it, unless the
queue is empty. In other words, if the outbox is empty, we’re guaranteed the
inbox is too. That requirement isn’t necessary for batched queues, but it does
keep the code simpler by reducing the number of times we have to check whether a
list is empty. The tiny tradeoff is that if the queue is empty, enqueue
now
has to directly put an element into the outbox. No matter, that’s still a
constant-time operation.
module BatchedQueue : Queue = struct
(** [{o; i}] represents the queue [o @ List.rev i]. For example,
[{o = [1; 2]; i = [5; 4; 3]}] represents the queue [1, 2, 3, 4, 5],
where [1] is the front element. To avoid ambiguity about emptiness,
whenever only one of the lists is empty, it must be [i]. For example,
[{o = [1]; i = []}] is a legal representation, but [{o = []; i = [1]}]
is not. This implies that if [o] is empty, [i] must also be empty. *)
type 'a t = {o : 'a list; i : 'a list}
exception Empty
let empty = {o = []; i = []}
let is_empty = function
| {o = []} -> true
| _ -> false
let enqueue x = function
| {o = []} -> {o = [x]; i = []}
| {o; i} -> {o; i = x :: i}
let front = function
| {o = []} -> raise Empty
| {o = h :: _} -> h
let dequeue = function
| {o = []} -> raise Empty
| {o = [_]; i} -> {o = List.rev i; i = []}
| {o = _ :: t; i} -> {o = t; i}
let size {o; i} = List.(length o + length i)
let to_list {o; i} = o @ List.rev i
end
module BatchedQueue : Queue
The efficiency of batched queues comes at a price in readability. If we compare
ListQueue
and BatchedQueue
, it’s hopefully clear that ListQueue
is a
simple and correct implementation of a queue data structure. It’s probably far
less clear that BatchedQueue
is a correct implementation. Just look at how
many paragraphs of writing it took to explain it above!
5.6.5. Maps#
Recall that a map (aka dictionary) binds keys to values. Here is a module type for maps. There are many other operations a map might support, but these will suffice for now.
module type Map = sig
(** [('k, 'v) t] is the type of maps that bind keys of type ['k] to
values of type ['v]. *)
type ('k, 'v) t
(** [empty] does not bind any keys. *)
val empty : ('k, 'v) t
(** [insert k v m] is the map that binds [k] to [v], and also contains
all the bindings of [m]. If [k] was already bound in [m], that old
binding is superseded by the binding to [v] in the returned map. *)
val insert : 'k -> 'v -> ('k, 'v) t -> ('k, 'v) t
(** [lookup k m] is the value bound to [k] in [m]. Raises: [Not_found] if [k]
is not bound in [m]. *)
val lookup : 'k -> ('k, 'v) t -> 'v
(** [bindings m] is an association list containing the same bindings as [m].
The keys in the list are guaranteed to be unique. *)
val bindings : ('k, 'v) t -> ('k * 'v) list
end
module type Map =
sig
type ('k, 'v) t
val empty : ('k, 'v) t
val insert : 'k -> 'v -> ('k, 'v) t -> ('k, 'v) t
val lookup : 'k -> ('k, 'v) t -> 'v
val bindings : ('k, 'v) t -> ('k * 'v) list
end
Note how Map.t
is parameterized on two types, 'k
and 'v
, which are written
in parentheses and separated by commas. Although ('k, 'v)
might look like a
pair of values, it is not: it is a syntax for writing multiple type variables.
Recall that association lists are lists of pairs, where the first element of each pair is a key, and the second element is the value it binds. For example, here is an association list that maps some well-known names to an approximation of their numeric value:
[("pi", 3.14); ("e", 2.718); ("phi", 1.618)]
Naturally we can implement the Map
module type with association lists:
module AssocListMap : Map = struct
(** The list [(k1, v1); ...; (kn, vn)] binds key [ki] to value [vi].
If a key appears more than once in the list, it is bound to the
the left-most occurrence in the list. *)
type ('k, 'v) t = ('k * 'v) list
let empty = []
let insert k v m = (k, v) :: m
let lookup k m = List.assoc k m
let keys m = List.(m |> map fst |> sort_uniq Stdlib.compare)
let bindings m = m |> keys |> List.map (fun k -> (k, lookup k m))
end
module AssocListMap : Map
This implementation of maps is persistent. For example, adding a new
binding to the map m
below does not change m
itself:
open AssocListMap
let m = empty |> insert "pi" 3.14 |> insert "e" 2.718
let m' = m |> insert "phi" 1.618
let b = bindings m
let b' = bindings m'
val m : (string, float) AssocListMap.t = <abstr>
val m' : (string, float) AssocListMap.t = <abstr>
val b : (string * float) list = [("e", 2.718); ("pi", 3.14)]
val b' : (string * float) list = [("e", 2.718); ("phi", 1.618); ("pi", 3.14)]
The insert
operation is constant time, which is great. But the lookup
operation is linear time. It’s possible to do much better than that. In a later
chapter, we’ll see how to do better. Logarithmic-time performance is achievable
with balanced binary trees, and something like constant-time performance with
hash tables. Neither of those, however, achieves the simplicity of the code
above.
The bindings
operation is complicated by potential duplicate keys in the list.
It uses a keys
helper function to extract the unique list of keys with the
help of library function List.sort_uniq
. That function sorts an input list and
in the process discards duplicates. It requires a comparison function as input.
Note
A comparison function must return 0 if its arguments compare as equal, a positive integer if the first is greater, and a negative integer if the first is smaller.
Here we use the standard library’s comparison function Stdlib.compare
, which
behaves essentially the same as the built-in comparison operators =
, <
, >
,
etc. Custom comparison functions are useful if you want to have a relaxed notion
of what being a duplicate means. For example, maybe you’d like to ignore the
case of strings, or the sign of a number, etc.
The running time of List.sort_uniq
is linearithmic, and it produces a linear
number of keys as output. For each of those keys, we do a linear-time lookup
operation. So the total running time of bindings
is \(O(n \log n) + O(n) \cdot
O(n)\), which is \(O(n^2)\). We can definitely do better than that with
more advanced data structures.
Actually we can have a constant-time bindings
operation even with association
lists, if we are willing to pay for a linear-time insert
operation:
module UniqAssocListMap : Map = struct
(** The list [(k1, v1); ...; (kn, vn)] binds key [ki] to value [vi].
No duplicate keys may occur. *)
type ('k, 'v) t = ('k * 'v) list
let empty = []
let insert k v m = (k, v) :: List.remove_assoc k m
let lookup k m = List.assoc k m
let bindings m = m
end
module UniqAssocListMap : Map
That implementation removes any duplicate binding of k
before inserting
a new binding.
5.6.6. Sets#
Here is a module type for sets. There are many other operations a set data structure might be expected to support, but these will suffice for now.
module type Set = sig
(** ['a t] is the type of sets whose elements are of type ['a]. *)
type 'a t
(** [empty] is the empty set. *)
val empty : 'a t
(** [mem x s] is whether [x] is an element of [s]. *)
val mem : 'a -> 'a t -> bool
(** [add x s] is the set that contains [x] and all the elements of [s]. *)
val add : 'a -> 'a t -> 'a t
(** [elements s] is a list containing the elements of [s]. No guarantee
is made about the ordering of that list, but each element is guaranteed
to be unique. *)
val elements : 'a t -> 'a list
end
module type Set =
sig
type 'a t
val empty : 'a t
val mem : 'a -> 'a t -> bool
val add : 'a -> 'a t -> 'a t
val elements : 'a t -> 'a list
end
Here’s an implementation of that interface using a list to represent the set. This implementation ensures that the list never contains any duplicate elements, since sets themselves do not:
module UniqListSet : Set = struct
type 'a t = 'a list
let empty = []
let mem = List.mem
let add x s = if mem x s then s else x :: s
let elements = Fun.id
end
module UniqListSet : Set
Note how add
ensures that the representation never contains any duplicates, so
the implementation of elements
is easy. Of course, that comes with the
tradeoff of add
being linear time.
Here’s a second implementation, which permits duplicates in the list:
module ListSet : Set = struct
type 'a t = 'a list
let empty = []
let mem = List.mem
let add = List.cons
let elements s = List.sort_uniq Stdlib.compare s
end
module ListSet : Set
In that implementation, the add
operation is now constant time, and the
elements
operation is linearithmic time.