8.6. Persistent Arrays#
OCaml’s built-in array
type offers constant-time get and set operations. But it is an ephemeral data structure. That leads to the question: could we have persistent arrays, and if so, how good of performance could we get?
Here is the interface we would like to implement:
module type PersistentArray = sig
type 'a t
(** The type of persistent arrays whose elements have type ['a]. The
array indexing is zero based, meaning that the first element is at
index [0], and the last element is at index [n - 1], where [n] is
the length of the array. Any index less than [0] or greater
than [n - 1] is out of bounds. *)
val make : int -> 'a -> 'a t
(** [make n x] is a persistent array of length [n], with each element
initialized to [x]. Raises [Invalid_argument] if [n] is negative
or too large for the system. *)
val length : 'a t -> int
(** [length a] is the number of elements in [a]. *)
val get : 'a t -> int -> 'a
(** [get a i] is the element at index [i] in [a]. Raises [Invalid_argument]
if [i] is out of bounds. *)
val set : 'a t -> int -> 'a -> 'a t
(** [set a i x] is an array that stores [x] at index [i] and otherwise is
the same as [a]. Raises: [Invalid_argument] if [i] is out of bounds. *)
end
module type PersistentArray =
sig
type 'a t
val make : int -> 'a -> 'a t
val length : 'a t -> int
val get : 'a t -> int -> 'a
val set : 'a t -> int -> 'a -> 'a t
end
That interface essentially the same as OCaml’s Array
module, but reduced to just four essential functions.
We’ll next see three implementations of that interface. Each will achieve progressively better performance. Our final implementation is based on Conchon and Filliâtre (2007); see the end of this section for citations.
8.6.1. Copy-On-Set Arrays#
The easiest implementation of persistent arrays is to just make a copy of the entire array on each set
operation. That way, old copies of the array persist—they are not changed by later set
operations. Here is the implementation:
module CopyOnSetArray : PersistentArray = struct
type 'a t = 'a array
(** Efficiency: O(n). *)
let make = Array.make
(** Efficiency: O(1). *)
let length = Array.length
(** Efficiency: O(1). *)
let get = Array.get
(** Efficiency: O(n). *)
let set a i x =
let a' = Array.copy a in (* copy the array *)
a'.(i) <- x; (* mutate one element *)
a'
end
module CopyOnSetArray : PersistentArray
We pay a large price for set
in this implementation: instead of \(O(1)\) like array
, it is \(O(n)\). We’ll improve on that in our next implementation.
Note
Before moving on, let’s pause and consider one aspect of CopyOnSetArray
that is different than any data structure we’ve seen before. The interface it satisfies is written in the functional style, but the implementation uses imperative features. Note that the set
operation does not return unit
; instead, it produces a new array out of an old array. Underneath the hood, it achieves that persistence despite using an ephemeral data structure, the array
. A lesson we can learn is that it is possible to build persistent data structures in imperative languages using the features they provide.
8.6.2. Version-Tree Arrays#
The expensive set
operation in copy-on-set is doing way too much work. Even though only one array element is being changed, all of the array elements are copied. To achieve better performance, we could eliminate the copying and instead keep a little log of all the set
operations that have occurred. Each entry in the log would need to record just one change that has been made. For example, suppose we had the following sequence of operations:
let a0 = make 3 0
let a1 = set a0 1 7
let a2 = set a1 2 8
let a3 = set a1 2 9
Our log would say:
a0
is[|0; 0; 0|]
. That takes linear space to store.a1
is the same asa0
except that index1
is7
. That takes only constant space to store.a2
is the same asa1
except that index2
is8
. Again, constant space.a3
is the same asa1
except that index2
is9
. Again, constant space.
Note how both a2
and a3
share the same base array a1
, but diverge in different ways by setting index 2
to either 8
or 9
. So the structure of this log is best represented not as a list, but as a tree:
(entry for a0)
|
(entry for a1)
/ \
(entry for a2) (entry for a3)
Each of the nodes in that tree tell us what different versions of the array should be. That idea leads us to introduce the following tree type:
type 'a version_tree =
| Base of 'a array
| Diff of int * 'a * 'a version_tree
type 'a version_tree = Base of 'a array | Diff of int * 'a * 'a version_tree
A Base
node contains the base version of the array; in our example above, that would be a0
. A Diff
node records a difference that is created by a set
operation. The version tree for our example above would look like this:
(Base [|0; 0; 0|]) <---- a0
|
(Diff 1 7) <---- a1
/ \
a2 ----> (Diff 2 8) (Diff 2 9) <---- a3
The tree itself is shown in the middle of that diagram. The pointers for a0
through a3
are to individual nodes of that tree and capture each of the persistent versions of the array.
Tip
In understanding this version tree type, there is a possibly helpful analogy to association lists. Each of the Diff
nodes is like a cons cell in an association list: both contain a pair of a key (the array index) and a value (the element at that index). Moreover each cons cell then continues onto another cons cell until eventually reaching the the base case of nil, just like a version tree eventually reaches the base case of a Base
node. The Base
node is a little more interesting than the empty list though, because it contains an entire array.
Set. To implement set a i x
, all we have to do is create a new Diff
node:
let set a i x = Diff (i, x, a)
val set : 'a version_tree -> int -> 'a -> 'a version_tree = <fun>
The efficiency of that is merely \(O(1)\), which is a great improvement over copy-on-set arrays.
Get. To implement get a i
, we just have to walk a path in the tree starting at a
to find the first Diff
that records a change in index i
. If there never is such a change found, we can return the value at that index in the Base
node:
let rec get a i =
match a with
| Base b -> b.(i)
| Diff (j, v, a') ->
if i = j then v else get a' i
val get : 'a version_tree -> int -> 'a = <fun>
The efficiency of that implementation is \(O(k)\), where \(k\) is the number of set
operations that have been performed on the persistent array. In the worst case, all of those set
operations were performed on a different index than we want to get
, and they were all performed without creating any branches in the tree — for example, make 2 0 |> set 0 1 |> set 0 2 |> ... |> set 0 k |> get 1
. Then the tree degenerates to a linked list, and get
is forced to walk down the entire list to the Base
node to find the original value.
Note that \(k\) might be much bigger or much smaller than \(n\), the size of the array. So in improving the performance of set
, we potentially worsened the performance of get
. In our next implementation, we’ll return to this issue and improve the performance of get
.
Version-tree implementation. Pulling it all together, here is an implementation of persistent arrays using version trees:
module VersionTreeArray : PersistentArray = struct
(** AF: A rep such as [Diff (i, x, Diff (j, y, Base a))] represents the
array [a] except element [j] is [y] and element [i] is [x]. If there
are multiple diffs for an index, the outermost wins. E.g.,
[Diff (i, x, Diff (i, y, Base a))] represents an array whose
element [i] is [x], not [y]. *)
type 'a t =
| Base of 'a array
| Diff of int * 'a * 'a t
(** Efficiency: O(n). *)
let make n x = Base (Array.make n x)
(** Efficiency: O(k), where k is the number of [set] operations that have
been performed on the array. *)
let rec length = function
| Base b -> Array.length b
| Diff (_, _, a) -> length a
(** Efficiency: O(k). *)
let rec get a i =
match a with
| Base b -> b.(i)
| Diff (j, x, a') ->
if i = j then x else get a' i
(** Efficiency: O(1). *)
let set a i x = Diff (i, x, a)
end
module VersionTreeArray : PersistentArray
8.6.3. Rebasing Version-Tree Arrays#
With version trees, we achieved constant-time set
operations for a persistent array, but get
operations were no longer constant time — they were \(O(k)\) where \(k\) was the number of set
operations that had been perfromed. Is there a way to make both operations constant time while still being persistent? The answer is: almost!
Right now, the original version of the array as stored in the Base
node is primary and always has constant-time access with get
. We are going to introduce a new rebasing operation that can make any other version of the array become primary. Then that version will have constant-time access with get
, even though other versions will still be \(O(k)\). When a version \(v\) of an array is accessed (either with length
or set
), we will mutate the version tree to make \(v\) primary. That means changing the Base
node to store the contents according to \(v\), and adjusting Diff
nodes as needed to compensate for that change in the base.
Since we will now be mutating trees, the representation type needs to change to add a level of indirection:
type 'a rebasing_tree = 'a node ref
and 'a node =
| Base of 'a array
| Diff of int * 'a * 'a rebasing_tree
type 'a rebasing_tree = 'a node ref
and 'a node = Base of 'a array | Diff of int * 'a * 'a rebasing_tree
Make and set. The make
and set
operations require just the insertion of a call to ref
to adapt to the new representation type:
let make n x = ref (Base (Array.make n x))
let set a i x = ref (Diff (i, x, a))
val make : int -> 'a -> 'a node ref = <fun>
val set : 'a rebasing_tree -> int -> 'a -> 'a node ref = <fun>
Their efficiency is unchanged: \(O(n)\) for make
, and \(O(1)\) for set
.
Rebase. To rebase a tree we introduce a new recursive helper function rebase : 'a rebasing_tree -> 'a rebasing tree
such that rebase t
mutates tree t
to make the version of the array represented by t
be primary. Therefore, when rebase
returns t
is guaranteed to be a ref to a Base
node.
There are two cases to consider. First, if t
is a ref to a Base
node, then our work is already done. That version is already primary.
Second, if t
is a ref to a Diff
node, then we have work to do. That node has the form Diff(i, x, t')
where t'
represents another version of the array. We call rebase
on t'
to make it primary; after that, t'
must be a ref to a Base
node. That means we have the following situation, where a
is an array:
t ---> Diff (i, x, t')
t' ---> Base a
To make t
primary, we just need to swap those nodes around, as well as swap the values found at index i
. Suppose that in the Base
node a.(i)
is y
. We mutate a.(i)
to be x
; let’s call that array a'
. And we create a new Diff
node with y
:
t ---> Base a'
t' ---> Diff (i, y, t)
Now t
is primary and represents the same version of the array as it did before the rebase.
Here is the code that implements rebasing:
let rec rebase a =
match !a with
| Base b -> b
| Diff (i, x, a') ->
let b = rebase a' in
let old_x = b.(i) in
b.(i) <- x;
a := Base b;
a' := Diff (i, old_x, a);
b
val rebase : 'a rebasing_tree -> 'a array = <fun>
The efficiency of rebase
is \(O(k)\), because there could be up to \(k\) Diff
nodes between t
and its original Base
. Immediately after calling rebase t
, any further calls to rebase t
will be only \(O(1)\), because t
is now primary.
Get and length. Now that we have rebase
, there is only a small modification needed to length
and get
to call rebase
:
let length a = Array.length (rebase a)
let get a i = (rebase a).(i)
val length : 'a rebasing_tree -> int = <fun>
val get : 'a rebasing_tree -> int -> 'a = <fun>
The efficiency of these is the same as rebase
: \(O(k)\) on the first call, then \(O(1)\) on future calls as long as no other version of the array has been accessed in the mean time.
Rebasing version-tree implementation: Pulling it all together, here is an implementation of persistent arrays using rebasing version trees:
module RebasingVersionTreeArray : PersistentArray = struct
type 'a t = 'a node ref
(** See [VersionTreeArray]. *)
and 'a node =
| Base of 'a array
| Diff of int * 'a * 'a t
(** Efficiency: O(n). *)
let make n x = ref (Base (Array.make n x))
(** Efficiency: O(k), where k is the number of diffs between this version
of the array and its base. At most, that is the number of [set]
operations performed on all versions of the array. If there aren't
any diffs, O(1). After a [rebase], remains O(1) until a different
version of the array is accessed. Not tail recursive. *)
let rec rebase a =
match !a with
| Base b -> b
| Diff (i, x, a') ->
let b = rebase a' in
let old_x = b.(i) in
b.(i) <- x;
a := Base b;
a' := Diff (i, old_x, a);
b
(** Efficiency: Same as [rebase]. *)
let length a = Array.length (rebase a)
(** Efficiency: Same as [rebase]. *)
let get a i = (rebase a).(i)
(** Efficiency: O(1). *)
let set a i v = ref (Diff (i, v, a))
end
module RebasingVersionTreeArray : PersistentArray
With rebasing we now have \(O(1)\) get
and set
operations on the primary (most recently accessed version) of the array. We pay an \(O(k)\) price to change to a new primary version, but after that, access to it remains \(O(1)\) until another rebase occurs.
8.6.4. Citations#
Our final implementation is most similar to that of Sylvain Conchon and Jean-Christophe Filliâtre (A persistent union-find data structure, ACM Workshop on ML, 2007). They credit Henry Baker for the idea of version trees (Shallow binding in LISP 1.5, CACM 21:7, 1978) and rebasing (Shallow binding makes functional arrays fast, SIGPLAN Not., 26(8):145-147, 1991).