Versioned HDF5: A Python Library for Versioning Array Data

For quantitative research involving time series data, it is important to avoid relying on data available only at a later point in time (T2) when generating results for an earlier point in time (T1). The ability to time travel through versioned data sets is critical for accurate and reproducible results. Quansight Labs and the D. E. Shaw group collaborated to develop Versioned HDF5, which allows users to implement versioning within large data sets. In this piece, adapted from a series of Quansight blog posts, Quansight introduces Versioned HDF5 and discusses the design and performance of data libraries using this system.

Introducing Versioned HDF5

The problem of storing and manipulating large amounts of data is a challenge in many scientific computing and industry applications. One of the standard data models for this is HDF5, an open technology that implements a hierarchical structure (similar to a file-system structure) for storing large amounts of possibly heterogeneous data within a single file. Data in an HDF5 file is organized into groups and datasets; you can think about these as the folders and files in your local file system, respectively. You can also optionally store metadata associated with each item in a file, which makes this a self-describing and powerful data storage model.

Hierarchical Data Format (HDF5) Dataset (From https://www.neonscience.org/about-hdf5)Image: Hierarchical Data Format (HDF5) Dataset (From https://www.neonscience.org/about-hdf5)

Since reading and writing operations in these large data files must be fast, the HDF5 model includes data compression and chunking. This technique allows the data to be retrieved in subsets that fit the computer's memory or RAM, which means that it doesn't require the entire file contents to be loaded into memory at once. All this makes HDF5 a popular format in several domains, and with h5py it is possible to use a Pythonic interface to read and write data to a HDF5 file.

Now, let's say you have an HDF5 file with contents that change over time. You may want to add or remove datasets, change the contents of the data or the metadata, and keep a record of which changes occurred when, with a way to recover previous versions of this file. Since HDF5 is a binary file format, using regular version control tools (such as git) may prove difficult.

Introducing the Versioned HDF5 library

The Versioned HDF5 library is a versioned abstraction on top of h5py. Because of the flexibility of the HDF5 data model, all versioning data is stored in the file itself, which means that different versions of the same data (including version metadata) can be stored in a single HDF5 file.

To see how this works in practice, let's say we create a regular HDF5 file with h5py called mydata.h5.

>>> import h5py
>>> fileobject = h5py.File('mydata.h5', 'w')

Now, you can create a VersionedHDF5file object:

>>> from versioned_hdf5 import VersionedHDF5File
>>> versioned_file = VersionedHDF5File(fileobject)

This file still doesn't have any data or versions stored in it. To create a new version, you can use a context manager:

>>> with versioned_file.stage_version('version1') as group:
...     group['mydataset'] = np.ones(10000)

The context manager returns a h5py group object, which should be modified in-place to build the new version. When the context manager exits, the version will be written to the file. From this moment on, any interaction with the versioned groups and datasets should be done via the Versioned HDF5 API, rather than h5py.

Now, the versioned_file object can be used to expose versioned data by version name:

>>> v1 = versioned_file['version1']
>>> v1
<Committed InMemoryGroup "/_version_data/versions/version1">
>>> v1['mydataset']
<InMemoryArrayDataset "mydataset": shape (10000,), type "<f8">

To access the actual data stored in version version1, we use the same syntax as h5py:

>>> dataset = v1['mydataset']
>>> dataset[()]
array([1., 1., 1., ..., 1., 1., 1.])

Suppose now we want to commit a new version of this data, changing just a slice of the data. We can do this as follows:

>>> with versioned_file.stage_version('version2') as group:
...     group['mydataset'][0] = -10

Both versions are now stored in the file, and can be accessed independently.

>>> v2 = versioned_file['version2']
>>> v1['mydataset'][()]
array([1., 1., 1., ...,  1.,  1.,  1.])]
>>> v2['mydataset'][()]
array([-10., 1., 1., ...,  1.,  1.,  1.])]


Design of the Versioned HDF5 Library

In a , we introduced the Versioned HDF5 library and described some of its features. In this post, we'll go into detail on how the underlying design of the library works on a technical level.

Versioned HDF5 is a library that wraps h5py and offers a versioned abstraction for HDF5 groups and datasets. Versioned HDF5 works fundamentally as a copy-on-write system. The basic idea of copy-on-write is that all data is effectively immutable in the backend. Whenever a high-level representation of data is modified, it is copied to a new location in the backend, leaving the original version intact. Any references to the original will continue to point to it.

At a high level, in a versioned system built with copy-on-write, all data in the system is stored in immutable blobs in the backend. The immutability of these blobs is often enforced by making them content-addressable, where the blobs are always referred to in the system by a cryptographic hash of their contents. Cryptographic hashes form a mapping from any blob of data to a fixed set of bytes, which is effectively one-to-one, meaning if two blobs have the same hash, they must be exactly the same data. This means that it is impossible to mutate a blob of data in-place. Doing so would change its hash, which would make it a different blob, since blobs are referenced only by their hash.

Whenever data for a version is committed, its data is stored as blobs in the backend. It may be put into a single blob, or split into multiple blobs. If it is split, a way to reconstruct the data from the blobs is stored. If a later version modifies that data, any blobs that are different are stored as new blobs. If the data is the same, the blobs will also be the same, and hence will not be written to twice, because they will already exist in the backend.

At a high level, this is how the git version control system works, for example (this is a good talk on the internals of git). It is also how versioning constructs in some modern filesystems like APFS and Btrfs.

Versioned HDF5 Implementation

In Versioned HDF5, this idea is implemented using two key HDF5 primitives: chunks and virtual datasets.

In HDF5, datasets are split into multiple chunks. Each chunk is of equal size, which is configurable, although some chunks may not be completely full. A chunk is the smallest part of a dataset that HDF5 operates on. Whenever a subset of a dataset is to be read, the entire chunk containing that dataset is read into memory. Picking an optimal chunk size is a nontrivial task, and depends on things such as the size of your L1 cache and the typical shape of your dataset. Furthermore, in Versioned HDF5 a chunk is the smallest amount of data that is stored only once across versions if it has not changed. If the chunk size is too small, it would affect performance, as operations would require reading and writing more chunks, but if it is too large, it would make the resulting versioned file unnecessarily large, as changing even a single element of a chunk requires rewriting the entire chunk. Versioned HDF5 does not presently contain any logic for automatically picking a chunk size. The pytables documentation has some tips on picking an optimal chunk size.

Because chunks are the most basic HDF5 primitive, Versioned HDF5 uses them as the underlying blobs for storage. This way operations on data can be as performant as possible.

Virtual datasets are a special kind of dataset that reference data from other datasets in a seamless way. The data from each part of a virtual dataset comes from another dataset. HDF5 does this seamlessly, so that a virtual dataset appears to be a normal dataset.

The basic design of Versioned HDF5 is this: whenever a dataset is created for the first time (the first version containing the dataset), it is split into chunks. The data in each chunk is hashed and stored in a hash table. The unique chunks are then appended into a raw_data dataset corresponding to the dataset. Finally, a virtual dataset is made that references the corresponding chunks in the raw dataset to recreate the original dataset. When later versions modify this dataset, each modified chunk is appended to the raw dataset, and a new virtual dataset is created pointing to corresponding chunks.

For example, say we start with the first version, version_1, and create a dataset my_dataset with n chunks. The dataset chunks will be written into the raw dataset, and the final virtual dataset will point to those chunks.

If we then create a version version_2 based off version_1, and modify only data contained in CHUNK 2, that new data will be appended to the raw dataset, and the resulting virtual dataset for version_2 will look like this:

Since both versions 1 and 2 of my_dataset have identical data in chunks other than CHUNK 2, they both point to the exact same data in raw_data. Thus, the underlying HDF5 file only stores the data in version 1 of my_dataset once, and only the modified chunks from version_2's my_dataset are stored on top of that.

All extra metadata, such as attributes, is stored on the virtual dataset. Since virtual datasets act exactly like real datasets and operate at the HDF5 level, each version is a real group in the HDF5 file that is exactly that version. However, these groups should be treated as read-only, and you should never access them outside of the Versioned HDF5 API (see below).

HDF5 File Layout

Inside of the HDF5 file, there is a special _versioned_data group that holds all the internal data for Versioned HDF5. This group contains a versions group, which contains groups for each version that has been created. It also contains a group for each dataset that exists in a version. These groups each contain two datasets, hash_table, and raw_data.

For example, consider a Versioned HDF5 file that contains two versions, version1, and version2, with datasets data1 and data2. Suppose also that data1 exists in both versions and data2 only exists in version2. The HDF5 layout would look like this

/_versioned_data/
├── data1/
│   ├── hash_table
│   └── raw_data
│
├── data2/
│   ├── hash_table
│   └── raw_data
│
└── versions/
├── __first_version__/
│
├── version1/
│   └── data1
│
└── version2/
├── data1
└── data2

__first_version__ is an empty group that exists only for internal bookkeeping purposes.

The hash_table datasets store the hashes for each chunk of data, so that duplicate data will not be written twice, and the raw_data dataset stores the chunks for all versions of a given dataset. It is referenced by the virtual datasets in the corresponding version groups in versions/. For example, the chunks for the data data1 in versions version1 and version2 are stored in _versioned_data/data1/raw_data.

Versioned HDF5 API

The biggest challenge of this design is that the virtual datasets representing the data in each versioned data all point to the same blobs in the backend. However, in HDF5, if a virtual dataset is written to, it will write to the location it points to. This is at odds with the immutable copy-on-write design. As a consequence, Versioned HDF5 needs to wrap all the h5py APIs that write into a dataset to disallow writing for versions that are already committed, and to do the proper copy-on-write semantics for new versions that are being staged. Several classes that wrap the h5py Dataset and Group objects are present in the versioned_hdf5.wrappers submodule. These wrappers act just like their h5py counterparts, but do the right thing on versioned data. The top-level versioned_hdf5.VersionedHDF5File API returns these objects whenever a versioned dataset is accessed. They are designed to work seamlessly like the corresponding h5py objects.


Performance of the Versioned HDF5 Library

In several industry and science applications, a filesystem-like storage model such as HDF5 is the more appropriate solution for manipulating large amounts of data. However, suppose that data changes over time. In that case, it's not obvious how to track those different versions, since HDF5 is a binary format and is not well suited for traditional version control systems and tools.

In a , we introduced the Versioned HDF5 library, which implements a mechanism for storing binary data sets in a versioned way that feels natural to users of other version control systems, and described some of its features. In this post, we'll show some of the performance analysis we did while developing the library, hopefully making the case that reading and writing versioned HDF5 files can be done with a nice, intuitive API while being as efficient as possible. The tests presented here show that using the Versioned HDF5 library results in reduced disk space usage, and further reductions in this area can be achieved with the use of HDF5/h5py-provided compression algorithms. That only comes at a cost of <10x file writing speed.

What are we measuring?

Performance can mean different things for different operations. For the tests presented here, the main goals are:

  • To evaluate the performance (size on disk and I/O speed) of reading/writing versioned HDF5 files and compare it with non-versioned files (that is, files where only the last version of the datasets is stored);
  • To evaluate the performance when reading/writing data to a versioned HDF5 file in a set of different use cases;
  • To evaluate the performance when different parameter options are considered for chunking and compression on versioned HDF5 files.

When different versions of a dataset are stored in a versioned HDF5 file, modified copies of the data are stored as new versions. This means that there may be duplication of data between versions, which might impact the performance of reading, writing or manipulating these files.

In order to analyze this, we will consider test files created with a variable number of versions (or transactions) of a dataset consisting of three ndarrays of variable length. One test includes a two-dimensional ndarray as part of this dataset, but all other test cases consist of three one-dimensional ndarrays per dataset.

With each new version a certain number of rows are added, removed, and modified. For these tests, all changes are made to elements chosen according to a power law which biases the modifications towards the end of the array, simulating a possible use case of modifying more recent results in a given timeseries.

The tests are as follows:

  1. A large fraction of changes is made to the dataset with each new version: The dataset initially has three arrays with 5000 rows, and 1000 positions are chosen at random and changed, and a small number (at most 10) rows are added or deleted with each new version. We will refer to this test as test_large_fraction_changes_sparse.

  2. A small fraction of changes is made to the dataset with each new version: The dataset initially has three arrays with 5000 rows, but only 10 positions are chosen at random and changed, and a small number (at most 10) rows are added or deleted with each new version. We will refer to this test as test_small_fraction_changes_sparse.

  3. A large fraction of changes is made to the dataset with each version, with the same three arrays of 5000 rows defined initially, 1000 positions are chosen at random and changed, but the size of the final array remains constant (no new rows are added and no rows are deleted). We will refer to this test as test_large_fraction_constant_sparse.

In the next tests, the number of modifications is dominated by the number of appended rows. There are two such cases:

  1. In the first case, the dataset contains three one-dimensional arrays with 1000 rows initially, and 1000 rows are added with each new version. A small number (at most 10) values are chosen at random, following the power law described above, and changed or deleted. We call this test test_mostly_appends_sparse.

  2. In the second case, the dataset contains one two-dimensional array with shape (30, 30) and two one-dimensional arrays acting as indices to the 2d array. In this case, rows are only appended in the first axis of the two-dimensional array, and a small number of positions (at most 10) is chosen at random and changed. We call this test test_mostly_appends_dense.

To test the performance of the Versioned HDF5 library, we have chosen to compare a few different chunk sizes and compression algorithms. These values have been chosen arbitrarily, and optimal values depend on different use cases and on the nature of the datasets stored in the file.

File sizes

As the number of versions in a file grows, its size on disk is also expected to grow. However, it is reasonable to expect that the overhead of storing metadata for versioned files doesn't cause the file sizes to explode as the number of versions increases.

We'll start by analyzing how the HDF5 file sizes grow as the number of versions grows. Using a chunk size of 4096, we can see the following results for the 4 one-dimensional test cases:

File sizes for <code>versioned-hdf5</code> files

We can see from the figure that in test_large_fraction_constant_sparse case, writing 5000 versions of a 117KB array, which would take around 572MB in separate files, takes around 252MB - less than half the storage size. Note that the other examples the size of the arrays stored in the file also grow as the number of versions grows, since each new version is changing the original arrays by adding, deleting and changing values in the original arrays. Keep in mind there is redundant data as some of it is not changed during the staging of a new version but it is still being stored. It's also worth noting that for these tests we don't use compression, even though algorithms available in h5py can be used in Versioned HDF5. You can see the Versioned HDF5 documentation for more detailed tests regarding chunking and compression.

Creation times

If we look at the time spent creating the files for each example, comparing chunk sizes but not considering compression, we have something like this:

test1 test2
test3 test4

legend

Now, we can look at the time required to stage a new version in the file, that is, to add a new transaction. The graphs below show, for each fixed number of transactions, the time required to add new versions as the file is created. Note that the scales vary for each test.

test1 test2
test3 test4

It is clear that as the number of versions stored in the file increases, the times required to create versioned HDF5 files increases significantly when compared to regular HDF5 files. However, note that the increase is linear, consistent with what is expected from adding new versions to the file. For example, looking at test_large_fraction_constant_sparse, where the size of the arrays do not increase as new versions are added, choosing (again) a chunk size of 4096 means that writing each new version to file takes about 6-8x as much as in the unversioned case, with more than 50% savings on disk storage. Note that a larger chunk size may mean faster read and write times but can also mean larger file sizes if no compression is used, because of how Versioned HDF5 is designed. This is expected, since all chunks where data has been changed from one version to the next have to be stored. Also, in test_mostly_appends_sparse, where the size of the arrays stored in the file grow significantly with each new version, we can see a marked increase in the times required to stage new versions.

Finally, we'll look at a two-dimensional dataset. In this case, we have chosen different chunk sizes to test, considering that larger chunk sizes increase file sizes considerably.

filesizes_test5

creation_times

We can also look at the times required to create each new version and write it to file in the versioned and unversioned cases. This is shown in the image below (note the different axis scale from previous figures.)

write_versions

This test case is unique for a few reasons. First, having a two-dimensional dataset introduces new considerations, such as the number of rows being added in each axis. For this test case, we have only added (few) new rows to the first axis with each new version, and this might explain why we don't see an increase in the time required to write new versions to file as the number of versions grow. While these are preliminary tests, and multidimensional datasets are still experimental at this point in Versioned HDF5, we can see that there are no unexpected drops in performance and the results can generally be explained by the size of the data stored in each file.

I/O performance for versioned HDF5 files

First, we'll look at the time required to read data from all versions in a file, sequentially. To keep this test short, we'll only analyze the tests using no compression, and chunk size 16384 for the one-dimensional datasets and 256 for the two-dimensional dataset in test_mostly_appends_dense.

Plotting with respect to the number of versions, we get the following:

As expected, read times increase for files with a larger number of versions, but the growth is close to linear in all cases except for test_mostly_appends_sparse, where the large array sizes explain the larger read times.

Next, we'll compare the times necessary to read the latest version on each file. Because of how Versioned HDF5 is designed, this is the same as reading the last array stored in the HDF5 file. For each test, we made 20 different reads of the latest version in order to eliminate fluctuations generated by background processes or the OS (solid lines). We also compare these read times with the time required to read an unversioned file (which only stored the latest version - dashed black line).

In this case, we can see that on average, reading the latest version on a VersionedHDF5File is ~5x-10x slower than reading an unversioned file. Also, the time required to read the latest version from a versioned HDF5 file increases modestly with the number of versions stored in the file, except when the size of the array increases significantly with each new version.

Summary

These results show that the library behaves reasonably well without unexpected overhead. The reduction in file size comes at a moderate cost for file writing speed, and file reading speed for the latest version of the data is unaffected. The largest impact on I/O performance and storage is in fact explained by the size of the datasets stored in the file, and the Versioned HDF5 abstraction does not significantly reduce this performance.

Overall, the worst performance was observed for tests with large array sizes. This seems to show that the file sizes and I/O performance of versioned HDF5 files are significantly affected by the size of the unique datasets stored in each file, which is to be expected. Also, choosing the right chunk size parameter can have an impact on the performance of the library.

Next steps

versioned-hdf5 1.0 has recently been released, and is available on PyPI and conda-forge. You can install it with

conda install -c conda-forge versioned-hdf5

Development for versioned-hdf5 happens on GitHub. Currently, the library supports basic use cases, but there is still a lot to do. We welcome community contributions to the library, including any issues or feature requests.

For now, you can check out the documentation for more details on what is supported and how the library is built.

The Versioned HDF5 library was created by the D. E. Shaw group in conjunction with Quansight.


Disclosures

The contents of this post and any third-party content that is linked to or included in this post (“Linked Content”) are collectively referred to as the “Content.”

The Content is provided for informational purposes only. It does not convey an offer of any type and is not intended to be, and should not be construed as, an offer to sell, or the solicitation of an offer to buy, any interest in any entity, investment, or investment vehicle. The Content does not constitute investment advice. No assurances can be given that any aims, assumptions, expectations, and/or goals expressed or implied in the Content were or will be realized, or that the activities described in the Content have continued or will continue at all or in the same manner as is described.

The D. E. Shaw group, its affiliates and members, and any party or person associated with or acting on behalf of the foregoing shall have no liability for any errors (as a result of negligence or otherwise, to the fullest extent permitted by law in the absence of fraud) in the Content or its production or for the consequences of any party’s reliance on the Content. The D. E. Shaw group does not endorse any information or beliefs discussed in any Linked Content and makes no representation as to the accuracy or adequacy of any Linked Content.

All trademarks, logos, information, and photos are owned by D. E. Shaw & Co., L.P. or its affiliates or used with permission of the applicable third party. The reproduction or retransmission of the Content (other than Linked Content) is prohibited without the prior written consent of D. E. Shaw & Co., L.P.