tdbm: a simple, high-performance database with nested atomic transactions

A brief introduction

tdbm is a dbm-type database [0, 1, 2] with nested atomic transactions. These kinds of lightweight databases are sometimes called embedded or embeddable databases because they must be used as a component of a bigger program, which can be as simple or complicated as required.

Dbm-type databases have a common core set of advantages:

When used in appropriate applications, they are a wonderful tool, which is why they are widely used behind the scenes by server software, web browsers, and a wide variety of applications. They are particularly well suited to managing indexes, which typically are comprised of relatively short key/value pairs, or being used as a cache. Easy-to-use interfaces to various dbm-type databases are provided by Perl, PHP, and many other languages.

Unlike the heavyweights, such as MySQL, PostgreSQL, Microsoft SQL Server, Oracle Database, IBM DB2, and other relational database management systems, dbm-type databases provide neither a sophisticated query language nor any management tools. If these capabilities are required, they are typically implemented for each application that embeds a database. Some dbm-type databases have limitations with respect to maximum key and data length that can make them unsuitable for particular applications, and many offer little in the way of concurrency control and support for multithreaded operation. Duplicate keys are sometimes not handled very well. For these reasons, sometimes a "big guy" may use a dbm-type database as a backend.

There are also non-dbm databases that can be embedded, such as SQLite and c-tree.

The dbm-type database most similar to tdbm is probably BerkeleyDB (bdb), which is perhaps also the best known and most widely used database of this kind. Although bdb is far more feature-rich than tdbm it is also more difficult to learn and program, and it comes with different licensing terms.

I designed and implemented tdbm at around the same time as bdb was being created. It was originally written to provide very fast data storage, access, and indexing for a multithreaded X.500 directory server. BerkeleyDB was acquired by Oracle Corporation in 2006. tdbm has been used in several commercial products. The 1.2 version that is here is from the code base released under a BSD-style license in 1994 by UBC (note: the contact information there is incorrect).

tdbm

tdbm distinguishes itself from most dbm-type databases in three ways:

  1. If the application software crashes, data will not be scrambled. This is an important characteristic because dbm-type files are typically fragile and can be irreparably corrupted by a crash; if the database contents cannot be regenerated from another data source or restored from a dump, there is a good chance they will be lost.
  2. It provides atomic transactions, meaning (among other things) that two or more pieces of data can be modified in an all-or-nothing manner. As a result, an application always sees consistent data, even if the application fails (intentionally, due to a bug, or while being debugged), or because of some kinds of computer crashes.
  3. tdbm also supports multithreaded operation with nested atomic transactions - it does all the necessary buffer and lock management automatically. But because it was developed before POSIX threads became popular, locking and such in the 1.2 distribution relies on thread software that was specially developed and never officially distributed. So multiple simulataneous users have never been supported by the public version - but I think a commercial version may have done so.
Also, tdbm can also store very large data items, supports memory-resident databases, and offers very competitive performance.

tdbm can provide correct atomic transactions in the face of application crashes, but not OS or disk crashes; that is, if an application that uses tdbm crashes for any reason but the OS and disk drive continue to operate normally, then it will function as it should, provided its disk writes are eventually executed. But if the OS or disk drive fail, tdbm cannot make any promises because it cannot know whether all the data it has written has actually reached the physical media, even if its API says it has - if all of its committed disk writes succeed before the crash all will be well, otherwise one or more transactions may be lost.

Disk Write Caching!?

Does tdbm provide true atomic transaction semantics? It can, but only if the operating system, file system, device driver, and storage device provide tdbm with the required API semantics.

Any database, not just tdbm, that tries to give correct transactional semantics on top of ordinary file systems with IDE drives (and perhaps SCSI as well) has a major challenge to deal with. The problem is this: at some point in its execution, a database must be able to guarantee that a committed transaction has succeeded and no matter what happens next (short of disk drive destruction), the transaction will not be reversed. But when disk write caching is enabled on a drive, an OS (Linux, FreeBSD, Windows, and others) cannot make this guarantee unless it specifically addresses the issue. Disk write caching helps to improve disk performance, but at the cost of possibly losing data if a crash occurs [1].

The FreeBSD fsync(2) and open(2) system calls and Linux fsync(2) and fdatasync(2) system calls force data from OS buffers to disk drive buffers, but not necessarily to the physical disk media [2]. This despite the IEEE Std 1003.2 ("POSIX.2") specification of fsync(2) which clearly prescribes the behaviour that we selectively require. Depending upon who you ask, the responsibility lies with disk drive manufacturers or disk driver writers. Although this deficiency has been recognized for a very long time, apparently no solution has been implemented on Linux or FreeBSD.

Although the OS might carefully manage its disk buffers, the disk drive may effectively undo all those efforts and perform actual disk writes whenever it sees fit. So even after fsync(2) or any other file system related system call succeeds and returns, a crash can still lose or corrupt data [3]. For this reason when disk write caching is enabled, correct atomic transactions are just not possible.

Disabling disk write caching may be possible on some drives, but that will affect the performance of all operations on the disk; if you can dedicate the disk to database use, that might fix the problem. Or, if you are willing to customize the disk driver, you may be able to force the drive to flush its buffers to the physical media. Using a UPS can help handle some failure modes. And of course using a different type of memory device for storing a database is another solution (such as Flash memory), as is a much more complicated distributed approach. Recent versions of Linux apparently include support for "write barriers" for some file systems and store devices; this mechanism ensures that all disk writes enqueued before the barrier entry are written to the media before the barrier entry is deemed to have been processed [4]. Can database software, for example, automatically determine whether a host provides the necessary support?

So the bottom line is that, in general, a database has no efficient and off-the-shelf way of knowing with certainty that the data it has asked to be written to disk is really safely there. If you want to understand this problem better, there are lots of discussions about it; here is a good place to start. Also see [5, 6, 7].

It is not only databases that are affected by this problem. Any program might well be broken if it performs an operation that causes a file system modification that the program expects will survive a crash. For example, a program that simply emits consecutive integers without repetition, across any program failure or OS reboot, could encounter this problem. This is because it must save state in the file system, but as already discussed this cannot be done reliably.

Dxstore

I have put a lot of work into a successor to tdbm called Dxstore. Although I believe it had a lot of good ideas, the project may have been too ambitious and probably suffered from second-system syndrome - its development was suspended a few years ago with several important pieces not yet implemented.

News

[Jul-07] A review of tdbm and dxstore is underway to decide whether development of one of them should be continued, and what ought to be included in the feature set and what should be left out. Any new development will most likely be released under a GPLv2 license with a commercial licensing alternative. Some work has been done on increasing the reliability of tdbm with respect to disk caching. Replication has also been studied. Are these features that people need? If you think a new, alternative dbm-type db would be swell, please drop me a message of encouragement.

Downloads

The release you will find here is now over 10 years old and certainly needs some work. It runs on many flavours of Unix, and probably also IEEE Std 1003.2 ("POSIX.2") compliant systems.

Related links

If you've used tdbm for anything, or if you have any comments, I'd like to hear from you.

Please contact me at: brachmanATdss.ca


© 1998-2008 DSS Distributed Systems Software Inc.
dss@dss.ca