tdbm: a simple, high-performance database with nested atomic transactions

A brief introduction

tdbm is a dbm-type database [0, 1, 2] with nested atomic transactions. These kinds of lightweight databases are sometimes called embedded (or embeddable) databases because they must be used as a component of a bigger program, which can be as simple or complicated as required.

Dbm-type databases have a common core set of advantages:

When used in appropriate applications, they are a wonderful tool, which is why they are widely used behind the scenes in server software, web browsers, and a wide variety of applications. They are particularly well suited to managing indexes, which typically are comprised of relatively short key/value pairs, and caches. Easy-to-use interfaces to various dbm-type databases are provided by Perl, PHP, and many other languages.

Unlike the heavyweights, such as MySQL, PostgreSQL, Microsoft SQL Server, Oracle Database, IBM DB2, and other relational database management systems, dbm-type databases provide neither a sophisticated query language nor any management tools. If these capabilities are required, they are typically implemented for each application that embeds a database. Some dbm-type databases have limitations with respect to maximum key and data length that can make them unsuitable for particular applications, and many offer little in the way of concurrency control and support for multithreaded operation. Some do not handle deleting-while-traversing properly, and duplicate keys are sometimes not handled very well. For these reasons, a full-featured database may use a dbm-type database as a backend storage engine.

There are also non-dbm databases that can be embedded, such as SQLite and c-tree.

The dbm-type database most similar to tdbm is probably BerkeleyDB (bdb), which is perhaps also the best known and most widely used database of this kind. Although bdb is far more feature-rich than tdbm, it is also more difficult to learn and program, and it comes with different licensing terms.

I designed and implemented tdbm at around the same time as bdb was being created. Articles about both databases appeared in the same Usenix proceedings in 1992. tdbm was originally written to provide very fast data storage, access, and indexing for a multithreaded X.500 directory server. BerkeleyDB was acquired by Oracle Corporation from Sleepy Cat Software in 2006. tdbm has been used in several commercial products. The 1.2 version that is available here is from the code base released under a BSD-style license in 1994 by UBC (note: the contact information accompanying that version is obsolete).

Distinguishing features of tdbm

tdbm distinguishes itself from most dbm-type databases in three ways:

  1. If the application using tdbm crashes, the database contents are safe. This is an important characteristic because database files used by other dbm-type databases are typically fragile and can be irreparably corrupted by a crash; if the database contents cannot be regenerated from another data source or restored from a dump, there is a good chance they will be lost.
  2. It provides atomic transactions, meaning (among other things) that two or more pieces of data can be modified in an all-or-nothing manner. As a result, an application always sees consistent data, even if the application fails (intentionally, due to a bug, or while being debugged), or because of some kinds of computer crashes. Partially completed work does not need to be "undone" by an application when it resumes after abnormal termination.
  3. tdbm also supports multithreaded operation with nested atomic transactions - it does all the necessary buffer and lock management automatically. But because it was developed before POSIX threads became popular, locking and such in the 1.2 distribution relies on thread software that was specially developed and never officially distributed. So multiple simulataneous users have never been supported by the public version - but I think a commercial version may have done so.

Also, tdbm can also store very large data items, supports memory-resident (volatile or non-persistent) databases, and offers very competitive performance.

tdbm can provide correct atomic transactions in the face of application crashes, but not in the event of OS or storage crashes; that is, if an application that uses tdbm crashes for any reason but the OS and storage device (e.g., disk drive) continue to operate normally, then it will function as it should, provided its disk writes are eventually executed. But if the OS or storage device fail, tdbm cannot make any promises because it cannot know whether all the data it has written has actually reached the physical media, even if its API says it has - if all of its committed disk writes succeed before the crash, all will be well, otherwise one or more transactions may be lost. More about that in the next section.

Disk Write Caching!?

Does tdbm provide true atomic transaction semantics? It can, but only if the operating system, file system, device driver, and storage device provide tdbm with the necessary semantics.

Any database, not just tdbm, that tries to give correct transactional semantics on top of ordinary file systems with IDE/EIDE or SATA drives (and perhaps SCSI as well) has a major challenge to deal with. The problem is this: at some point in its execution, a database must be able to guarantee that a committed transaction has succeeded and no matter what happens next (short of storage device destruction), the transaction will not be reversed. This is the durability property (the "D" in ACID) of atomic transactions. But when disk write caching (of the write-back variety) is enabled on a drive, an OS (Linux, FreeBSD, Windows, and others) cannot make this guarantee unless it specifically addresses the issue. Disk write caching is important for disk performance, but at the cost of possibly losing data if a crash occurs [1].

The FreeBSD fsync(2) and open(2) system calls and Linux fsync(2) and fdatasync(2) system calls force data from OS buffers to disk drive buffers, but not necessarily to the physical disk media [2]. This despite the IEEE Std 1003.2 ("POSIX.2") specification of fsync(2) which clearly prescribes the behaviour that we selectively require. Depending upon who you ask, the responsibility lies with disk drive manufacturers or disk driver writers. Although this deficiency has been recognized for a very long time, apparently (at this writing, at least) no solution has been implemented on Linux or FreeBSD.

While the OS might carefully manage its disk buffers, a disk drive may effectively undo all those efforts and perform actual disk writes whenever it sees fit and in whatever order it likes. So even after fsync(2) or any other file system related system call succeeds and returns, a crash can still lose or corrupt data [3]. For this reason when disk write caching is enabled, correct atomic transactions cannot be guaranteed.

Disabling disk write caching may be possible on some drives [4], but that will affect the performance of all operations on the disk; if you can dedicate the disk to database use, that might fix the problem. Or, if you are willing to customize the disk driver, you may be able to force the drive to flush its buffers to the physical media. Using a UPS can help handle some failure modes. And of course using a different type of storage for the database is another solution, such as a solid-state drive or a device controller with a battery-backed write cache, as is a much more complicated, and probably slower, distributed approach. Recent versions of Linux apparently include support for write barriers for some file systems and storage devices; this mechanism ensures that all disk writes enqueued before the barrier entry are written to the media before the barrier entry is deemed to have been processed [5, 6]. With the Linux 3.1 kernel, barriers are enabled by default for the Ext3 filesystem. But can database software automatically determine whether a host provides the necessary support?

So the bottom line is that, in general, a database has no efficient and off-the-shelf way of knowing with certainty that the data it has asked to be written to disk is really safely there. If you want to understand this problem better, there are lots of discussions about it; here is a good place to start. Also see [7, 8, 9, 10, 11, 12, 13].

It is not only databases that are affected by this problem. Any program might well be broken if it performs an operation that causes a file system modification that the program expects will survive a crash. For example, a program that simply emits consecutive integers without repetition, across any program failure or OS reboot, could encounter this problem. This is because it must save state in the file system, but as already discussed this cannot be done reliably.

Dxstore

I have put a lot of work into a successor to tdbm called Dxstore. Although I believe it had a lot of good ideas, the project may have been too ambitious and probably suffered from second-system syndrome - its development was suspended a few years ago with several important pieces not yet implemented.

News

[Jul-07] A review of tdbm and dxstore is underway to decide whether development of one of them should be continued, and what ought to be included in the feature set and what should be left out. Any new development will most likely be released under a GPLv2 license with a commercial licensing alternative. Some work has been done on increasing the reliability of tdbm with respect to disk caching. Replication has also been studied. Are these features that people need? If you think a new, alternative dbm-type db would be swell, please drop me a message of encouragement.

Downloads

The release you will find here is now over 10 years old and certainly needs some work. It runs on many flavours of Unix, and probably also IEEE Std 1003.2 ("POSIX.2") compliant systems.

Related links

If you've used tdbm for anything, or if you have any comments, I'd like to hear from you.

Please contact me at: brachmanATdss.ca


divider
Last updated: 29-Jun-2011 © 1998-2011 DSS Distributed Systems Software Inc. dss@dss.ca