tdbm - a simple, high-performance database with nested atomic transactions

tdbm: a simple, high-performance database with nested atomic transactions

A brief introduction

tdbm is a dbm-type database [1, 2] with nested atomic transactions. These kinds of lightweight, key-value databases are sometimes called embedded (or embeddable) databases because they must be used as a component of a bigger program, which can be as simple or complicated as required. Also known as a NoSQL database, because there may be a featureful API, there is a very simple data model and no query language in the traditional sense.

Dbm-type databases have a set of common core advantages:

they are about as easy as possible for programmers to learn and use;
they offer excellent performance and storage efficiency (very nearly optimal) for the low-level operations they provide;
they have very small memory footprints;
databases are stored in ordinary files and automatically grow as necessary:
- no preallocation of disk space is required;
- there is no a priori maximum database size, modulo build-time and run-time limits, some of which may be configurable;
- they can be moved to new storage devices easily;
- ordinary file management commands can be used to secure them, back them up, duplicate them, etc.
their simple APIs, which often consist of only a dozen or so major functions, can readily be made available through virtually any programming language (C/C++, Perl, Tcl, Java, Python, etc.); and
they require minimal human administration to set up and maintain.

When used in appropriate applications, they are a wonderful tool, which is why they are widely used behind the scenes in server software, web browsers, and a wide variety of applications. They are particularly well suited to managing indexes, which typically are comprised of relatively short key/value pairs, and caches. Easy-to-use interfaces to various dbm-type databases are provided by Perl, PHP, and many other languages.

Unlike the heavyweights, such as MySQL, PostgreSQL, Microsoft SQL Server, Oracle Database, IBM DB2, and other relational database management systems, dbm-type databases provide neither a sophisticated query language nor any management tools. If these capabilities are required, they are usually implemented for each application that embeds a database. Some dbm-type databases have limitations with respect to maximum key and data length that can make them unsuitable for particular applications, and many offer little in the way of concurrency control and support for multithreaded operation. Some do not handle deleting-while-traversing properly, and duplicate keys are sometimes not handled very well. For these reasons, a full-featured database may use a dbm-type database as a backend storage engine.

Some non-dbm databases can also be embedded, such as SQLite and c-tree.

The dbm-type database most similar to tdbm is probably BerkeleyDB (bdb), which is perhaps also the best known and most widely used database of this kind. Although bdb is far more feature-rich than tdbm, it is also more difficult to learn and program, and it comes with different licensing terms (as of version 6.x, Oracle switched to the GNU AFFERO GENERAL PUBLIC LICENSE).

I designed and implemented tdbm at around the same time as bdb was being created. Articles about both databases appeared in the same Usenix proceedings in 1992. tdbm was originally written to provide very fast data storage, access, and indexing for a multithreaded X.500 directory server. BerkeleyDB was acquired by Oracle Corporation from Sleepy Cat Software in 2006. tdbm has been used in several commercial products. The 1.2 version that is available here is from the code base released under a BSD-style license in 1994 by UBC (note: the contact information accompanying that version is obsolete).

Distinguishing features of tdbm

tdbm distinguishes itself from most dbm-type databases in three ways:

If the application using tdbm crashes, the database contents are safe. This is an important characteristic because database files used by other dbm-type databases are typically fragile and can be irreparably corrupted by a crash; if the database contents cannot be regenerated from another data source or restored from a dump, there is a good chance they will be lost.
It provides atomic transactions, meaning (among other things) that two or more pieces of data can be modified in an all-or-nothing manner. As a result, an application always sees consistent data, even if the application fails (intentionally, due to an application bug, or while being debugged), or because of some kinds of system crashes. Partially completed work does not need to be "undone" by an application when it resumes after abnormal termination.
tdbm also supports multithreaded operation with nested atomic transactions - it does all the necessary buffer and lock management automatically. But because it was developed before POSIX threads became popular, locking and such in the 1.2 distribution relies on thread software that was specially developed and never officially distributed. So multiple simulataneous users have never been supported by the public version - but I think a commercial version may have done so.

Also, tdbm can store very large data items, supports memory-resident (volatile or non-persistent) databases, and offers very competitive performance.

tdbm can provide correct atomic transactions (for brevity, we'll just call them transactions from now on) in the face of application crashes, but not in the event of OS or storage crashes; that is, if an application that uses tdbm crashes for any reason and the OS and storage device (e.g., disk drive) continue to operate normally, then it will function as it should, provided any pending disk writes are eventually executed. But if the OS or storage device fail, tdbm cannot make any promises because it cannot know whether all the data it has written has actually reached the physical media, even if its API says it has - if all of its committed disk writes succeed before the crash, all will be well, otherwise one or more transactions may be lost. More about that in the next section.

Disk Write Caching!?

Does tdbm provide true atomic transaction semantics? It can, but only if the operating system, file system, device driver, and storage device provide tdbm with the necessary semantics.

Any database, not just tdbm, that tries to give correct transactional semantics on top of ordinary file systems with IDE/EIDE or SATA drives (and perhaps SCSI as well) has a major challenge to deal with. The problem is this: at some point in its execution, a database must be able to guarantee that a committed transaction has succeeded and no matter what happens next, short of storage device destruction, the transaction will not be reversed. This is the durability property (the "D" in ACID) of transactions. But when disk write caching (of the write-back variety) is enabled on a drive, an OS (Linux, FreeBSD, Windows, and others) cannot make this guarantee unless it specifically addresses the issue. Disk write caching is important for disk performance, but comes at the cost of possibly losing data if a crash occurs [HOW TO: Manually Turn Disk Write Caching On or Off].

The FreeBSD fsync(2) and open(2) system calls and Linux fsync(2), fdatasync(2), and open(2) system calls force data from OS buffers to disk drive buffers. Until recently (maybe), this did not necessarily mean that blocks had actually been written to the physical disk media [Re: scsi vs ide performance on fsync's]. This despite the IEEE Std 1003.2 ("POSIX.2") specification of fsync(2) which clearly prescribes the behaviour that we selectively require. Depending upon who you ask, the responsibility lies with disk drive manufacturers or disk driver writers. This deficiency had been recognized for a very long time.

The syncer kernel process (syncer(4)) in recent versions of FreeBSD allow various buffer flushing delays to be tuned through kernel variables, potentially reducing windows of vulnerability. In Linux, the pdflush kernel thread performs a similar function and is similarly tunable (see The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads).

While the OS might carefully manage its disk buffers, a disk drive may effectively undo these efforts and perform actual disk writes whenever it sees fit and in whatever order it likes. So even after close(2) or any other file system related system call succeeds and returns, a crash can still lose or corrupt data [scsi vs ide performance on fsync's]. For this reason, when disk write caching is enabled and no special support is available, correct transactions cannot be guaranteed.

Disabling disk write caching may be possible on some drives [SATA Disks that handle write caching properly?], but that will affect the performance of all operations on the disk; if you can dedicate the disk to database use, that might fix the problem. Or, if you are willing to customize the disk driver, you may be able to force the drive to flush its buffers to the physical media. A UPS can help handle some failure modes. Using a different type of storage device for the database, such as a solid-state drive or a device controller with a battery-backed write cache, may be another solution, provided its behaviour is well understood.

Versions of Linux (2.4?) included support for write barriers for some file systems and storage devices (implementation methods are discussed in barrier.txt, July, 2005). This mechanism ensures that all disk writes enqueued before the barrier entry are written to the media before the barrier entry is deemed to have been processed [Does your device honor write barriers?], [That massive filesystem thread]. With the Linux 3.1 kernel, barriers are enabled by default for the Ext3 filesystem and can be explicitly enabled or disabled using a mount option, although issues remain. But can database software reliably and automatically determine whether its platform provides the necessary support (even if it is able to determine at runtime whether barriers are enabled)?

Most recently, the required functionality is purportedly obtained (where supported by hardware) through the NCQ Forced Unit Access (FUA) command, apparently replacing write barriers in Linux [Explicit volatile write back cache control, Remove mentioning of block barriers]. But the current Linux (2.6) manual page for fsync(2) continues to state that if "the underlying hard disk has write caching enabled, then the data may not really be on permanent storage when fsync() / fdatasync() return". The most recent FreeBSD manual page for fsync(2) does not mention caching at all [see the thread How to disable hard disk write cache?].

So maybe the problem has been properly solved, or maybe not. There does not appear to be a mechanism available to a Linux or FreeBSD application to query the system (e.g., via sysctl()) about whether flushing a file descriptor via fsync(2) results in the data being written to physical media on a given file system.

Schemes can be developed that rely on checksumming disk-based transaction information (e.g., UNDO or REDO logs), or that write transaction information to reliable persistent storage, but they must have an infallible method of determining when it is safe to delete the information. And because a failure can also occur during the recovery procedure, recovery actions may not be durable, implying that it may never be 100% safe to delete the transaction information. In practice, isn't a crash or sequence of crashes fairly likely during a system reboot, during which time recovery processing would be in progress for server applications?

A much more complicated, probably slower, and more expensive distributed solution may be built. Though it can improve the probability of durability, since the disk storage of one or more nodes may fail at an inopportune moment, it still cannot offer a guarantee.

Other methods of maintaining file system consistency have been developed (e.g., Featherstitch).

So the bottom line is that, in general, a database has no efficient and off-the-shelf (portable) way of knowing with certainty that the data it has asked to be written to disk is really safely there. If you want to understand this problem better, there are lots of discussions about it; here is a good place to start. For additional information, see: [2, 3, 4, 5, 6, 7, 8].

It is not only databases that are affected by this problem. Any program might be broken if it performs an operation that causes a file system modification that the program expects will survive a crash. For example, a program that simply emits consecutive integers without repetition, across any program failure or OS reboot, could encounter this problem. This is because it must save state in the file system, but as already discussed, this cannot be done reliably. Software that needs to atomically update an encryption or signing key, to guarantee that it will never be reused, must also be aware of these issues.

Perfect is the enemy of good enough.
Designers of portable, hardware-independent database systems would, of course, like their atomic commit algorithm to function correctly and have sufficiently good performance, but in a larger sense is this a realistic goal?

Even if the algorithm were correct, the non-zero probability of hardware errors (RAM, persistent storage device, etc.), software errors (operating system, BIOS, storage device firmware, etc.), or bugs in the database implementation itself mean that no absolute guarantee of correctness can be made. Is it sufficent to offer correct, efficient transactional behaviour with an imperfect though sufficiently high probability? If all assumptions and conditions are understood, perhaps this is a reasonable goal for a great many applications.

It is worthwhile to note, in conclusion, that it is standard practice for many applications to archive the raw data that is processed and added to a database. If transactions are lost as a consequence of a relatively rare error, it is possible to recover by re-applying those transactions to the last correct copy of the database. This suggests that it is good practice to take regular snapshots of database files and plan for this eventuality.

Dxstore

I have put a lot of work into a successor to tdbm called Dxstore. Although I believe it had a lot of good ideas, the project may have been too ambitious and probably suffered from second-system syndrome - its development was suspended a few years ago with several important pieces not yet implemented.

News

From time to time, a review of tdbm and dxstore is undertaken to decide whether development of one of them should be resumed, and what ought to be included and discarded from the feature set. Any new development will most likely be released under a BSD-style or GPLv2 license, possibly with a commercial licensing alternative. Some work has been done on increasing the reliability of tdbm with respect to disk caching. Replication has also been studied. Are these features that people need? If you think a new, alternative dbm-type db would be swell, please drop me a message of encouragement or suggestions.

Downloads

The release you will find here is now over 10 years old and certainly needs some work. It runs on many flavours of Unix, and probably also IEEE Std 1003.2 ("POSIX.2") compliant systems.

README
LICENSE
tdbm-1.2.tbz (122074 bytes)
tdbm-1.2.tgz (156347 bytes)
Manual page (PostScript, PDF, HTML, nroff source)
"TDBM: A DBM Library With Atomic Transactions", Proc. of the USENIX Summer Technical Conference, June 1992, pp. 63-80. (PostScript, PDF)