tdbm is a dbm-type database [0, 1, 2] with nested atomic transactions. These kinds of lightweight databases are sometimes called embedded (or embeddable) databases because they must be used as a component of a bigger program, which can be as simple or complicated as required.
Dbm-type databases have a common core set of advantages:
When used in appropriate applications, they are a wonderful tool, which is why they are widely used behind the scenes in server software, web browsers, and a wide variety of applications. They are particularly well suited to managing indexes, which typically are comprised of relatively short key/value pairs, and caches. Easy-to-use interfaces to various dbm-type databases are provided by Perl, PHP, and many other languages.
Unlike the heavyweights, such as MySQL, PostgreSQL, Microsoft SQL Server, Oracle Database, IBM DB2, and other relational database management systems, dbm-type databases provide neither a sophisticated query language nor any management tools. If these capabilities are required, they are typically implemented for each application that embeds a database. Some dbm-type databases have limitations with respect to maximum key and data length that can make them unsuitable for particular applications, and many offer little in the way of concurrency control and support for multithreaded operation. Some do not handle deleting-while-traversing properly, and duplicate keys are sometimes not handled very well. For these reasons, a full-featured database may use a dbm-type database as a backend storage engine.
There are also non-dbm databases that can be embedded, such as SQLite and c-tree.
The dbm-type database most similar to tdbm is probably BerkeleyDB (bdb), which is perhaps also the best known and most widely used database of this kind. Although bdb is far more feature-rich than tdbm, it is also more difficult to learn and program, and it comes with different licensing terms.
I designed and implemented tdbm at around the same time as bdb was being created. Articles about both databases appeared in the same Usenix proceedings in 1992. tdbm was originally written to provide very fast data storage, access, and indexing for a multithreaded X.500 directory server. BerkeleyDB was acquired by Oracle Corporation from Sleepy Cat Software in 2006. tdbm has been used in several commercial products. The 1.2 version that is available here is from the code base released under a BSD-style license in 1994 by UBC (note: the contact information accompanying that version is obsolete).
tdbm distinguishes itself from most dbm-type databases in three ways:
Also, tdbm can store very large data items, supports memory-resident (volatile or non-persistent) databases, and offers very competitive performance.
tdbm can provide correct atomic transactions (for brevity, we'll just call them transactions from now on) in the face of application crashes, but not in the event of OS or storage crashes; that is, if an application that uses tdbm crashes for any reason and the OS and storage device (e.g., disk drive) continue to operate normally, then it will function as it should, provided any pending disk writes are eventually executed. But if the OS or storage device fail, tdbm cannot make any promises because it cannot know whether all the data it has written has actually reached the physical media, even if its API says it has - if all of its committed disk writes succeed before the crash, all will be well, otherwise one or more transactions may be lost. More about that in the next section.
Does tdbm provide true atomic transaction semantics? It can, but only if the operating system, file system, device driver, and storage device provide tdbm with the necessary semantics.
Any database, not just tdbm, that tries to give correct transactional semantics on top of ordinary file systems with IDE/EIDE or SATA drives (and perhaps SCSI as well) has a major challenge to deal with. The problem is this: at some point in its execution, a database must be able to guarantee that a committed transaction has succeeded and no matter what happens next (short of storage device destruction), the transaction will not be reversed. This is the durability property (the "D" in ACID) of transactions. But when disk write caching (of the write-back variety) is enabled on a drive, an OS (Linux, FreeBSD, Windows, and others) cannot make this guarantee unless it specifically addresses the issue. Disk write caching is important for disk performance, but at the cost of possibly losing data if a crash occurs [HOW TO: Manually Turn Disk Write Caching On or Off].
The FreeBSD fsync(2) and open(2) system calls and Linux fsync(2) and fdatasync(2) system calls force data from OS buffers to disk drive buffers, but not necessarily to the physical disk media [Re: scsi vs ide performance on fsync's]. This despite the IEEE Std 1003.2 ("POSIX.2") specification of fsync(2) which clearly prescribes the behaviour that we selectively require. Depending upon who you ask, the responsibility lies with disk drive manufacturers or disk driver writers. Although this deficiency has been recognized for a very long time, apparently (at this writing, at least) no solution has been implemented on Linux or FreeBSD.
While the OS might carefully manage its disk buffers, a disk drive may effectively undo these efforts and perform actual disk writes whenever it sees fit and in whatever order it likes. So even after fsync(2) or any other file system related system call succeeds and returns, a crash can still lose or corrupt data [scsi vs ide performance on fsync's]. For this reason, when disk write caching is enabled, correct transactions cannot be guaranteed.
Disabling disk write caching may be possible on some drives [SATA Disks that handle write caching properly?], but that will affect the performance of all operations on the disk; if you can dedicate the disk to database use, that might fix the problem. Or, if you are willing to customize the disk driver, you may be able to force the drive to flush its buffers to the physical media. A UPS can help handle some failure modes. Using a different type of storage device for the database, such as a solid-state drive or a device controller with a battery-backed write cache, may be another solution, provided its characteristics are well understood. A much more complicated, but probably much slower, distributed solution may work.
Recent versions of Linux apparently include support for write barriers for some file systems and storage devices; this mechanism ensures that all disk writes enqueued before the barrier entry are written to the media before the barrier entry is deemed to have been processed [Does your device honor write barriers?], [That massive filesystem thread]. With the Linux 3.1 kernel, barriers are enabled by default for the Ext3 filesystem and can be explicitly enabled or disabled using a mount option, although issues remain. But can database software reliably and automatically determine whether its platform provides the necessary support (even if it is able to determine at runtime whether barriers are enabled)?
Schemes can be developed that rely on checksumming disk-based transaction information (e.g., UNDO or REDO logs), or that write transaction information to reliable persistent storage, but they must have an infallible method of determining when it is safe to delete the information. And because a failure can also occur during the recovery procedure, recovery actions may not be durable, implying that it may never be 100% safe to delete the transaction information. In practice, isn't a crash or sequence of crashes fairly likely during a system reboot, during which time recovery processing would be in progress for server applications?
So the bottom line is that, in general, a database has no efficient and off-the-shelf way of knowing with certainty that the data it has asked to be written to disk is really safely there. If you want to understand this problem better, there are lots of discussions about it; here is a good place to start. For additional information, see: [2, 3, 4, 5, 6, 7, 8].
It is not only databases that are affected by this problem. Any program might be broken if it performs an operation that causes a file system modification that the program expects will survive a crash. For example, a program that simply emits consecutive integers without repetition, across any program failure or OS reboot, could encounter this problem. This is because it must save state in the file system, but as already discussed, this cannot be done reliably.
Designers of portable, hardware-independent database systems would, of course, like their atomic commit algorithm to function correctly and have sufficiently good performance, but in a larger sense is this a realistic goal? Even if the algorithm were correct, the non-zero probability of hardware errors (RAM, persistent storage device, etc.), software errors (operating system, BIOS, storage device firmware, etc.), or bugs in the database implementation itself mean that no absolute guarantee of correctness can be made. Perhaps it is sufficent to offer correct, efficient transactional behaviour with an imperfect though sufficiently high probability? If all assumptions and conditions are stated, perhaps this is a reasonable goal.
I have put a lot of work into a successor to tdbm called Dxstore. Although I believe it had a lot of good ideas, the project may have been too ambitious and probably suffered from second-system syndrome - its development was suspended a few years ago with several important pieces not yet implemented.
A review of tdbm and dxstore is underway to decide whether development of one of them should be continued, and what ought to be included in the feature set and what should be left out. Any new development will most likely be released under a GPLv2 license with a commercial licensing alternative. Some work has been done on increasing the reliability of tdbm with respect to disk caching. Replication has also been studied. Are these features that people need? If you think a new, alternative dbm-type db would be swell, please drop me a message of encouragement or suggestions.
The release you will find here is now over 10 years old and certainly needs some work. It runs on many flavours of Unix, and probably also IEEE Std 1003.2 ("POSIX.2") compliant systems.
If you've used tdbm for anything, or if you have any comments, I'd like to hear from you.
Please contact me at: brachmanATdss.ca
|Last updated: 3-Feb-2013||© 1998-2013 DSS Distributed Systems Software Inc.||email@example.com|