The key and value can be almost anything you like—for example, the value could be a JSON document. Logs are incredibly useful, and we will come back to them later.
In algorithmic terms, the cost of a lookup is O(n): if you double the number of records n in your database, a lookup takes twice as long. In order to efficiently find the value for a particular key in the database, we need a different data structure: an index. In this chapter we will look at a range of indexing structures and see how they compare. The general idea behind them is: keep some additional metadata on the side, which acts as a signpost and helps you to locate the data you want.
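To make the trade-off concrete, here is a small sketch (my own illustration, not from any particular database) comparing a full O(n) scan of an unindexed log with a lookup through a hash-map index:

```python
# A list of (key, value) records stands in for an append-only data file.
records = [("key%d" % i, "value%d" % i) for i in range(10_000)]

def scan_lookup(key):
    # O(n): examine records one by one. The last write for a key wins,
    # so we scan from the end of the log.
    for k, v in reversed(records):
        if k == key:
            return v
    return None

# The index: a hash map from key to the record's position in the log.
index = {k: i for i, (k, _) in enumerate(records)}

def indexed_lookup(key):
    # O(1) on average: the index points straight at the record.
    pos = index.get(key)
    return records[pos][1] if pos is not None else None
```

Both functions return the same results; the index simply trades extra memory and write-time bookkeeping for much faster reads.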
If you want to search the same data in several different ways, you may need several different indexes on different parts of the data. Maintaining additional structures is overhead, especially on writes. This is an important trade-off in storage systems: well-chosen indexes can speed up read queries, but every index slows down writes.
You can then choose the indexes that give your application the greatest benefit, without introducing more overhead than necessary. Key-value stores are quite similar to the dictionary type that you can find in most programming languages, which is usually implemented as a hash map (hash table). Since we already have hash maps for our in-memory data structures, why not use them to index our data on disk? The simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found.
Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote; this works both for inserting new keys and for updating existing keys. When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
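The read and write paths just described can be sketched in a few lines. The class and record format here are my own simplifications for illustration, not Bitcask's actual implementation:

```python
import io

class HashIndexedLog:
    """Append-only log with an in-memory hash map from key to byte offset."""

    def __init__(self):
        self.log = io.BytesIO()  # stands in for an append-only data file
        self.index = {}          # key -> byte offset of the latest value

    def set(self, key: str, value: str) -> None:
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        # CSV-like record; this sketch assumes keys contain no commas.
        self.log.write(f"{key},{value}\n".encode())
        self.index[key] = offset  # works for both inserts and updates

    def get(self, key: str) -> str:
        offset = self.index[key]  # raises KeyError if the key is absent
        self.log.seek(offset)
        _, value = self.log.readline().decode().rstrip("\n").split(",", 1)
        return value

db = HashIndexedLog()
db.set("42", "{'name': 'San Francisco'}")
db.set("42", "{'name': 'Oakland'}")  # update: the hash map now points here
```

Note that an update appends a new record rather than overwriting the old one; the hash map simply starts pointing at the newer offset, which is why the log needs the periodic compaction described below.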
Figure: Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map. In fact, this is essentially what Bitcask (the default storage engine in Riak) does. Bitcask offers high-performance reads and writes, subject to the requirement that all the keys fit in the available RAM, since the hash map is kept entirely in memory. The values can use more space than the available memory, since they can be loaded from disk with just one disk seek. A storage engine like Bitcask is well suited to situations where the value for each key is updated frequently. For example, the key might be the URL of a cat video, and the value might be the number of times it has been played (incremented every time someone hits the play button).
As described so far, we only ever append to a file—so how do we avoid eventually running out of disk space? A good solution is to break the log into segments of a certain size, and to periodically run a background process for compaction and merging of segments. Compaction means throwing away all key-value pairs in the log except for the most recent update for each key.
This makes the segments much smaller (assuming that every key is updated multiple times on average), so we can also merge several segments into one. Segments are never modified after they have been written, so the merged segment is written to a new file. While the merge is going on, we can still continue to serve read and write requests as normal, using the old segment files.
After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments—and then the old segment files can simply be deleted. Each segment now has its own in-memory hash table, mapping keys to file offsets. Lots of detail goes into making this simple idea work in practice. To briefly mention some of the issues that are important in a real implementation: File format: CSV is not the best format for a log. It is faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string.
Deleting records: If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key. Crash recovery: If the database is restarted, the in-memory hash maps are lost. In principle, you can restore each segment's hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key. However, that might take a long time if the segment files are large, which would make server restarts painful.
Partially written records: The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, which allow such corrupted parts of the log to be detected and ignored. Concurrency control: Writes are appended to the log in a strictly sequential order (there is only one writer thread), there are no transactions, and data file segments are append-only and otherwise immutable.
These facts make concurrency control fairly simple. An append-only design also turns out to be good for performance: appending and segment merging are sequential write operations, which are generally much faster than random writes. This performance difference applies both to traditional spinning-disk hard drives and to flash-based solid-state drives (SSDs).
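The compaction-and-merge process described above, including tombstone handling, can be sketched as follows (the helper name and tombstone marker are my own choices; real formats typically use a flag in the record header rather than a magic value):

```python
TOMBSTONE = "__tombstone__"  # assumed deletion marker, for illustration only

def compact_and_merge(segments):
    """segments: lists of (key, value) pairs, oldest segment first,
    with pairs inside each segment in write order."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value  # later records overwrite earlier ones
    # Discard keys whose most recent record is a tombstone.
    # The result would be written out as a new segment file.
    return {k: v for k, v in latest.items() if v != TOMBSTONE}

segments = [
    [("purr", "1"), ("yawn", "1"), ("purr", "2")],  # oldest segment
    [("purr", "3"), ("yawn", TOMBSTONE)],           # newest segment
]
merged = compact_and_merge(segments)
```

Because old segments are immutable, this merge can run in the background while reads continue against the old files, and the switch to the merged segment happens atomically afterwards.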
In principle, you could maintain a hash map on disk, but unfortunately it is difficult to make an on-disk hash map perform well. In the log segments described so far, key-value pairs appear in the order that they were written, and values later in the log take precedence over values for the same key earlier in the log. Apart from that, the order of key-value pairs in the file does not matter. Now we can make a simple change to the format of our segment files: we require that the sequence of key-value pairs is sorted by key. We call this format Sorted String Table, or SSTable for short.
We also require that each key only appears once within each merged segment file (the merging process already ensures that). SSTables have several big advantages over log segments with hash indexes: 1. Merging segments is simple and efficient, even if the files are bigger than the available memory. The approach is like the merge step of the mergesort algorithm: you start reading the input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat.
If the same key appears in several input files, keep the value from the most recent input file, and discard the values in older segments. This produces a new merged segment which is also sorted by key, and which has exactly one value per key. 2. In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. Say you are looking for the key handiwork, and you know the offsets of the keys handbag and handsome, but not of handiwork; because the file is sorted, handiwork must appear between those two keys. So you can jump to the offset for handbag and scan from there until you find handiwork (or not, if the key is not present in the file).
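This mergesort-style segment merge can be sketched with Python's heapq.merge; the recency convention here (index 0 is the oldest segment) is my own:

```python
import heapq

def merge_sstables(segments):
    """segments: key-sorted lists of (key, value), oldest segment first.
    Returns a merged, key-sorted list with one value per key."""
    # Tag each pair with -segment_index so that, for equal keys, the
    # record from the newest segment sorts first.
    tagged = (
        ((key, -i, value) for key, value in seg)
        for i, seg in enumerate(segments)
    )
    out = []
    for key, _, value in heapq.merge(*tagged):
        if not out or out[-1][0] != key:  # first occurrence = newest; skip older
            out.append((key, value))
    return out

merged = merge_sstables([
    [("apple", "old"), ("banana", "1")],   # older segment
    [("apple", "new"), ("cherry", "2")],   # newer segment wins on "apple"
])
```

heapq.merge only ever holds one record per input segment in memory, which is why this works even when the segment files are much bigger than RAM.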
You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient, because a few kilobytes can be scanned very quickly. 3. Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk. Each entry of the sparse in-memory index then points at the start of a compressed block.
Nowadays, disk bandwidth is usually a worse bottleneck than CPU, so it is worth spending a few additional CPU cycles to reduce the amount of data you need to write to and read from disk. Fine so far—but how do you get your data to be sorted by key in the first place? Our incoming writes can occur in any order. Maintaining a sorted structure on disk is possible (as we will see in the next section), but maintaining it in memory is much easier.
There are plenty of well-known tree data structures that you can use, such as red-black trees or AVL trees; with these structures, you can insert keys in any order and read them back in sorted order. We can now make our storage engine work as follows: when a write comes in, add it to the in-memory balanced tree. This in-memory tree is sometimes called a memtable. When the memtable gets bigger than some threshold, typically a few megabytes, write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. When the new SSTable is written out, the memtable can be emptied. To serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, and so on. This scheme works very well. It suffers from only one problem: if the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) are lost.
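The memtable-and-flush cycle can be sketched as follows (a dict plus sorted() stands in for the balanced tree, and the size threshold is illustrative only):

```python
MEMTABLE_LIMIT = 4   # flush after this many keys; real thresholds are in MB

memtable = {}        # stands in for the in-memory balanced tree
sstables = []        # flushed, key-sorted segments, oldest first

def write(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # Flush: emit all pairs in sorted key order as a new SSTable,
        # then start a fresh, empty memtable.
        sstables.append(sorted(memtable.items()))
        memtable.clear()

for k in ["cherry", "apple", "banana", "date"]:
    write(k, k.upper())   # the fourth write triggers a flush
```

Writes arrive in arbitrary order, but each flushed segment comes out sorted by key, which is exactly the SSTable property the merge and sparse-index tricks above rely on.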
In order to avoid losing those writes, we can keep a separate log on disk to which every write is immediately appended; that log is not in sorted order, but that does not matter, because its only purpose is to restore the memtable after a crash. Every time the memtable is written out to an SSTable, the corresponding log can be discarded. The algorithm described here is essentially what is used in LevelDB and RocksDB, key-value storage engine libraries that are designed to be embedded into other applications.
Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary. This is implemented with a key-value structure where the key is a word (a term), and the value is the list of IDs of all the documents that contain the word (the postings list). In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed. As always, a lot of detail goes into making a storage engine perform well in practice. For example, the LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you have to check the memtable, then the segments all the way back to the oldest (possibly having to read from disk for each one) before you can be sure that the key does not exist.
In order to optimize this, LevelDB maintains additional Bloom filters, which allow it to avoid many unnecessary disk reads for non-existent keys. (A Bloom filter is a memory-efficient data structure for approximating the contents of a set: it can tell you for certain that a key does not appear in the database.) However, the basic idea—keeping a cascade of SSTables that are merged in the background—is simple and effective.
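A minimal Bloom filter sketch (the bit-array size and hashing scheme are my own choices, unrelated to LevelDB's actual implementation):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions per key by hashing with k different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely never added"; True means "probably added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add("handbag")
bf.add("handsome")
```

If might_contain returns False, the key was definitely never written, so the storage engine can skip reading its segments from disk entirely; a True result may occasionally be a false positive, which merely costs one unnecessary read.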
Even when the dataset is much bigger than the available memory it continues to work well. Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys above some minimum and up to some maximum). And because the disk writes are sequential, the LSM-tree can support remarkably high write throughput.
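A range query over a sorted segment can be sketched with binary search (a Python list stands in for the SSTable file):

```python
import bisect

# Key-sorted segment; because of the sort order, all keys in a range
# [lo, hi] are stored contiguously.
segment = [("apple", 1), ("banana", 2), ("cherry", 3), ("date", 4), ("fig", 5)]
keys = [k for k, _ in segment]

def range_query(lo, hi):
    # Binary-search both ends of the range, then slice out everything between.
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return segment[start:end]

result = range_query("banana", "date")
```

In an unsorted log the same query would have to scan every record; the sorted layout turns it into two binary searches plus a sequential read.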
B-trees

The log-structured indexes we have discussed so far are gaining acceptance, but they are not the most common type of index. The most widely used indexing structure is quite different: the B-tree. B-trees remain the standard index implementation in almost all relational databases, and many non-relational databases use them too.
Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries.
The log-structured indexes we saw earlier break the database down into variable-size segments, typically several megabytes or more in size, and always write a segment sequentially.
By contrast, B-trees break the database down into fixed-size blocks or pages, traditionally 4 kB in size, and read or write one page at a time.
This corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks. Each page can be identified using an address or location, which allows one page to refer to another—similar to a pointer, but on disk instead of in memory. We can use this to construct a tree of pages, similar to a red-black tree, but with a much larger branching factor: each page may have hundreds of children, not just two.
Figure: Looking up a key using a B-tree index. One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here. Each child is responsible for a continuous range of keys, and the keys in the root page indicate where the boundaries between those ranges lie.
That takes us to a similar-looking page, which further breaks down its key range into sub-ranges. Eventually we get down to a page containing individual keys (a leaf page), which either contains the value for each key inline, or contains references to the pages where each value can be found. If you want to update the value for an existing key in a B-tree, you search for the leaf page containing that key, and update the value in place.
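The lookup procedure can be sketched as follows (the node layout is my own simplification; real implementations store each node as a fixed-size page on disk and have hundreds of children per node):

```python
class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # boundary keys (interior) or keys (leaf)
        self.children = children  # len(keys) + 1 children, or None for a leaf
        self.values = values      # parallel to keys, leaf nodes only

def btree_lookup(node, key):
    # Descend interior nodes: pick the child whose range contains the key.
    while node.children is not None:
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    # Scan the leaf page for the key.
    for k, v in zip(node.keys, node.values):
        if k == key:
            return v
    return None

leaf1 = Node(keys=[100, 150], values=["a", "b"])
leaf2 = Node(keys=[200, 250], values=["c", "d"])
root = Node(keys=[200], children=[leaf1, leaf2])  # keys < 200 go left
```

With a branching factor of hundreds, three or four such descents are enough to cover an entire large database, which is why B-tree lookups touch only a handful of pages.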
If you want to add a new key, you need to find the page whose range encompasses the new key, and add it to that page. If there is not enough free space in the page to accommodate the new key, it is split into two half-full pages, and the parent page is updated to account for the new subdivision of key ranges. This algorithm ensures that the tree remains balanced: a B-tree with n keys always has a height of O(log n). Figure: Growing a B-tree by splitting a page. The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page, i.e., all references to that page remain intact when the page is overwritten. You can think of overwriting a page on disk as an actual hardware operation. On a magnetic hard drive, this means moving the disk head to the right place, waiting for the right position on the spinning platter to come around, and then overwriting the appropriate sector with new data.
On SSDs, what happens is somewhat more complicated, but it is similarly slow. Moreover, some operations require several different pages to be overwritten. For example, if you split a page because an insertion caused it to be over-full, you need to write the two pages that were split, and also overwrite their parent page to update the references to the two child pages. This is a dangerous operation, because if the database crashes after writing only some of the pages, you end up with a corrupted index (for example, an orphan page that is not a child of any parent).
In order to make the database resilient to crashes, it is normal for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself.
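The write-ahead discipline can be sketched like this (the class and record format are my own illustration; a real WAL records page modifications and forces them to disk with fsync before the pages are touched):

```python
class WalStore:
    def __init__(self):
        self.wal = []    # stands in for the append-only log file
        self.pages = {}  # stands in for the B-tree pages on disk

    def put(self, key, value):
        self.wal.append(("set", key, value))  # 1. log the change first
        self.pages[key] = value               # 2. only then modify the page

    def recover(self):
        # After a crash: replay the log to restore a consistent state,
        # even if some page writes never made it to disk.
        self.pages = {}
        for op, key, value in self.wal:
            if op == "set":
                self.pages[key] = value

store = WalStore()
store.put("a", 1)
store.put("a", 2)
store.pages = {}   # simulate losing in-flight page writes in a crash
store.recover()    # the log brings the pages back
```

The key invariant is the ordering: because every modification reaches the log before the page, the log is always at least as up to date as the pages, so replaying it can never miss a committed change.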
When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state. B-trees therefore write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself. On the other hand, log-structured indexes also rewrite data multiple times due to repeated background merging. An additional complication of updating pages in place is that careful concurrency control is required if multiple threads are going to access the B-tree at the same time; otherwise a thread may see the tree in an inconsistent state. Log-structured approaches are simpler in this regard, because they do all the merging in the background without interfering with incoming queries, and atomically swap old segments for new segments from time to time.
Various B-tree optimizations have been developed over the years. For example, we can save space in pages by not storing the entire key, but abbreviating it: especially in pages on the interior of the tree, keys only need to provide enough information to act as boundaries between key ranges. Packing more keys into a page allows the tree to have a higher branching factor, and thus fewer levels. If a query needs to scan over a large part of the key range in sorted order, a page-by-page layout can be inefficient, because a disk seek may be required for every page that is read. Many B-tree implementations therefore try to lay out the tree so that leaf pages appear in sequential order on disk.
For random reads (the most common type of request in many applications), there is not a big performance difference between LSM-trees and B-trees; benchmarks are inconclusive. However, LSM-trees are typically able to sustain higher write throughput than B-trees, because they turn random writes into sequential writes of compact SSTable files rather than overwriting pages in the tree. This makes them appealing for write-intensive applications.
An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree.
In new datastores, log-structured indexes are becoming increasingly popular. So far we have only discussed key-value indexes, which are like a primary key index in the relational model. A primary key uniquely identifies one row in a relational table, or one document in a document database, or one vertex in a graph database. It is also very common to have secondary indexes.