Bw-tree is a “new” form of B-tree for modern hardware. The architecture of Bw-tree consists of two parts: an in-memory, latch-free data structure for modern CPU and a persistent, log-structured store for modern SSD.
The first Bw-tree paper was published in 2013 by Microsoft Research. While the performance looks promising, there are few discussions about it these years. We think the primary reason is the lack of a state-of-the-art implementation, especially in the open-source world. Think about what LevelDB and RocksDB mean to the LSM-tree. If Google and Facebook didn’t open source their works, there would be much less study about the LSM-tree in the past decade.
Moreover, implementing a decent Bw-tree requires much more detailed designs and optimizations than the original paper provides. We will not get excellent performance by simply following the paper. This is not a new story, as CMU researchers have already revealed it in the Open Bw-tree paper in 2018. That paper closes the gap by supplying more details and improvements over the original paper. However, while the Open Bw-tree shines some light on the in-memory part, the persistent part remains unexplored, which is more valuable for any system that wants to handle larger than memory data sets.
PhotonDB is an experimental project to build a high-performance data store in Rust. The first thing we need is a powerful storage engine. Being an experimental project, we want to take a chance to advance the state-of-the-art instead of reinventing the wheel. Based on past research and experience with storage engines, we decided to start with the Bw-tree.
In this post, we first review the B-tree and LSM-tree based on HDD and SSD, respectively. Then we introduce the Bw-tree and its comparisons with the B-tree and LSM-tree. It is important to note that we are not trying to dive into the details of different data structures here. Our main goal is to explain the rationales of our choice by comparing the B-tree, LSM-tree, and Bw-tree at a high level. It is also worth noting that our discussions mainly focus on OLTP applications.
(For simplicity, we use “B-tree” as a general term for the B+ tree and similar variants in this post.)
The B-tree and LSM-tree for HDD
First, let’s date back to the hard disk drive (HDD) era for a while. Two factors determine HDD performance: the seek time and the data transfer time. For OLTP applications, the data transfer size per operation tends to be small, so we don’t have much to do with the data transfer time here. On the other hand, the seek time is usually the critical factor for OLTP performance.
The B-tree uses a large branching factor on internal nodes to reduce the tree’s height, minimizing the required number of seeks to reach the desired data. However, while the B-tree provides optimal read performance, its write performance is not so good. Every write must locate the corresponding leaf node first and then perform a read-modify-write to update the node. This in-place update is expensive because of the initial cost of seeks and the high write amplification. For example, if we want to update a 128B record in an 8KB node, we have to read and write the entire node, which results in a 64 amplification. In this case, we can only do 10MB/s substantial writes on a disk with 640MB/s bandwidth. This makes the B-tree a bad choice for write-intensive applications, especially when the record size is small.
On the contrary, the LSM-tree solves the exact problems B-tree has for write-intensive applications. The essential idea of the LSM-tree can be summarized as batched writes and multi-component merges. Batched writes amortize the cost of individual operations to one seek and one sequential write per batch. The seek time becomes negligible when batches are large enough, and sequential writes are very efficient for HDD. On the other hand, the multi-component merges reduce the write amplification. Suppose we have a 64GB data set, and we batch 1GB writes before merging them into the data set. We still get a 64 write amplification. However, if we introduce another component to buffer 8GB delta data in between, the amplification between every two components becomes 8, which reduces the total amplification of this three-component LSM-tree to 16. We can further reduce the amplification by adding more components when necessary.
But the multi-component structure of the LSM-tree also comes with a cost. That is, the more components we have, the more seeks we need for reads because we have to search all components one by one to find the desired data. Nevertheless, the LSM-tree allows us to tune read and write performance by adjusting the number of components for different applications.
In fact, we think the LSM-tree is more a framework to trade off read and write performance than a specific data structure. We can apply the same framework in different ways to serve our purposes. The original LSM-tree paper applies this framework to B-trees. It employs a B-tree-like structure in each component. So, in a sense, we can regard B-tree as a one-component LSM-tree. We can also do it another way by applying the LSM-tree framework to a set of sorted files instead of B-trees. In this way, we avoid the complexity of maintaining the B-tree structure, which is precisely the fundamental idea of LevelDB. We will talk more about this framework later.
In summary, for HDD, the B-tree provides optimal read performance, while the LSM-tree gives us a framework to trade off between optimal read and write performance. So we can say that both B-tree and LSM-tree are in the optimal design space for HDD.
The B-tree and LSM-tree for SSD
Now we come to the solid-state drive (SSD) era. SSD changes some fundamental assumptions the B-tree and LSM-tree make about storage devices. SSD eliminates the seek time and provides very fast random read performance. High-end SSD today offers over a million IOPS, so we can fully utilize its bandwidth even with small random reads. However, small random writes are still expensive for SSD due to the hardware characteristics. When we overwrite a small page in place, SSD may need to erase and rewrite a large block internally, which results in extra write amplification. Moreover, internal blocks can only be erased a limited number of times before they fail. Hence, small random writes are subject to high write amplification and vulnerable hardware lifespan. In short, for SSD, while both random reads and sequential reads are very fast, sequential writes are still significantly cheaper than random writes.
With these new assumptions about SSD, we should review the design of the B-tree and LSM-tree.
Let’s talk about the B-tree first. Even though the seek time is gone for SSD, reducing the tree’s height with a large branching factor is still a good idea because it minimizes the number of random reads we need to reach the leaf node. In addition, SSD solves another problem we haven’t mentioned before, which is the performance of range scans. B-tree usually doesn’t store leaf nodes in sorted order on disk. So, for range scans, we must locate the corresponding leaf nodes one by one to read the desired data. This is not good for HDD, but it is OK for SSD, thanks to its excellent random read performance. Therefore, we can say that the B-tree still provides optimal read performance for SSD.
However, in-place updates in the B-tree become a bigger problem than before due to the extra write amplification introduced by SSD. Depending on the size of an erasable block in SSD, overwriting an 8KB node may lead to more than tenfold internal write amplification. In this case, for the example we gave above, the actual amplification of updating a 128B record in an 8KB node can be over 1000. Of course, the number will not be so scary in practice since both the SSD controller and the file system try very hard to reduce the amplification with log-structured techniques. But still, we think SSD makes B-tree a worse choice for write-intensive applications than HDD.
Now we turn to the LSM-tree to see if it works well with SSD. Actually, the batched writes and multi-component merges make the LSM-tree an excellent choice for write-intensive applications on SSD. The LSM-tree does no random writes at all, which makes it very write-friendly to SSD. As for reads, the read amplification that comes with the multi-component structure still exists, but the cost is much cheaper now since random reads on SSD are much faster than on HDD. And for range scans, the LSM-tree usually stores data in sorted order within each component. So range scans only require a few random seeks and large sequential reads, which is very efficient for both HDD and SSD. Therefore, we think SSD makes the LSM-tree a good choice for more applications than it does on HDD.
But this is not the end of the story. While the LSM-tree works well with SSD, it hasn’t fully revealed the potential of SSD yet. The secret lies in that the LSM-tree always stores data in sorted order, which is necessary for HDD but not SSD. Modern SSD performs comparably well on random and sequential reads. This opens an opportunity for the LSM-tree to go further. As a result, some LSM-tree variants have been proposed.
One interesting direction is key-value separation proposed by WiscKey. Key-value separation trades more random reads for lower write amplification. The essential idea is to separate keys and values to avoid the unnecessary movement of values along with the process of multi-component merges. The cost is that it requires an extra random read to retrieve the corresponding value. But this is a reasonable trade-off on SSD when values are relatively large. However, one concern on key-value separation is the complexity it introduces. It has to maintain two separated stores, the primary LSM-tree store for keys and the secondary log-structured store for values. The key store needs to merge now and then to maintain a reasonable read and space amplification. The value store needs to garbage collect obsoleted values regularly to reduce space amplification, which in turn causes extra writes to the primary store. So, to balance the read, write, and space amplification, we must co-tune these two stores very carefully for different applications.
On the other hand, the B-tree naturally leverages fast random reads to improve its point lookups and range scan performance on SSD. And it doesn’t need special attention to large values since values are updated in place without unnecessary movement. In fact, larger values have lower write amplification in the B-tree. However, the internal write amplification caused by random writes remains a problem on SSD.
So, at this point, we can see that neither the B-tree nor the LSM-tree really takes full advantage of modern SSD. This makes us rethink the existing data structures and explore new ones.
Let’s talk about the Bw-tree
Our review above shows that the B-tree is a more natural structure to leverage random reads, while the LSM-tree is clearly the winner for writes. So, we can’t help to wonder, is it possible to combine the best of both worlds?
The original LSM-tree paper already employs the B-tree in a way, as we introduced above. But what if we do the opposite to put the LSM-tree inside the B-tree? Specifically, what if we apply the LSM-tree framework on a per node basis in the B-tree? Well, the answer is the Bw-tree.
Let’s put it straight here. We think the Bw-tree is the result of applying the LSM-tree framework on the B-tree at the node level. Sounds fantastic, isn’t it? We will give more details about it in the rest of this post. While the Bw-tree also proposes an interesting latch-free in-memory data structure, other alternatives that outperform the Bw-tree for in-memory workload exist. So the in-memory part is not the primary factor that makes the Bw-tree so appealing to us. In fact, the magic of the Bw-tree lies in how the in-memory part works with the underlying log-structured store to fully utilize the power of modern SSD.
In the Bw-tree, each node is a chain of immutable delta pages instead of a single flattened page. When the length of the delta chain reaches a threshold, delta pages are consolidated into a base page. We can further improve this design by applying the idea of multi-component merges from the LSM-tree. That is, we don’t have to consolidate all delta pages into the base page every time. We can introduce some “components” in between and consolidate small delta pages into larger ones before merging them into the base page. And when we need to persist a node, we only need to flush new delta pages instead of rewriting the entire node every time. This reduces the write amplification (and increases the read amplification) in the same way as the multi-component merges do to the LSM-tree, but on a per node basis. In addition, the Bw-tree writes pages to a log-structured store instead of overwriting them in place. This avoids the internal write amplification caused by random writes in the B-tree. Meanwhile, the Bw-tree still maintains a B-tree structure at the top level, which brings some interesting advantages over the LSM-tree.
We can first compare the cost of the LSM-tree and Bw-tree at an abstract level. The cost of the LSM-tree consists of the cost of the multi-component structure on the top and the cost of the B-tree within each component. On the contrary, the cost of the Bw-tree consists of the cost of the B-tree on the top and then the cost of the “multi-component” delta chain within each node. So, in theory, the LSM-tree and Bw-tree should have the same read, write, and space amplification, especially for uniformly random workloads.
However, in practice, many factors can affect performance. First of all, real-world applications usually access data with some skewness instead of in a uniform distribution. The direct impact of skewness is that some data is updated more frequently than others. For data that is rarely updated, we don’t want to merge them again and again. This implies that we should use a differential algorithm to merge data of different hotness, which is not a novel idea. The LFS paper already studied a similar problem 30 years ago and gave us an algorithm to treat hot and cold data differently to reduce write and space amplification. One prerequisite of the said algorithm is the ability to classify and separate data of different hotness. While extending the LSM-tree for hot-cold separation is also possible, the Bw-tree naturally divides data with different access patterns at the node level. This kind of data localization gives the Bw-tree a fine-grained control of the merge process on a per node basis, which opens a door for other valuable optimizations.
The beauty of data localization on a per node basis is that merges become lightweight enough to be done online. As a result, any thread can consolidate the whole delta chain of a node in a few microseconds whenever necessary. This eliminates the necessity of switching to background threads to merge a big chunk of data from time to time, which is usually the root cause of latency spikes in the LSM-tree. Of course, the Bw-tree still needs background garbage collections (GC) for the log-structured store. But garbage collections are much cheaper and less urgent than merges in the LSM-tree. The performance of the Bw-tree will not deteriorate even if it doesn’t GC at all. This is because the Bw-tree never touches obsoleted data on reads and writes. So, as long as we have enough space, we don’t need to hurry on GC. For example, under severe IO pressure, we can completely stop GC without hurting performance. And since storage capacity is usually the cheapest resource for storage engines, it is reasonable to trade off space amplification for performance as needed.
Another benefit of online consolidation is that we can fix performance outliers in time instead of waiting for background threads to merge data down to the bottommost component. For example, LSM-tree implementations usually support snapshot reads through multi-version concurrency control (MVCC). In this case, excessive deletions within a small range of data will generate a lot of garbage around the data, heavily affecting the read performance. In these LSM-tree implementations, we have to wait for background merges to clean up the garbage at an uncertain time, which makes the situation uncontrollable. Conversely, a Bw-tree implementation can detect these outliers on reads and then consolidates the corresponding data to fix the problem immediately.
Furthermore, thanks to the data localization and online consolidation, the Bw-tree nicely avoids some difficult problems in file-based LSM-tree implementations.
One of the well-known problems is unexpected cache eviction. File-based LSM-tree implementations usually cache files on a per block basis, which means that blocks in the cache are bound to the underlying files. This raises the problem that as files are constantly merged and discarded, the corresponding blocks in the cache must be evicted as well, even if these blocks are still very hot. So we may observe a lot of cache misses after every merge, affecting the stability of reads. Fortunately, this problem doesn’t exist in the Bw-tree. When the delta chain of a node is consolidated, the result is installed back in the cache immediately. So, as long as a node is hot, it will remain in the cache no matter how many times it is consolidated.
Another not-so-well-known problem of file-based LSM-tree implementations is unnecessary data movements. The problem occurs when files in adjacent components are false overlapped. Since files are relatively large (usually tens of megabytes), they may contain sparse data that spans an extensive data range. When a sparse file is merged with overlapped files in the next component, it may cause a lot of unnecessary data movements. This is because even though the ranges of these files overlap, the individual records inside them may have no intersection. Consequently, most data records are simply copied from the original file to a new one during merges. This problem can be alleviated by using smaller files to reduce the chance and the cost of false overlapped. If we go to an extreme by storing one record per file, there will be no unnecessary data movements. Unfortunately, this is unacceptable to most LSM-tree implementations due to various limitations, such as the file system. However, since data in the Bw-tree is localized per node basis, and nodes are so small (usually a few kilobytes) compared to files in the LSM-tree, the problem becomes very insignificant.
Finally, compared to the B-tree, if we put the idea of delta pages aside, the Bw-tree is simply a latch-free, write-friendly B-tree for SSD. And the additional delta pages allow us to trade off read and write amplification when necessary.
Based on our review of the B-tree, LSM-tree, and Bw-tree, we believe the Bw-tree is a more powerful data structure for storage engines that focus on modern SSD. The Bw-tree has the potential to combine the best of the B-tree and LSM-tree. And it provides the flexibility to trade off the read, write, and space amplification for different applications. Therefore, we think that the Bw-tree will be an excellent choice for PhotonDB in the long term.
At the moment, we are working on the very first version of our Bw-tree engine, and we will share more implementation details along with the development. If you are interested in the progress, you can check out the project.