Skip to main content

Boeing B Tree's

In July 1970 Rudolf Bayer and Mcreight from Boeing Scientific research laboratories published the original B-Trees paper in the mathematical and information sciences report.  They never said what BTrees stand for, and it could well be Boeing Trees though Balanced Tree's is also a good reminder of what they are.

When studying BTrees before getting to the actual algorithm, wouldn't it be nice if we understand the exact motivation of the people who actually published the paper, the original people who thought of this idea.

They explain it in their paper, first just remember we are talking about 1970, this was before most of us were born, computers were slow, really slow.

In the paper they said that they are working on the organization and maintenance of index for dynamic random access file.  Let's dissect it, we know what index is, this is a thing that allows us to locate fast items in our data without scanning the whole data right.  So they want to understand and suggest better how to organize and maintain such an index.

So one of the key issues in btrees is not only search but sequential access to items once you have the page! if I'm in a node and this node has 10 keys then I can sequentially get the keys from smaller to larger in that page.

They use a random access backup store - I'm emphasizing this, they are talking about a backup store so this must be very slow they say it's like a disc or a drum.

They want to achieve from this index, retrieval, insertion and deletion of keys which all would be in logarithmic time relative to the size of the index.

In the actual experiments they did they had up to 100K indices, and they maintained it with their BTree method which they are going to present with at least 4 transactions per second, on an IBM 360/44 with a 2311 disc.

So in this paper they talk about how to dynamically maintain this index, using random access files, doing key deletion insertion retrieval paging, why paging? Because they can't hold everything in memory, so they need paging, this is a core concept of the BTree the fact you can't hold the whole index in memory and you need to page in and out.

So let's think what we want of such a datastructure and this would enable us to build this BTree before knowing what this BTree is about, at least except for the algorithms of insertion and deletion etc but mainly of the core structure of this BTree.

We want to be able to search in logarithmic time, we need therefore a kind of search tree, but in standard search tree if you insert all items in a linearly increasing way you might end up with a simple line tree, so we need to rebalance this tree we must otherwise if we don't rebalance this tree then it won't be logarithmic insertion deletion and search.

In addition, as our memory is limited, we want to store part of the index in memory, once we store some part of the tree in memory let's call it a page even if it contains multiple keys ordered, we could just linearly go through them.

Now why would we want more than one item in a single page, we not only 2 , because we store it on disk, once you store stuff on disk it becomes slow much slower than memory, the fact that disk is slower than memory means we want to store them sequentially so that at least moving from one item to another in a disk would be fast because we only need to move the head of the disk a little bit further and not scan or move through the whole thing, or even better if it's on the same block we could just read this page, and we read multiple items.

So now we understand that the tree need to be rebalanced and also that we want to store blocks so meaning each node of the tree would store multiple keys and not just once so that we are efficient with disc.

Within each page P the keys are sequential in increasing order x1, x2, x3, except for the root page, each such page contains l+1 pointers to p0,p1, ... pl to the sons of P.

I guess this was an impolitely correct era as they call the children sons! :)

And now to the actual algorithm for insertion deletion etc.

We want to avoid a whole line in tree we want it to be balanced tree not an unbalanced tree.

Unbalanced would be O(n) and balanced should be O(log(n)).

For search insert and delete operations.

Each note in BTree have more than 1 key.  for example root could have 2 keys and children of root should, could have more than 2 like 3 keys.

Each node can have 3 nodes, left would be smaller than all keys in current parent, all nodes in middle node would be like both smaller and larger meaning you could fit them in the middle of the current parent.
All nodes of the right child all bigger than the current parent node.

You have k+1 children for each node if a new has 2 keys then you would have 3 children, one more of children then the number of keys in each node.

All leaves would appear in same level in BTree we wanted balancing tree.

So you want to insert a new node you start from root and find the place, but if node is full you split the node to two, one node would have the larger values, the medium value would be moved up the tree.

Each BTree tree has what's called a degree, the degree is max children a node can have and it can have degree-1 keys in each node.

You insert a new number you move down the tree to find where it fits, if it's full you split and promote the median to parent node.

If you try to prompt and move the new key to parent node then you have to split the parent, but then once parent is being split if you can't just split it then you need to increase the height of the tree by one by creating new nodes and splitting and repeating this process.

So you first find the first leaf node you can insert, if can't insert it to this leaf split the node and then you could split parent and split grandparent etc.

Let's see an example let's create a btree of degree 3 which means 2 keys in each node and 3 children for each node.

We insert 5 we have a single root node with 5 all is fine.
Insert 10 - still all is fine we just have still the root node with both 5 and 10 keys however as the degree is 3 next time we want to insert a node we would need to split the root.

we insert 15 we split the root node which had 5 and 10 now it has only 10 which is median and on left child node we have 5 and on right children nodes we have 15.

Add 7 we start from root node 7 is smaller than 10, so we move below and add it inside the key sub child of 5 so left children has both 5 and 7

And now we add 4 it's less than root node of 10, so we move to the left which has 5 and 7 keys we add it but now we have 3 keys, so we split the node, and we pass the 5 which is the median to the parent.

So now our root node has 2 keys! 5 and 10! And 4 and 7 are on the left sub child.

To sum up

B trees is a self balancing tree data structure that mains search sequential access insertions and deletions in logarithmic time.


Popular posts from this blog

Dev OnCall Patterns

Introduction Being On-Call is not easy. So does writing software. Being On-Call is not just a magic solution, anyone who has been On-Call can tell you that, it's a stressful, you could be woken up at the middle of the night, and be undress stress, there are way's to mitigate that. White having software developers as On-Calls has its benefits, in order to preserve the benefits you should take special measurements in order to mitigate the stress and lack of sleep missing work-life balance that comes along with it. Many software developers can tell you that even if they were not being contacted the thought of being available 24/7 had its toll on them. But on the contrary a software developer who is an On-Call's gains many insights into troubleshooting, responsibility and deeper understanding of the code that he and his peers wrote. Being an On-Call all has become a natural part of software development. Please note I do not call software development software engineering b

Containers - Quick Low Level Guide

Containers Kernel, namespace, cgroups Kernel space and user space Before we actually get to explain containers let's define what is a kernel.  Because you know there is no such thing in reality as a kernel it's only how we name things, and different people name things differently. cgroups, namespaces, UFS We are going to discuss containers, cgroups, namespace, UFS, hypervisor, user space, kernetl space and more.   When we say "kernel" we mean this.  We have the hardware, this is not the kernel, now above the hardware we have a few layers of software, imagine now two boxes. User mode is all the application you run while the kernel is the lower level is all the virtual memory management scheduling, connection to hardware devices, network drivers, it's basically the abstraction on top of the hardware + the basic services which allow this. One box is closer to the hardware and contains a few layers, the second box sits on top of the kernel box and contains

API Design Paper Summary and Review

API Design Paper Summary Introduction Is building API a solvable question, how far can we get into having good API’s and what is a good API at all? these are all very hard questions, usually you know the answers once you designed multiple APIs and got experience and then reviewed the decisions you have taken. Fortunately there are papers dealing with this problem exactly, for complex API’s used by a huge amount of people, the Qt API for example a very populate framework for desktop GUI building, and today we are going to go through a summary of that paper. “The Little Manual of API Design” is a very nice paper written by Jasmin Blanchette has released a paper while working in trolltech, which is a Nokia company. I found it to be very clear and concise, and reassuring what we think of API design. It’s a difficult task that includes both artistic, social, programming and scientific skills. We are going to summarize this paper for you. When you write an API you combine a set o