Describe a Tree (first attempt)

Everything big tends to be an array, a hash table, or a tree. And hash tables and trees are really arrays when you get down to it, and arrays are actually trees.

How it all works is everything looks like an array, which is a bunch of things one after another. An array lets you look up what's at position #45993 and it looks up #45993 as fast as anything. An array is just a list of things one after another, positions 0 1 2 3 4 and so forth, so you put things in it and you can look up what's at position #45993 fast as anything. But really, it isn't all one big long array ... no I'm getting ahead of myself.

A bunch of things you want to write down and look up aren't numbers. They're things like "John Smith." So you get a function, a "hash", that can take any phrase and turn it into a number. That way you can take "John Smith," hash it to get a number, use that number as the position to stick John's stuff, then later take "John Smith" again, hash it to get that number again, and look up John's stuff. That's a hash table. It's an array plus a hash.

But what if you want to look up all the Smiths, not just John? Hashes scatter the Smiths all over the place. You could look at all the billion spots in an array or hash table and see if it's a Smith, but that's slow. So you don't use hashes if you need to scan all Smiths. Faster is to put things in an array in order. Once you find the first Smith you just walk up the array and you find all the others, because sorting puts them all together. But how do you find the first one? If there's a billion spots, check if "Smith" is bigger or smaller than the half-billionth. Now you've only got a half-billion things to check, so you look in the middle and do it again, and again ... turns out if there were a billion things, you can find one particular thing after doing that 30 times. That's "binary search" on an ordered array.

This is all peachy if you've got a billion things all in order. But if you want to add a new thing in the middle, you've got to shift a half-billion things over by one to make room, which takes a long time.

Which brings us to trees. B+ Trees in particular.

They're a bunch of little arrays. Say 10 things each. If you've got under 10 things, it's just an ordered array, and if you want to add something in the middle and it's not full yet, you've got to shift about 5 things over to make room. But what if you've got 100 things? Then you get 10 of these little arrays with 10 things each, call them the leaves, and one top array of 10 things, call it the root. Put the first 10 in the first leaf, second 10 in the second, etc, and in the root, store just the starting point of each leaf in order. Instead of storing real stuff in the root, have each spot store how to find the leaf underneath it. To find something, do binary search on the root, that tells you what leaf to look in, then do binary search in the leaf. If you have 1000 things, have 10 of these root+leaves, and a new root above it, making the old roots just branches. And so forth. It's like a pyramid, where each stone rests on top of 10 stones underneath it. Or an upside-down apple tree, where you start at the root and flow down through splitting branches to get to the right leaf. Instead of arrays of 10, you could use arrays of 100 or 2048 or whatever is convenient. Also, link all the leaves together, so if you find the first "Smith", you can scan just its leaf and skip to the next leaf and so forth to find all the Smiths, rather than going down from the root every time. What makes a tree a B tree is that all the leaves and branches have like 10 things in them not just 2, and what makes a B tree a B+ tree is having the leaves all hooked together.

The place B+ trees really shine is when you want to add things. With a billion-item sorted array, adding something in the middle meant shifting over half a billion things. With B+ trees, it means shifting over 5 things in the leaf. If it gets full, you split it into two leaves next to each other and add one entry to the branch above them. If that gets full, you split it and add one entry in the branch in the next level up, and so forth. Adding and deleting things usually only costs as much as shifting things in leaf, and at worst it can only cost as much as shifting over things in one leaf, times the number of layers. If each little array has 2 things, a billion-item tree would be 30 layers deep. If each little array has 10 things, it's only 9 layers. If each little array has 1000 things, a billion things is only 3 layers.

Which brings us back to arrays. Arrays aren't really a billion billion long. They're really a bunch of little arrays. When you look up #402939842, you don't really just jump there. Instead, there's a bunch of little arrays all hooked together in a tree that's just pretending to be one big array that's billions long. And hash tables, you thought they were really a huge array plus a hash, well really they're a tree of little arrays pretending to be a big array, plus a hash.

And there you go. Arrays, hash tables, and trees. That's what all really big things look like.

This was in response to a prompt on reddit.com r/WritingPrompts, "Describe a tree."

Index of stories
Bob's web page