Bthash is a subroutine library for two types of databases: btree and hashing. Btree stands for balanced tree. To hash means to scramble. The data in a hash database is dispersed evenly throughout the data space. The hashing algorithm was created to prevent data from bunching in the data space like galaxies in the sky. The btree algorithm was created to prevent one side of the tree from becoming more populated with data than the other side. Thus every record in a leaf node has a path of the same length to the top of the tree.
Here is a comparison of btree and hash databases.
Your telephone company keeps records of your phone calls by your telephone number. Your telephone number is the key to your telephone record. The key may also include the called number for your telephone call.
The Ford Motor Company keeps a record of every automobile made by vehicle identification number. You can trace the ownership history of your automobile by VIN. The VIN is the key to the record in the database.
The license tag on your automobile is randomly chosen for each registered owner. If you are driving between two countries, the customs agent records your license number coming and going across the border. The license number is the key to your property, the automobile. When the driver leaves the country, the record is deleted from the database. This type of database has a high turnover rate. There are thousands of inserts and deletes every day.
In a hashing database, the keys are all unique and randomly chosen. The sort order of the keys is meaningless.
You want to buy Shakespeare's Macbeth at the book store. You look for the book by author and title. The author and title together are an example of a sorted key for a database in a book store.
You are visiting San Francisco, and you want to call your friend while you are there. You look up your friend's telephone number in the telephone book by last name, first name, street, city, and state. This is the key to the record in the telephone book that contains the telephone number. The key has to be sorted, so you can find it quickly.
You want to find the meaning of leukocyte in the dictionary. The word is the sorted key to the definition in the dictionary. The data portion of the record is the pronunciation, variant spelling, etymology, part of speech, and definition.
In a btree database, all the keys are unique and sorted. You can look up the key if you only know the first few letters. The btree database returns the next key greater than or equal to the key that you gave it. This is called a partial key look-up.
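The partial key look-up described above can be sketched in a few lines. This is only an illustration on a sorted Python list, not the bthash interface; the function name partial_key_lookup and the sample words are assumptions made for the example.

```python
import bisect

def partial_key_lookup(sorted_keys, partial):
    """Return the first key greater than or equal to `partial`, or None."""
    # bisect_left finds the leftmost position where `partial` could be
    # inserted while keeping the list sorted.
    i = bisect.bisect_left(sorted_keys, partial)
    return sorted_keys[i] if i < len(sorted_keys) else None

# Hypothetical dictionary keys, already in sort order.
keys = ["leukemia", "leukocyte", "leukoma", "lever"]
print(partial_key_lookup(keys, "leuko"))  # leukocyte
```

Knowing only the first five letters is enough to land on the first matching word. A key that sorts after every stored key returns None in this sketch.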
The hashing database stores the records randomly in the database, so they are spread evenly throughout the database and can be retrieved very quickly. The insertion program scrambles the key and then divides the scrambled value by the total number of records, that is, the number of record addresses in the database. The remainder is the record address.
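The scramble-then-divide step can be sketched as follows. MD5 is used as the scrambler here only because bthash ships MD5 subroutines; how bthash actually combines the digest with the division is an assumption, and the names record_address and table_size are hypothetical.

```python
import hashlib

def record_address(key, table_size):
    """Scramble `key`, then take the remainder after dividing by table_size."""
    # Scramble: the MD5 digest is 16 bytes that look random for any key.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    scrambled = int.from_bytes(digest, "big")
    # Divide: the remainder always falls in the range 0 .. table_size - 1.
    return scrambled % table_size

# A hypothetical license number hashed into a database of 1000 addresses.
address = record_address("ABC1234", 1000)
```

The same key always produces the same address, which is what makes retrieval fast: the look-up repeats the calculation instead of searching.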
Frequently, several keys hash to the same address. This is called a collision. When this happens, the database creates a queue of all records with the same address. Normally the average queue length in the database is not greater than two overflow records per hashing address.
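A collision queue can be sketched with one list per hashing address. This is a minimal illustration, not the bthash file layout; the class ChainedTable is hypothetical, and its hash is deliberately weak (a sum of the key's bytes) so that a collision is easy to produce.

```python
class ChainedTable:
    """Toy hash table: each address holds a queue of (key, data) records."""

    def __init__(self, table_size):
        self.slots = [[] for _ in range(table_size)]

    def address(self, key):
        # Deliberately weak scramble (assumption for the demo): keys with
        # the same letters in a different order collide.
        return sum(key.encode("utf-8")) % len(self.slots)

    def insert(self, key, data):
        queue = self.slots[self.address(key)]
        for i, (existing_key, _) in enumerate(queue):
            if existing_key == key:      # keys are unique: overwrite
                queue[i] = (key, data)
                return
        queue.append((key, data))        # collision: join the queue

    def find(self, key):
        for existing_key, data in self.slots[self.address(key)]:
            if existing_key == key:
                return data
        return None

    def delete(self, key):
        queue = self.slots[self.address(key)]
        queue[:] = [(k, d) for k, d in queue if k != key]

table = ChainedTable(8)
table.insert("ab", "first car")
table.insert("ba", "second car")   # same address as "ab"
```

Both records remain retrievable because a look-up walks the short queue at the computed address and compares the full key.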
The btree database starts out with an empty root block. When the root block fills up with new records, it splits, and the result is three blocks. The left half of the root block becomes a new block, only half filled. The right half of the root block becomes another new block, also only half filled. The middle record in the root block is promoted into the new root block. The new root block now has only one record in it, plus two pointers, one to the left half and one to the right half of the original root block.
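The root split can be sketched on an in-memory list of sorted records. The dictionary layout and the name split_root are assumptions made for illustration; bthash's on-disk block format is not shown in this document.

```python
def split_root(root_records):
    """Split a full, sorted root block into a new root and two half blocks."""
    mid = len(root_records) // 2
    left = {"records": root_records[:mid]}        # left half, half filled
    right = {"records": root_records[mid + 1:]}   # right half, half filled
    # The middle record is promoted; the new root keeps two pointers.
    return {"records": [root_records[mid]], "children": [left, right]}

# A hypothetical root block that holds five records when full.
new_root = split_root([10, 20, 30, 40, 50])
print(new_root["records"])                 # [30]
print(new_root["children"][0]["records"])  # [10, 20]
print(new_root["children"][1]["records"])  # [40, 50]
```

After the split every record is still present exactly once, and the tree is one level taller.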
In a btree database, we say that the database grows upward. If you have two levels in the database and 100 records per block, you have room for about 10 thousand records (100 times 100). If you have three levels, there is room for about 1 million records.
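The capacity figures follow from simple arithmetic, sketched below. The first function is the rule of thumb used in the text. The second is a slightly more careful count under two stated assumptions: every block is completely full, and an index block holding r records has r + 1 child pointers. Both function names are hypothetical.

```python
def approximate_capacity(records_per_block, levels):
    """Rule of thumb: each extra level multiplies capacity by the block size."""
    return records_per_block ** levels

def full_tree_capacity(records_per_block, levels):
    """Records in a completely full tree, counting index blocks as well."""
    # With r records and r + 1 pointers per block, a full tree of L levels
    # holds (r + 1) ** L - 1 records.
    return (records_per_block + 1) ** levels - 1

print(approximate_capacity(100, 2))  # 10000
print(approximate_capacity(100, 3))  # 1000000
print(full_tree_capacity(100, 2))    # 10200
```

In practice blocks are rarely full, since each split leaves two half-filled blocks, so the real capacity at a given height is lower than either figure.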
The higher levels in the btree are called index nodes, and the lowest level blocks are called leaf nodes. The btree is always balanced symmetrically, because each split creates new blocks of equal size. Every leaf node is the same number of levels away from the root node.
All the records in the same block are sorted. The records in a leaf node lie to the left of a record in an index node if they are lower in the sort order than that record. They lie to the right of it if they are higher in the sort order.
Here is a very simple example of three records and two levels.
      B
     / \
    A   C
The original root block had three records in it, A, B, and C. Then the root block split and we now have three blocks and two levels. The records are sorted, so they can be read sequentially.
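Sequential reading of the three-record example can be sketched as a left-to-right walk of the tree. The nested dictionary layout and the name read_in_order are assumptions made for illustration, not bthash structures.

```python
def read_in_order(block):
    """Yield every record under `block` in sort order."""
    children = block.get("children", [])
    if not children:
        # Leaf node: its records are already sorted.
        yield from block["records"]
        return
    # Index node: visit the child to the left of each record, then the record.
    for i, record in enumerate(block["records"]):
        yield from read_in_order(children[i])
        yield record
    # Finally visit the rightmost child.
    yield from read_in_order(children[-1])

# The two-level example: B in the root, A and C in the leaf nodes.
tree = {"records": ["B"],
        "children": [{"records": ["A"]}, {"records": ["C"]}]}
print(list(read_in_order(tree)))  # ['A', 'B', 'C']
```

The walk never compares keys. The order comes entirely from the left and right placement of the blocks.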
All the btree subroutines in bthash start with the letters bt.
All the hash subroutines in bthash start with the letter h, with two exceptions. Ronald Rivest wrote some subroutines to assist in the hashing. His subroutines are called getmd5 and md5c.
All the randomizing subroutines in bthash start with the letters rnd.