| Under Construction This article is under construction |
Data structures and Search Algorithms are the most fundamental building blocks of computer science. Most applications use some form of data structures and retrieval. As a corollary to my search for the optimum data structure for large data storage for certain applications I decided to create a braindump of my notes for future reference
Hash Tables notes
- A hash table stores data in key value pairs, where a hash algorithm computes a hash from the key and this hash represents the index or bucket where the data is stored.
- An eg of a hashing algorithm can be a modulo of a number by the number of hash buckets that exist
- The hashing algorithm can generate collisions ie two keys can compute to the same hash value
- This would result in an identical index location for both those data items
- Typically both the items are then stored at the same logical index location in the form of a linked list
- Typically a good hash algorithm results in fewer collisions for a given data set
- While a hash retrieval is a O(1) operation, collisions can increase this significantly and the worst case scenario can be an O(k) operation where k<n and k=max collisions
- In some cases if there are collisions then the data items at that index location can be stored in the form of a Binary tree or a B+Tree, which combines the advantages of both data structures, making the best case as an O91) operaiton and the worst case as an O(log n) operation
- Hash tables are better than Binary trees as long as the hashing algorithm results in few collisions, since a hash retrieval is an O(1) operation
- They also therefore are extremely efficient when data is stored on disk, since they require minimal number of disk hops
- However if the amount of data is large, a hash algorithm will tend to have many collisions, as a result of which performance degradation is linear unless it a complex structure such as a Hash of Binary trees is used
Binary Tree notes
- This is one of the simplest and most efficient data structures
- Self Balancing Binary trees are those where the height of the tree is adjusted based on certain constraints after each insertion/removal
- Self balancing binary trees result in lesser hops being required to get to a leaf node
- Examples of self-balancing binary trees are Red-black tree, AVL tree, Splay tree, Treap etc
- Retrieval of data from a self-balancing binary tree is an O(log~2~n) operation
- Self Balancing binary trees are the most suitable data structure for in-memory data structures
- If however the data structure is to be stored on disk, a B+Tree maybe more suitable since it requires lesser node hops
B+ Tree
- A B+ Tree is a tree structure where the number of branches per node can be greater than 2
- All data in a B+ Tree is stored only in the leaf nodes
- Since each node has multiple branches, the traversal algorithm at each node is slightly more complex than a simple binary decision
- However since each node can have multiple branches a B+ Tree requires less node hops to find a particular data item
- A B+ Tree can also consume more disk space since many nodes will not be full as a result of which the space allocated for that node will be suboptimally utilized
- If the data is entirely in memory one may want to opt for a Binary Tree as opposed to a B+ Tree
- However when data is stored on disk or some slower media, a B+ Tree will typically perform better than a Binary Tree
- The primary difference is that in a Binary Tree the operation of determining the branch to traverse is a simple operation since each node has only two branches. But that results in an increase the number of hops required to be taken. On the other hand in a B+ Tree the number of hops reduces, but the computation required to determine which branch to traverse is a little more complex. If all the data is in memory, an increase in number of hops may not matter much, but if data is stored in a slower medium such as disk, then each hop requires mechanical operations which are an order of magnitude (read 100x) slower.
Search Algorithms
- Sequential Search
- Binary Search
- Interpolation Search
- Secant Search
Comments (3)
Jul 06, 2008
Rahul Puri says:
I believe this link might be able to add more to this blog post: \\ DataStructur...I believe this link might be able to add more to this blog post:
DataStructure Link
Aug 05, 2008
Anonymous says:
Well the worst case running time for the hash table in case of collision will be...Well the worst case running time for the hash table in case of collision will be O as you
mentioned only in the case when you use linked lists to resolve collisions well if you use
a BST or rather a balanced tree it would be O(log n).
Aug 26, 2008
Anonymous says:
Also it is often neglected that in hashing the key, depending on the hashing alg...Also it is often neglected that in hashing the key, depending on the hashing algorithm, can be as large as O where n is the length of the key.
And a hash table has to hash the key both for storage and for retrieval.