Data structures

From NoskeWiki
Revision as of 13:26, 24 January 2020 by NoskeWiki (talk | contribs) (Binary Tree)
Jump to navigation Jump to search

About

Data structures are used to store and organize data in a computer so that the data can be accessed and/or searched efficiently. Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. Below is a brief account of some of the best known simple data structures in computer science.


Array

An array is a collection of elements (values or variables), with each identified by an array index... and in some cases multiple array indexes. The two main types of arrays are static and dynamic. Static arrays have a fixed length and typically stored in contiguous memory. The only way to accommodate more values is to allocate a new (longer) array and copy in values.

int firstArray[5];                  // C++ static array declaration with 5 not-yet specified values.
int secondArray[] = {5, 10, 15};    // C++ static array declaration with 3 values.
int matrix[3][10];                  // A matrix (2D array) with 3*10 values. matrix[1][2] = 13th element.
char ticTacToeBoard[3][3] = {{'x', 'x', 'o'}, {'o', 'o', 'x'},  {'x', 'o', ' '} };

Arrays can be single or multi-dimensional (read more).


Performance:

  • Fast: indexing, insert/delete at end (O(1))
  • Slow: insert/delete at start (O(n))

Dynamic Arrays

Dyanamic arrays - like "vector" in C++ or "ArrayList" in Java and C# - start at a small size and expand/shrink as needed. The array itself only contains references/pointers to elements, which are separate objects on the memory heap and not necessarily contiguous.

#include <vector>
...
vector <int> dynamicArr;         // C++ vector to store integers.
dynamicArr.push_back(3);         // Adds first element onto the vector.


Bit Vector

A "bit vector" is a array of bits (each bit either 1 or 0) and thus optimized for space allocation: each element occupies only one bit (eight times less than a char/byte in c++). Bit vectors can be very useful to sort and represent arrays of unique numbers as demonstrated here:

  • An array of 5 unique integers between 0 and 15: (0,1,3,6,8,9,15) can be stored in just 16 bits as: (1101001011000001)

In the example above the bit array occupies just 2 bytes, versus 20 bytes for an array of five integers (depending on the compiler/architecture, an integer in c++ is usually 4 bytes). In terms of complexity, a bit vector can sort, readout and store a list in 0(r) time and O(r) space where "r" is the range. The bit vector can be very powerful, but keep in mind that it only works if all values are unique our the range (r) is limited. If we have a 1 thousand integers ranged over 1 billion possible values, our bit vector need over 100 MB (1 billion/8/1024/1024), and thus it would make far more sense to just store an array of 4 byte integers using 4 KB total. Thus bit vectors are also most useful when the values are dense (when n is relatively close to r). In cases where value are not unique, an array of bytes can potentially be used as a "frequency" to show how many of each value exists, but then this assumes there are no more than 256 of any one value.

In C++, the "bitset" container provides a relatively efficient bit vector which is initialized to some size and then bits turned on an off with set() as per this little example:

#include <bitset>
...
bitset<4> mybits;
cout << mybits.set(0,true) << endl;    // 0001  -> NOTE: "bitset" values are accessed from the rightmost value.
cout << mybits.flip(2)     << endl;    // 0101.


Performance:

  • Very fast: insert/delete (O(1))
  • Variable: sort (O(r))


Linked List

A linked list is a group of nodes with each node containing an object and a pointer to the next node. In a doubly-linked list each node also points to the previous node. This structure allows for efficient insertion or removal of elements from any position in the sequence.

In C++ you can include <list> (an STL doubly-linked lists), or use "LinkedList" in C# or LinkedList in Java.

#include <list>
...
list<int> L;
L.push_back(0);              // Insert a new element at the end.
L.push_front(0);             // Insert a new element at the beginning.

The absolute simplest implementation of your own single linked list typically looks like this:

struct Node()
{
  int value;                 // If our value is an int, but could be anything.
  Node *next;
}

Although typically there is some "LinkedList" class which will always keep a pointer to the head node (hopefully the tail node too) and have useful functions such as returning the size of the list (hopefully something it keeps track of as nodes are inserted and deleted).


Performance:

  • Fast: insert/delete at start or end (O(1)) ... although involves search time if in middle
  • Slow: indexing / random access (O(n))


Vector vs. Linked List Using the C++ stl libraries, both the "list" and "vector" are very easy to use. Above you've seen the pros and cons of each structure - vectors are far better at random access while linked lists are far better at arbitrary inserts. Even in cases where you do a lot of inserting however, a vector can often outperform a list (provided it's not 100,000's of elements long). Even though they need to do more "work", small vectors can usually outperform list due to these three important factors:

  1. Fewer memory allocations: Vectors allocate memory for multiple elements at once, which is faster than making allocation each time (eg: a c++ vector may start 10 long then doubles each time)
  2. Better caching: Vectors live in contiguous memory, meaning they are accessed more predictably and are more likely to exist in a lower level of cache.
  3. Less space: Unlike linked lists, vectors store only the values. In a linked list, each node requires 1-2 pointer, so if your data is only a few bytes a linked list may occupy twice as much space.

Stack*

A stack is a last in, first out (LIFO) linear data structure. Items are "pushed" onto the top and also "popped" off the top - so the lower the element the longer it has been on the top. A stack is a restricted data structure, because only a small number of operations are performed on it, for example the C++ "<stack>" class only has "push()", "pop()", "top()" plus "empty()" and "size()".


Queue*

A queue is a first-in-first-out (FIFO) linear data structure where the the first element added to the queue will be the first one to be removed. Basically you "enqueue" objects at the back and "dequeue" at the front - just like people lining up at a bank. The enqueuing and dequeuing is achievable in in O(1) time and often implemented using a linked list. C++ has a "<queue>" class, and also a deque (double ended queue) where elements can be added or removed from either end. In terms of implementation, both a stack and queue can be quite easily coded as either a vector or list. In the case of a queue implemented as a vector, one can reduce insertions to the front to O(1) by using the array like a "buffer" - using two extra variable to keep track of the "front" and "end" and iterating these forwards whenever an element is dequeue or enqueue, respectively. Both the "front" and "end" index should loop at the end (meaning they'll eventually overwrite old values) and if they overtake the vector should be increased in size.


Priority Queue*

A priority queue is an abstract data type which is similar to a queue or stack, except that each element has a priority and these are extracted highest-priority-first. Priority queues usually have a "insert" and "extractmax" function and are well suited for implementation using a "Heap" data structure (which is explained soon), although are sometimes implemented in other ways too.


Tree

A tree is a hierarchical structure of linked nodes - an connected graph where each node has zero or more children nodes and at most one parent node. For best results the tree should be self balancing, allowing all operations (insert, delete and search (assuming it is sorted)) to operate in O(log n) time.

Performance:

  • Medium-Fast: insert/delete, indexing / random access (O(log n))

There are many types of trees, here are just a few:


Binary Tree

A binary tree is a tree data structure where each node has at most two child nodes.

You can implement your own with:

struct Node {
  int value;
  Node *left;
  Node *right;
  // Node *parent;  // Add this for a BST.

  Node(int value) : value(value), left(nullptr), right(nullptr) {}  // Constructor.
  // ~Node() { if(left) delete left; if(right) delete right; }
};


// And if you want to contain it:
struct Tree {
  Node* root;

  // ~Tree() { if (root) delete root; }
  void insertNode(int value) {
    insert(root, value);
  }

  void insert(Node* n, int value) {  // Recursive call.
    if (n == nullptr) {
      n = new Node(value);
      cout << value << " added\n";  // Debug output.
    } else if (value < n->value) {
      insert(n->left, value);
    } else {
      insert(n->right, value);
    }
  }
};

int main() {
  Tree tree;
  tree.insertNode(8);
  tree.insertNode(11);
  tree.insertNode(5);

  return 0;
}

Binary Search Tree (BST)

A binary search tree (BSD), also known as a sorted binary tree is a binary tree where every child node satisfies these conditions:

  • the left subtree contains only nodes with keys less than the node's key.
  • the right subtree contains only nodes with keys greater than the node's key.

This makes it possible to find any particular value in O(log n).


Balanced Binary Tree

A relatively balanced binary tree (one where the leaf nodes are mostly at the same depth) can also be implemented simply by a 1 dimensional array (like a vector). The left child of a node N (where N = index + 1) will be T[N * 2] and the right child will be T[N * 2 + 1].


Red-Black Tree

A red-black tree is a self-balancing binary search tree which assigned each node an extra boolean "black or red" value and inserts and deletes in such a way that the tree is always reasonably balanced. The conditions of the tree are:

  1. A node is either red or black.
  2. All leaves are the same color as the root (usually black) and leaves can be null (in fact most will probably be null).
  3. Both children of every red node are black.
  4. The path from a given node to any of its descendant leaves contains the same number of black nodes.

Understanding how the trees stay balanced is a bit tricky, but something interesting is that the STL set and map are typically implemented as red-black trees. A "set" is a class which only accepts unique "key" values. A "map" also only accepts unique "key" values, but each element also has an associated "mapped" value which may differ. If you want multiple of the same key you could use the multimap container instead.


Trie

A trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Each node might represent a character and there may be many children, each representing the next letter such that traversing from root note->"t"->"e"->"a" gives a value for the word "tea". A trie has some advantage over a BST and hash tables in that it does not need hash function/collision or re-balancing. The main disadvantage is that it can become very large for a large string or key.

struct TrieNode() {
  bool isWord;           // If true, the concatenation of this and all its parents forms a word.
  TrieNode *child[26];   // A non-null node represents a path to that character where (0=a) and (25=z).
};


Heap

A heap is a specialized tree-based data structure that satisfies the heap property: the value of each parent is always greater than each of its children. This implies that an element with the greatest key is always in the root node - such a heap is called a max-heap (as opposed to min-heap where the smallest key is on top). The most efficient heap structures also maintain a shape property: the leaf nodes on the lowest level are as far left as possible, and all other levels are full.

If these two properties are maintained, a heap can be efficiently stored in an array such that element 0 is the root, 0's children are at 1&2, the children of 1 and 2 are 3&4 and 5&6 (respectively) and so on. The two main operations a heap should perform are "insert" and "extractmax()". To achieve an insert the heap simply adds the new element at the end, then does a "siftup"... a process where the new element get swapped up the tree (towards the root) with any parent with a small (<) value. When an element is removed the first element is replaced with the last element and then this root nodes undergoes a "siftdown"... a process where it gets swapped with whichever child has the larger value, so long as the child value is larger (>) than the value we are sifting down. Below is an implementation of a heap in C++:

template<class T>
class Heap
{
private:
  int n;    // current number of element in the heap
  T *d;     // our array of data (could use a vector if you prefer)

public:
  Heap(int maxSize)  { d = new T[maxSize]; n = 0;  }
  int size()         { return n; }
  void insert(T newValue) {
    d[n++] = newValue;
    int i=n-1, p=i/n;         // Where p represents the parent of node i.
    for(; p>=0 && d[i] > d[p]; i=p, p=i/2)  // Perform "siftup".
      swap(d[i], d[p]);
  }
  T extractmax() {
    T maxVal = d[0];
    swap(d[0], d[--n]);
    int i=0, c=1;       // where c is the 1st child of node i
    for(; c<n; i=c, c=2*i+1)          // Perform "siftdown".
    {
      if(c+1<n && d[c+1]>d[c])   // If 2nd child is larger: use this instead.
        c++;
      if(d[i] >= d[c])
        break;
      swap(d[i], d[c]);
    }
    return maxVal;
  }
};

In c++, the <algorithm> library provides "make heap", "pop heap", "push heap" and "sort heap" which can all be applied to a vector.

NOTE: A heap data structure should not be confused with the heap which is a common name for dynamically allocated memory. NOTE: A heap can be used in a heap sort algorithm.


Graph

A graph is an abstract data type consists of nodes and edges - i.e: vertices and pair-wise connections between the nodes. It's often common that there is some weight associated with each edge.

NOTE: Typical higher-level operations associated with graphs are: finding a path between two nodes, like depth-first search and breadth-first search and finding the shortest path from one node to another, like Dijkstra's algorithm - which is typically an O(v2) algorithm.

template<class T>
struct Node() {
  int T;          // Each node usually has a value.
};
struct Edge() {
  Node *a;        // |-- Each edge connects two nodes
  Node *b;        // |   (sometime directed a>b, and sometime both ways).
  float weight;   // ... and often a weight.
};
struct Graph() {
  vector<Node> node;  // |-- And our graph contains an array
  vector<Edge> edge;  // |   of nodes and edges.
};

Undirected Graph

An undirected graph is one in which edges have no orientation. The edge (a, b) is identical to the edge (b, a).

Directed Graph

In a directed graph, each edge/arc may point in one or both direction.



Traversing Trees and Graphs

Depth-first search

Depth first search (DFS) is a traversal algorithms where you start at the root (in the case of a graph selecting some node as the root) and explore as far as possible along each branch before backtracking. DFS is typically used to traverse an entire graph, and takes (O(V+E)), linear in the size of the graph.

DBF is usually done using a stack to track which nodes to search is quite easy to do using recursive function. For example, The code to search all nodes in an unsorted binary tree is:

int dfsCountMatchesInBinaryTree(Node *n, int searchVal)
{
  if(n==NULL)
    return 0;
  
  int isMatch = (n.value == searchVal) ? 1 : 0;
  int sum = isMatch
            + dfsCountMatchesInBinaryTree(n->left,  searchVal)
            + dfsCountMatchesInBinaryTree(n->right, searchVal);
  return(sum);
}

NOTE: If the graph has cycles it becomes necessary to remember which nodes have been visited to avoid an indefinite loop, as demonstrated in the next example. Unfortunately this would also require a separate call to make sure all the "visited" values are reset to false before starting.

Node* dfsOfCyclicGraphToFindAnyMatch(Node *n, int searchVal)
{
  if(n==NULL || n.visited==true)
    return NULL;
  n.visited = true;             // NOTE: if just searing a tree we don't need to worry about this.
  if(n.intVal == searchVal)
    return n;
  else
  {
    for(int c=0; c<n.child.size(); c++)   // Unlike a binary tree there may be more than two children.
    {
      Node *result = dfsOfCyclicGraphToFindAnyMatch(n.child[c], searchVal);
      if(result != NULL)
        return (result);
    }
    return NULL;
  }
}
 
Node *node = depthFirstSearch(tree->rootNode, 42);  // Will return first node with '42' or null if no such node.


Breadth-first search

Breadth-first search (BFS) is a graph search algorithm that begins at the root node and explores all the neighboring nodes. From the standpoint of the algorithm, all child nodes obtained by expanding a node are added to a FIFO queue.

Performance:: O(b^d) worst case with a branching factor b and graph depth d.

BFS and DFS are similar (see here), but one big difference is that DFS is usually done using a stack and recursive function, BFS can only really be done with a FIFO queue and not easily coded with a recursive function.


Hash Table

A hash table is a data structure that uses a special "hash function" to map identifying values, known as keys (eg: names), to their associated values (eg: phone numbers). The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought. The hash function should ideally map each possible key to a unique slot index, but this is rarely achievable in practice. Instead, most hash table designs assume that hash collisions between different keys that map to the same hash value will occur and must be accommodated in some way - often by forming a linked list between entries at each bucket.

Performance:

  • Fast: insert/delete, search (O(1)) if perfect hash function, but more like O(1+n/k) where k is the number of collisions and (O(n)) worst case (everything in one bucket).
  • Slow: iterating.. and two disadvantage: (1) need good hash function and (2) large amount of memory.

In C++ the best option is a <unordered_map> class which looks like this:

#include <unordered_map>

using namespace std;

// Counts the number of sock pairs where each element in
// the input array (ar) is a different sock color.
int countSockPairs(vector<int> ar) {
  unordered_map<int, int> socks;  // Sock frequency as: <color, number>.
  for (int i=0; i<ar.size(); i++) {
     int color = ar[i];
    socks[color]++;
  }
  int num_pairs = 0;
  for (auto sock : socks) {
    num_pairs += int(sock.second / 2);
  }
  return num_pairs;
}

Another options is <hash_map>:

#include <hash_map>
#include <string>
using namespace std;
using namespace stdext;

class Sample {
public :
  string mGenre;
  int mAge;
  Sample(const string genre, int age) : mGenre(genre), mAge(age) {};
  void toString(...);
};

size_t hash_value(const Sample& a) {       // Hash function defined here.
  return hash_value(a.mGenre) ^ hash_value(a.mAge);
}

void main()
{
  hash_map<Sample,int> bookHash;
  bookHash[ Sample("Fantasy",15) ] = 15;
  bookHash[ Sample("Sports", 18) ] = 12;
  
  if(bookHash.find(Sample("Sports", 18)) != bookHash.end())
    cout << bookHash[ Sample("Sports",  18) ];  // prints '15'
  
  for(hash_map<Sample,int>::iterator ii = my_map.start; ii != my_map.end(); ii++)      // To print whole hash map.
    cout << "Entry for key '" << ii->first.toString() << " is " <<ii->second << "\n";
}

In C# there is a "Hashtable" class which allows you to add any types of keys and values using "Hashtable hash = new Hashtable();", but if the types are known the "Dictionary" class is typically more efficient.

using System;
using System.Collections;
class Program {
  static void Main()
  {
    Dictionary<string, int> d = new Dictionary<string, int>();
    d.Add("cat", 1);
    d.Add("dog", 2);
    
    if(d.ContainsKey("cat")) {       // Also has "ContainsValue()" method, but value searches take O(n).
      int value = d["cat"];
      Console.WriteLine(value);
    }
  }
}


Strings

A string is basically array of characters. While these "characters" could be almost anything, they typically represent ASCII or Unicode characters - and thus the word "string" is typically associated with a representation of text strings such as "Hello world!". In C++ there are two types of strings:

  1. the standard "C string" - "char myString[50];" which is terminated with a NULL '\0' character
  2. the "STL string" class - "string myStr;" which expands to any size

A c string is simply an array of chars (char myCString[] = "hello world";), each one byte (ASCII only) and with a NULL '\0' character signalling the end of the string, and is often represented just as a pointer to the first character (char *ptrToString = &myCString). The C library "<string.h>" provides several string manipulation functions such as "strlen", "strcat", "strcpy", "strncmp" and several others. An example of using a c strings follows:

char str1[] = "string";
char str2[200];                // |-- NOTE: people will often use a "const int MAX_STR_SIZE = 1024;" then 
char str3[200];                // |   "char str2[MAX_STR_SIZE];" depending how big they think or know their c string may grow
 
strcpy (str2,"this ");                 // str2 is now "this".
strcat (str2,str1);                    // str2 is now "this string".      
strcat (str2," is concatenated");      // str2 is now "this string is concatenated".      
 
if(strncmp(str1,str2) > 0)           // Returns 0 if string equal, >0 if str1 lower, <0 is str2 higher.
  printf("'%s' is 1st alphabetically", str1);   // will print "'string' is 1st alphabetically".

strcpy (str3,str1);                    // str3 is now "string".
for(int i=0; i<strlen(str3); i++)      // For each character in str3:
  if(str3[i]=='t')                       // replace 't' with 'p'.
    str3[i] = 'p';            
str1[3] = '\0';                        // A this stage str3="spring" and str1="str".

To allow for Unicode characters and more advanced functionality the C++ "<string>" library gives you access to a C++ string. The c++ string class behaves like a vector by keeping track of its length and re-sizing when needed and has about 30 different methods including "length", "resize", "append", "operator+", "substr", "compare", "find_first_of", "replace", "insert" and so on.

string str1 = "string in c++";
string str2 = "this ";
str1[1] = 'p';                      // str1 is now "spring in c++".
str2.append(str1, 0, 6);            // str2 is now "this spring" (appends 6 chars from pos 0).
string str3 = str3.substr(1,3);     // str3 is "his"             (3 chars starting pos 1).
string str4 = atoi(str2.length());  // str4 is "8".
char *cStrPtr = str2.c_str();       // Converts str to a standard c string (which you *may* sometimes have to do).

Due to the ability to re-size dynamically, "c++ string" are generally preferable and easier to manage than c strings, but you may occasionally have to convert from one string type to in places.


Suffix Arrays

A suffix array is an array of integers or points to starting positions within a single string. Pre-computed and sorted suffix arrays can become very useful when trying to quickly find sub-strings in text - similar to what happens when you type a word into a search engine. Let's consider this example text:

"to be or not to be"
 012345678901234567      (18 chars)
 *  *  *  *   *  *       (6  words)

To turn this into a suffix array we first create pointers to characters, and then we typically sort this array as shown below:

       0,3,6,9,13          -> 16,2,9,6,13,0

       BEFORE SORTING:            AFTER SORTING:
  [0 ] to be or not to be    [16] be
  [3 ] be or not to be       [3 ] be or not to be
  [6 ] or not to be          [9 ] not to be
  [9 ] not to be             [6 ] or not to be
  [13] to be                 [13] to be
  [16] be                    [0 ] to be or not to be

Sorting the suffix array can be achieved in O(n log n) and, once sorted, it only takes a O(log n) search to quickly see if any word(s) exists, and immediately where (in a potentially very long string) all the occurrences are located. The word "to" exists at position 13 and 0. Using a suffix array we can also identify the "largest common prefix", which in this case is "to be", in O(n) time. Finding duplicates can be achieved in O(n) time too, and here we can see the words "be" and "to" both occur twice. In this example you'll notice we only index the start of each word, but if we wanted we could place a pointer at every character and still only require "O(n)" space.

  • Performance:
    • Fast: find (O(log n))... and only O(n) space
    • Medium: sort (O(n log n))... but only need to do this once

NOTE: To even more efficiently search for individual word, we could build a hash table (O(1) search time) for each word.

To build a suffix array in C:

#define MAX_STR = 1000000;     // Let us assume our total text won't go over 1 million characters.
char text[MAX_STR];
char *suffixArr[MAX_STR];
...
int n = strlen(text);          // Actual number of characters in our text.
for(int i=0; i<n; i++)         // Build suffix array using every char in text:
  suffixArr[i] = &text[i];
suffixArr[i] = NULL; 
sort(suffixArr, n, pstrcomp);  // Sort suffix array.

Notice here we are interested in the "character" level so we've create one pointer per character, instead of one character per word (where each word is separated by a white-space character), but either case will take up only O(n) order of space where n is our number of characters. While this example was applied to strings, note that you can apply to any other type of array, so long as you keep a sentinel value at the end so that you don't process too far.


Extra Topics

C++ STL Data Structures

Most of the data-structures you'll read about here have been implemented very nicely in C++ Standard Template Library (STL). Almost all of them are dynamic in that they will resize when needed. Almost all of them have a call to "size()" (to report how many elements there are), and many will have extra calls like "reserve()", "capacity()", "resize()", "empty()" and "clear()".

Some of the STL data structures I've mentioned have been:

  • std::vector - a dynamic array of any desired element.
  • std::bitset - a vector of bits.
  • std::queue - a FIFO abstract data type with: "front()", "back()", "push_back()", "pop_front()".
  • std::stack - a FILO abstract data type with: "top()", "push()" and "pop()".
  • Tree-like structures:
  • std::set - each element has a unqie key, and is typically impelemted in a balanaced tree for O(log n) "insert()" and "find()" time.
  • std::map - similar to set, except each each element has a key and mapped value pair. Keys must still be unique, unless a "multimap" is used.

The code below shows the usage of sets and maps:

struct ClassComp {
  bool operator()(string s1, string s2){ return s1.compare(s2) < 0; }
};
void main()
{
  map<string,int,ClassComp> month;  // For days in month with O(log n) search.
  month["jan"] = 31;
  month["feb"] = 28;
  ...
  if(mont["jun"])
    cout << "june -> " << month["jun"] << endl;
  
  set<int> uniqInt;
  int i;
  while(cin >> i)
    uniqInt.insert(i);
  cout << "Read in " << uniqInt.size() << " unique ints";
}


Converting numbers to string

The object oriented languages Java and C# make heavy use of ToString() functions, but in C++ it's a little more annoying. The "<stdlib.h>" standand C library (which you don't have to include) provide a few little functions "atoi", "atol", "atof" to converts integers, float etc to a standard C string, but a more versatile option is to #include "<sstream>" and use the stringstream class via an inline template function:

#include <sstream>
template <class T>
inline std::string toString (const T& t) {
  std::stringstream ss;
  ss << t;
  return ss.str();
}
int    my_num = 40+2;
string my_str = "The answer is: \t";   // Where '\t' is a tab and (although not used) '\n' is newline.
my_str       += toString(my_num);      // Our my_str is now "The answer is:     42".


See Also


Links

...