Design Principles for HBase Key and Rowkey
At its heart, every HBase table is a map built on the simple concept of storing and retrieving key-value pairs. Every individual data value in HBase is therefore indexed by a key. The map is always stored (in RAM or on disk) as a sorted map, and the sort order is the lexicographical ordering of the labels that make up a key. An HBase key is composed of the following parts (labels):
Row Key: This part of the HBase key (represented by a byte[]) comes first and identifies a composite data point in a multi-dimensional data space. It can be viewed as the row identifier (row id) of a data row in a traditional database.
Column Family: This part (represented by a byte[]) follows the rowkey and labels the multi-dimensional data space to which a composite data point belongs. The dimensions in this data space are logically related, and each dimension is termed a column qualifier. An HBase table can have one or more column families, each comprising one or more column qualifiers.
Column Qualifier: This part (represented by a byte[]) follows the column family and labels a single data dimension in the multi-dimensional data space represented by that column family. A column family together with a column qualifier can be viewed as the identifier (column name) of a data column in a traditional database.
Time Stamp: This part (a long integer) follows the column family and column qualifier and records the timestamp of the data value in a key-value pair.
Key Type: This part comes last and labels the type of the key. HBase mostly uses it internally as a tombstone marker to mark the deletion of a key-value pair, a column qualifier, and so on, since HBase does not delete stored key-value pairs in place.
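The sorted-map view described above can be sketched in a few lines of Python. This is only an illustrative model, not HBase code: the CellKey type, the sort_order function and the sample cells are hypothetical, and the key type label is left out for brevity. It also assumes that versions of the same cell are ordered newest first, which the negated timestamp models.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    """Hypothetical stand-in for an HBase key (key type omitted)."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int


def sort_order(key: CellKey):
    # Byte strings compare lexicographically; the timestamp is negated
    # so that newer versions of the same cell sort first.
    return (key.row, key.family, key.qualifier, -key.timestamp)


cells = {
    CellKey(b"user2", b"d", b"name", 100): b"Bob",
    CellKey(b"user1", b"d", b"name", 100): b"Al",
    CellKey(b"user1", b"d", b"name", 200): b"Alice",
    CellKey(b"user1", b"d", b"email", 100): b"a@example.com",
}

# Keys in the order a sorted map would hold them.
ordered = sorted(cells, key=sort_order)
for key in ordered:
    print(key, cells[key])
```

All cells of user1 end up next to each other, and the two versions of user1's name sit side by side with the newer one first.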
Data distribution in an HBase cluster, which is what enables reliable, high-throughput read and write operations, is also based on the HBase key. An HBase region, the unit of data distribution and scalability, is bounded by a ‘start rowkey’ and an ‘end rowkey’. This means that all the data cells identified by the same rowkey live in one HBase region.
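As a rough illustration of how a rowkey maps to a region, the sketch below models a table as a sorted list of region start rowkeys and finds the owning region with a binary search. The split points and the region_for function are made up for the example, and it assumes each region covers the range from its start rowkey (inclusive) up to the next region's start rowkey (exclusive).

```python
import bisect

# Hypothetical start rowkeys of four regions; the first region starts
# at the empty key and the last one is unbounded at the top.
region_start_keys = [b"", b"g", b"n", b"t"]


def region_for(rowkey: bytes) -> int:
    """Return the index of the region whose range contains rowkey."""
    return bisect.bisect_right(region_start_keys, rowkey) - 1


for key in (b"apple", b"g", b"mango", b"zebra"):
    print(key, "-> region", region_for(key))
```

Because the lookup depends only on the rowkey, every cell of a given row resolves to the same region regardless of its column family, qualifier or timestamp.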
Given the role the HBase key plays in data scalability and distribution, here are five design principles to guide you when finalizing the HBase key for a table:
Conciseness: Since every data value in HBase is stored as a key-value pair, choose concise names for the column family and column qualifier. A single byte representing a letter of the English alphabet works well as a column family name, while the number of bytes for a column qualifier name can be chosen according to how many qualifiers there are. The rowkey part of the HBase key should also be concise. Large integer values in the rowkey should be stored as bytes rather than as strings of digits. Do not explicitly put the timestamp in the rowkey, since a timestamp is already part of the HBase key. Do not put large string values in the rowkey if they can be uniquely mapped to small ones by a suitable mapper function. When the set of large string values is limited, a lookup table can be used at run time to recover the original strings from the small ones.
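To make the advice about integers concrete, the following sketch compares a large identifier stored as a string of digits with the same value stored as a fixed-width 8-byte big-endian integer. The user_id value and variable names are hypothetical. As a side effect, and assuming non-negative values, the fixed-width encoding also sorts in numeric order, which the string form does not.

```python
import struct

user_id = 1234567890123456789  # hypothetical large identifier

as_string = str(user_id).encode("ascii")  # one byte per decimal digit
as_binary = struct.pack(">q", user_id)    # fixed 8 bytes, big-endian

print(len(as_string), "bytes as a string of digits")
print(len(as_binary), "bytes as a binary long")

# Lexicographic ordering of the two encodings for small sample values.
ids = [9, 10, 100]
string_order = sorted(ids, key=lambda n: str(n).encode("ascii"))
binary_order = sorted(ids, key=lambda n: struct.pack(">q", n))
print("string order:", string_order)
print("binary order:", binary_order)
```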
Uniqueness: Take care that the rowkey part of the HBase key uniquely identifies each of your data sets. If you design a non-unique rowkey, you will inadvertently write different data sets against the same rowkey. These are stored as multiple versions if the table allows multiple versions and each write differs in the HBase key timestamp. If only a single version is allowed, however, a non-unique rowkey causes different data sets to overwrite one another under the same rowkey. Uniqueness can be achieved by identifying a set of columns (or data dimensions) in your data that uniquely identifies each data set. There may be many such sets; the best one is concise and gives the user fast access to the data. Many users apply a variety of hash functions in rowkey design to distribute data uniformly. However, these functions are not one-to-one, so relying on them alone could compromise the uniqueness of the rowkey.
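The risk of relying on a hash alone can be seen with a deliberately small example. The sketch below, with made-up device identifiers and hypothetical function names, builds one set of rowkeys from a single byte of an MD5 digest and another from the same byte followed by the original identifier. A real design would use more hash bytes, but the same reasoning applies: the hash by itself does not guarantee uniqueness, while keeping the identifying fields in the key does.

```python
import hashlib


def hash_only_key(device_id: str) -> bytes:
    """Rowkey made of one hash byte only (collisions are possible)."""
    return hashlib.md5(device_id.encode("utf-8")).digest()[:1]


def hash_plus_id_key(device_id: str) -> bytes:
    """Rowkey made of the hash byte followed by the unique identifier."""
    return hash_only_key(device_id) + device_id.encode("utf-8")


device_ids = [f"device-{i}" for i in range(1000)]

hash_only = {hash_only_key(d) for d in device_ids}
composite = {hash_plus_id_key(d) for d in device_ids}

# One byte has at most 256 values, so 1000 devices must collide.
print(len(hash_only), "distinct hash-only rowkeys")
print(len(composite), "distinct composite rowkeys")
```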
Data Distribution: As stated earlier, data is distributed across an HBase cluster on the basis of the rowkey, so it is very important that your rowkey design allows data to spread across the cluster. Without uniform distribution, scalability and performance are compromised, since only a few nodes in the cluster hold the data. A badly designed rowkey causes read/write operations to concentrate on only a few regions of the table (hosted on a few nodes), which is widely known as ‘hot spotting’. If sequential writes produce monotonically increasing rowkeys, the result is write hot spotting, because monotonically increasing rowkeys target one region at a time. If a large data set is read from only a few regions, the result is read hot spotting. Hot spotting causes massive degradation in read/write throughput and in the parallelism of any Hadoop job operating on the table. To avoid hot spotting, one can choose:
Hashing: The rowkey can be prefixed with some bytes of a hash computed over information in the rowkey that changes with each row. Hash functions such as MD5 work well here because their output has good entropy. Using hashed bytes as a prefix ensures that, even when the underlying rowkey information increases monotonically, the prefixed rowkeys land in different regions of the table, thereby avoiding hot spotting. Further, since hashing is deterministic, it does not degrade HBase GETs, which look up data by a given rowkey. A range scan over monotonically increasing rowkey information may degrade on a single thread, since the data is now spread out according to the hashed prefix; however, multi-threaded access can be used to read the spread-out data more efficiently.
Salting: The rowkey can be prefixed with some random bytes called a salt. With salt bytes as a prefix, monotonically increasing rowkey information also lands in different HBase regions, thereby avoiding hot spotting. However, since salt bytes are not deterministic, HBase GET performance can degrade, because every possible salt prefix (and hence potentially every region) has to be searched for the rowkey specified in a GET. A scan over a range of monotonically increasing rowkey information is affected by the salt prefix in the same way; however, multiple scans, one per salt prefix, can be launched in parallel, which can make a large range scan efficient despite the salt.
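The difference between the two options can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: NUM_BUCKETS, hashed_key, salted_key and candidate_keys_for_salted_get are hypothetical names, the rowkey body is a sequence number packed as 8 bytes, and the prefix is reduced to one of four buckets with a modulo so that the effect is easy to see.

```python
import hashlib
import random
import struct

NUM_BUCKETS = 4  # hypothetical number of prefix values


def hashed_key(seq: int) -> bytes:
    """Prefix derived from a hash of the key body: deterministic."""
    body = struct.pack(">q", seq)
    prefix = hashlib.md5(body).digest()[0] % NUM_BUCKETS
    return bytes([prefix]) + body


def salted_key(seq: int, rng: random.Random) -> bytes:
    """Prefix drawn at random: cannot be recomputed from the body."""
    body = struct.pack(">q", seq)
    return bytes([rng.randrange(NUM_BUCKETS)]) + body


def candidate_keys_for_salted_get(seq: int) -> list:
    """A GET on a salted table has to try every possible prefix."""
    body = struct.pack(">q", seq)
    return [bytes([b]) + body for b in range(NUM_BUCKETS)]


# Monotonically increasing sequence numbers spread over several prefixes.
prefixes_used = {hashed_key(i)[0] for i in range(1000)}
print("hash prefixes used:", sorted(prefixes_used))

# Hashing: the reader recomputes the exact rowkey for a single GET.
print(hashed_key(42) == hashed_key(42))

# Salting: the reader does not know which prefix the writer drew.
written = salted_key(42, random.Random())
print(written in candidate_keys_for_salted_get(42))
```

With hashing, one lookup is enough because the prefix is a function of the key body. With salting, the reader falls back to one lookup per possible prefix, which is the GET cost described above.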
Accessibility: The rowkey should be designed so that data can be accessed efficiently in the way users most often need it. Access is most efficient when a rowkey prefix is supplied in a search query, since this limits the number of HBase regions (and the number of data storage files inside a region) that must be scanned to produce the output. If the stored information is always searched by one particular type of data, that value should come first in the rowkey. This value, common to all search queries, can then be followed by other data types in order of how frequently they appear across all, or the most important, searches.
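The benefit of leading the rowkey with the most frequently searched value can be sketched with a sorted list standing in for the table. The rowkey layout (customer id, then a '#' separator, then a date) and the prefix_scan function are hypothetical, and the stop key is computed by incrementing the last byte of the prefix, which assumes a non-empty prefix whose last byte is below 0xFF.

```python
import bisect

# Hypothetical rowkeys: customer id first, then the order date.
rows = sorted([
    b"cust001#2024-01-05",
    b"cust001#2024-02-11",
    b"cust002#2024-01-07",
    b"cust003#2024-01-02",
])


def prefix_scan(sorted_rows, prefix: bytes):
    """Return all rowkeys that start with prefix using two binary searches."""
    start = bisect.bisect_left(sorted_rows, prefix)
    # Smallest key greater than every key carrying the prefix
    # (assumes a non-empty prefix whose last byte is below 0xFF).
    stop_key = prefix[:-1] + bytes([prefix[-1] + 1])
    stop = bisect.bisect_left(sorted_rows, stop_key)
    return sorted_rows[start:stop]


print(prefix_scan(rows, b"cust001#"))

# A query on the date alone cannot use the sort order and has to
# inspect every row.
print([r for r in rows if r.endswith(b"#2024-01-07")])
```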
Tall vs Wide Tables: Rowkey design is also affected by the table design the user adopts. In a tall table design, the rowkey contains more data values than the rowkey in a wide table design. More data in the rowkey leads to a larger number of rowkeys, which in turn spreads the data more widely across the HBase cluster, since HBase regions (the unit of distribution) are divided along rowkey boundaries. Less data in the rowkey, on the other hand, means more data is stored in columns, which leads to comparatively less data distribution but ensures atomicity across a wider data set. This is because atomicity in HBase is guaranteed at the rowkey level.
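The trade-off can be illustrated by laying out the same data both ways. In this sketch, nested dictionaries stand in for tables, and the event data, the 'd' column family and the function names are all hypothetical.

```python
# Hypothetical events: (user, timestamp, action).
events = [("alice", 1, "login"), ("alice", 2, "click"), ("bob", 1, "login")]


def tall_table(events):
    """Tall design: the timestamp is part of the rowkey, one cell per row."""
    table = {}
    for user, ts, action in events:
        rowkey = f"{user}#{ts:010d}".encode("ascii")
        table[rowkey] = {b"d:action": action.encode("ascii")}
    return table


def wide_table(events):
    """Wide design: one row per user, one column per timestamp."""
    table = {}
    for user, ts, action in events:
        row = table.setdefault(user.encode("ascii"), {})
        row[f"d:{ts:010d}".encode("ascii")] = action.encode("ascii")
    return table


tall = tall_table(events)
wide = wide_table(events)
print(len(tall), "rows in the tall table")
print(len(wide), "rows in the wide table")
print(len(wide[b"alice"]), "columns in alice's wide row")
```

The tall layout yields more rowkeys, and therefore more possible split points between regions, while the wide layout keeps all of a user's events in a single row that can be updated atomically.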