Why HBase? 10 Key Benefits

1. Open Source, Scalable, Distributed and Resilient: HBase is open source software, built on open source components and supported by an open source community. Because it uses HDFS as its storage layer, the stored data inherits the replication-based resilience and reliability of HDFS. Like HDFS, HBase is distributed and highly scalable, and it is designed to keep data accessible when individual nodes fail. It is used to store and access massive data sets. It also coexists with an existing Hadoop deployment and integrates with existing Hadoop tools.

2. Random Access on the top of HDFS: The Hadoop Distributed File System (HDFS) is designed primarily for large sequential reads and append-only writes; it does not support in-place updates or efficient random access to individual records. HBase, although it uses HDFS as its storage layer, gives the user random read/write access. This is a huge advantage: along with the benefits of HDFS, an HBase user can efficiently read and write individual records at random. Random read/write access patterns are common when designing modules over the Hadoop ecosystem. The HBase architecture is designed around the sequential access strengths of HDFS: writes are buffered in memory and flushed to HDFS as sorted, immutable files, which lets HBase expose random access to the end user.
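To make the idea concrete, here is a minimal sketch of how random reads and writes can sit on top of storage that is only ever written sequentially. It is a toy model under stated assumptions, not HBase code: the class name `TinyLsmStore`, its methods and the `flush_threshold` parameter are hypothetical, and plain Python lists stand in for immutable files on HDFS.

```python
from bisect import bisect_left


class TinyLsmStore:
    """Toy model (hypothetical, not the HBase API): random put/get
    on top of storage that is only written sequentially."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}   # in-memory buffer for recent writes
        self.files = []      # immutable sorted runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, rowkey, value):
        # A random write only touches memory.
        self.memstore[rowkey] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write of a sorted, immutable run
        # (the list stands in for a file on HDFS).
        if self.memstore:
            self.files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, rowkey):
        # Newest data wins: check memory first, then runs newest to oldest.
        if rowkey in self.memstore:
            return self.memstore[rowkey]
        for run in reversed(self.files):
            i = bisect_left(run, (rowkey,))  # binary search in a sorted run
            if i < len(run) and run[i][0] == rowkey:
                return run[i][1]
        return None


store = TinyLsmStore(flush_threshold=2)
store.put("row3", "c")
store.put("row1", "a")    # threshold reached: buffer is flushed as one sorted run
store.put("row1", "a2")   # the "update" is a new write; the old run is never modified
print(store.get("row1"), store.get("row3"), store.get("row9"))
```

Existing files are never rewritten in this sketch. A real system would also merge old runs periodically and keep a write-ahead log for durability; both are omitted here.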

3. Scalable Efficient HashMap: The HBase design mimics a giant, scalable, distributed and efficient HashMap, a data structure familiar to most developers. As in a HashMap, the key-value concept is central to HBase: the key is called the ‘rowkey’, and a set of data is stored against it. Looking up data by rowkey is very efficient in HBase, much as it is in a HashMap. More precisely, HBase behaves like a sorted map: rows are kept ordered by rowkey, so scans over a contiguous range of rowkeys are efficient as well. Most random access operations look up a stored data set using information that already forms part of that data set, which is why a map keyed on that information suits random access patterns. HBase adopts the same idea in its design philosophy.
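The map analogy can be sketched in a few lines. The example below is an illustration only: `SortedRowMap` and its methods are hypothetical names, and the rowkeys and column names are made up. It keeps rowkeys in sorted order, so it supports both point lookups and range scans, where the start key is inclusive and the stop key exclusive.

```python
from bisect import bisect_left, insort


class SortedRowMap:
    """Toy single-node model (hypothetical, not the HBase API) of a map
    from rowkey to a set of columns, with rowkeys kept in sorted order."""

    def __init__(self):
        self._keys = []   # rowkeys in sorted order
        self._rows = {}   # rowkey -> {column: value}

    def put(self, rowkey, column, value):
        if rowkey not in self._rows:
            insort(self._keys, rowkey)   # keep the key list sorted
            self._rows[rowkey] = {}
        self._rows[rowkey][column] = value

    def get(self, rowkey):
        # Point lookup by rowkey, like HashMap.get
        return self._rows.get(rowkey)

    def scan(self, start, stop):
        # Range scan: start <= rowkey < stop, in rowkey order
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            yield key, self._rows[key]
            i += 1


table = SortedRowMap()
table.put("user#002", "info:name", "Bo")
table.put("video#001", "info:title", "Intro")
table.put("user#001", "info:name", "Al")

print(table.get("user#001"))
# All rows whose key starts with "user#" ('$' is the character after '#')
print([key for key, _ in table.scan("user#", "user$")])
```

The range scan is the reason rowkey design matters: rows that should be read together are best given rowkeys that sort next to each other.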

4. Inherent Time Dimension: Unlike most popular databases, HBase is inherently designed to index stored data by time. This lets the user efficiently store and retrieve multiple snapshots of a data cell, identified by a unique rowkey and a column, taken at different times. This is very useful for storing and retrieving time series data and events, which are very common in the industry. The timestamp can be assigned by default or set manually; by default it is the number of milliseconds elapsed since midnight, Jan 1, 1970 UTC. Each timestamped value is called one version of the data, and users can specify the maximum number of versions to keep in an HBase table. The HBase client APIs also provide many options for retrieving the data that falls within a particular time window.
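The versioning behaviour described above can be sketched as follows. This is a toy model, not the HBase client API: `VersionedCell`, `max_versions`, `time_range` and `limit` are hypothetical names chosen for illustration, and the sketch assumes `max_versions` is at least 1 and that the time window is inclusive at the lower bound and exclusive at the upper bound.

```python
import time


class VersionedCell:
    """Toy model (hypothetical, not the HBase API) of one cell,
    i.e. one (rowkey, column) pair, holding several timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions   # assumed to be >= 1
        self._versions = {}                # timestamp in ms -> value

    def put(self, value, ts=None):
        if ts is None:
            # Default timestamp: milliseconds since Jan 1, 1970 UTC
            ts = int(time.time() * 1000)
        self._versions[ts] = value
        # Drop the oldest versions beyond the configured maximum
        for old_ts in sorted(self._versions)[:-self.max_versions]:
            del self._versions[old_ts]

    def get(self, time_range=None, limit=1):
        # Return up to `limit` (timestamp, value) pairs, newest first.
        items = sorted(self._versions.items(), reverse=True)
        if time_range is not None:
            lo, hi = time_range            # lo inclusive, hi exclusive
            items = [(ts, v) for ts, v in items if lo <= ts < hi]
        return items[:limit]


cell = VersionedCell(max_versions=2)
cell.put("a", ts=100)
cell.put("b", ts=200)
cell.put("c", ts=300)                          # version at ts=100 is dropped
print(cell.get())                              # newest version only
print(cell.get(limit=10))                      # all retained versions
print(cell.get(time_range=(0, 300), limit=10)) # versions inside a time window
```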

5. Flexible Schema: HBase takes a very flexible approach to defining the shape and interpretation of stored data, which makes it nearly schemaless. For a data set indexed by a rowkey (a row in a traditional database), the data dimensions (columns in a traditional database) can be added or dropped dynamically, on the fly, within the column families declared for the table. There is no need to run special alter queries on HBase tables to change the data dimensions. Further, HBase does not require any pre-provisioned metadata to interpret the data types of the columns: values are stored as uninterpreted bytes, and the user is free to decide the data type associated with a column (data dimension) in an HBase table.
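A short sketch may help show what "the user interprets the data type" means in practice. It is an illustration only: a plain Python dict stands in for one row, the column names are made up, and the byte encodings (UTF-8 text, big-endian integer and double) are assumptions chosen for the example rather than anything the store enforces.

```python
import struct

# One row modelled as column name -> raw bytes. The store keeps only bytes;
# the writer and the reader must agree on how to interpret them.
row = {}

row[b"info:name"] = "Ada".encode("utf-8")
row[b"info:age"] = struct.pack(">i", 36)      # 4-byte big-endian integer

# A new column is added later simply by writing to it; no alter step is needed.
row[b"info:score"] = struct.pack(">d", 9.5)   # 8-byte big-endian double

# The reader applies the type it expects for each column.
name = row[b"info:name"].decode("utf-8")
age = struct.unpack(">i", row[b"info:age"])[0]
score = struct.unpack(">d", row[b"info:score"])[0]

print(name, age, score)
```

The flexibility has a cost: nothing stops a reader from decoding a column with the wrong type, so the encoding convention has to be maintained by the application.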

6. No Null Values storage: HBase does not explicitly store null values for the data dimensions (columns) of a data set indexed by a rowkey (a row). It is therefore well suited to tables in which values for certain columns are optional. This saves a lot of space in tables where the data is highly sparse.
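The space saving can be illustrated by counting what each layout has to store. The sketch below uses made-up sample records and column names; the dense layout reserves a slot for every column of every row, while the sparse layout stores one entry per cell that actually has a value.

```python
# Hypothetical sample data: most columns are optional.
columns = ["info:name", "info:email", "info:phone", "info:fax"]
records = [
    {"rowkey": "u1", "info:name": "Ada", "info:email": "ada@example.com"},
    {"rowkey": "u2", "info:name": "Bo"},
    {"rowkey": "u3", "info:name": "Cy", "info:fax": "555-0100"},
]

# Dense layout: every row has a slot for every column, None where missing.
dense = [[rec.get(col) for col in columns] for rec in records]

# Sparse layout: one entry per (rowkey, column) that actually has a value.
sparse = {
    (rec["rowkey"], col): value
    for rec in records
    for col, value in rec.items()
    if col != "rowkey"
}

dense_slots = sum(len(r) for r in dense)
sparse_cells = len(sparse)
print(dense_slots, sparse_cells)   # 12 slots versus 5 stored cells
```

A missing cell in the sparse layout is simply absent, so reading it is a lookup that finds nothing rather than a stored null.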

7. Inherent Support for Hadoop computing frameworks: Being built over the Hadoop Distributed File System, HBase lends itself naturally to Hadoop parallel computing frameworks such as MapReduce and Apache Spark. Spark and MapReduce programs can easily read data from HBase tables in parallel for parallel processing.

8. Comprehensive Java and REST APIs: HBase offers comprehensive Java and REST APIs to store and access data, create and alter HBase tables, and monitor and perform HBase cluster operations. Because these APIs are detailed and comprehensive, users can build custom libraries and applications on top of them according to their needs.

9. Comprehensive Filtering along with Filter push down: Like most other databases, HBase supports a comprehensive filtering framework for selecting relevant data. It also supports filter push down: filters are evaluated on the distributed regions that manage the data of interest. These capabilities greatly reduce the amount of data travelling over the network to the HBase client. With them, HBase clients can work only with the data of interest without spending time on explicit client-side filtering.
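The effect of push down on network traffic can be sketched by counting the rows that leave the regions in each approach. This is a toy model: the function names, the sample rows and the `temp` field are hypothetical, and lists of dicts stand in for regions hosted on different servers.

```python
def client_side_filter(regions, predicate):
    # Every row is shipped to the client, which then filters locally.
    shipped = [row for region in regions for row in region]
    return [row for row in shipped if predicate(row)], len(shipped)


def pushed_down_filter(regions, predicate):
    # Each region applies the predicate itself and ships only the matches.
    shipped = [row for region in regions for row in region if predicate(row)]
    return shipped, len(shipped)


# Hypothetical data spread over two regions
regions = [
    [{"rowkey": "r1", "temp": 20}, {"rowkey": "r2", "temp": 35}],
    [{"rowkey": "r3", "temp": 40}, {"rowkey": "r4", "temp": 10}],
]


def is_hot(row):
    return row["temp"] > 30


result_a, shipped_a = client_side_filter(regions, is_hot)
result_b, shipped_b = pushed_down_filter(regions, is_hot)

print(result_a == result_b)    # same answer either way
print(shipped_a, shipped_b)    # 4 rows shipped versus 2
```

The more selective the filter, the larger the difference between the two counts.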

10. Support of Co-processors: HBase also supports a co-processor framework that allows custom code to run on the distributed regions managing the data of interest. The custom code can be run on demand by the HBase client through a set of dedicated APIs, or it can be triggered on the regions whenever the HBase client invokes the existing data access APIs. This custom code framework can be used to implement default filtering scenarios, dependency injection or the computation of aggregated values.
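The two ways of running custom code, on demand and on existing data access calls, can be sketched with a toy region class. HBase calls these endpoint and observer co-processors respectively; everything else here (`Region`, `run_endpoint`, `pre_put_hooks`, `distributed_sum`, `clamp_negative`) is a hypothetical name used only for illustration and does not reflect the real co-processor interfaces.

```python
class Region:
    """Toy model (hypothetical, not the HBase API) of one region
    holding rowkey -> numeric value."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.pre_put_hooks = []   # observer-style code, run on every put

    def put(self, rowkey, value):
        # Hooks run inside the region before the write is applied.
        for hook in self.pre_put_hooks:
            value = hook(rowkey, value)
        self.rows[rowkey] = value

    def run_endpoint(self, fn):
        # Endpoint-style code: invoked on demand, runs next to the data.
        return fn(self.rows)


def distributed_sum(regions):
    # Each region returns one partial sum; the client only combines them.
    partials = [r.run_endpoint(lambda rows: sum(rows.values())) for r in regions]
    return sum(partials)


def clamp_negative(rowkey, value):
    # Example hook: replace negative values with zero before storing.
    return max(value, 0)


r1 = Region({"a": 1, "b": 2})
r2 = Region({"c": 10})
r2.pre_put_hooks.append(clamp_negative)
r2.put("d", -5)                     # stored as 0 by the hook

print(r2.rows["d"])
print(distributed_sum([r1, r2]))    # 1 + 2 + 10 + 0
```

In the aggregation case only one number per region travels back to the client, instead of every row.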

Big Data Architect, Apache Spark Specialist, https://www.linkedin.com/in/ajaywlan/