Category Archives: Hbase

Infinispan

Infinispan is an in-memory,distributed,key/vale (NoSql) Cache written for JVM.Infinispan is a clusterd cache enabling to scale horizontally by adding multiple node’s/computing power
You can configure various parameters in infinispan like
1) When to persist the data in back-end
2) When to purge the expired Entries
3) How many entries to keep in memory
Infinispan Ships with support for many backends like for
Cassandra
JDBC
SingleFileStore
HBase
LevelDbStore
I have integrated Infinispan 5.0.x with Hbase 0.94.12 and posting the code here


public class HbaseTest {

public static ConfigurationBuilder getConfigurationBuilder() throws IOException{
ConfigurationBuilder b = new ConfigurationBuilder();
File file = new File("/root/Downloads/infinispan-quickstart-master/embedded-cache/src/main/resources/hbase.properties");
FileInputStream fileInput = new FileInputStream(file);
Properties properties = new Properties();
properties.load(fileInput);
b.versioning().disable();
b.loaders()
.addStore(HBaseCacheStoreConfigurationBuilder.class).hbaseZookeeperQuorumHost("192.xx.xx.xx").
hbaseZookeeperClientPort(2181).purgeOnStartup(false).purgerThreads(10)
.async().flushLockTimeout(10000l).threadPoolSize(10).withProperties(properties)
.autoCreateTable(false).entryTable("InfinispanTest").entryColumnFamily("sd").
expirationTable("InfinispanExpired").expirationColumnFamily("sd");//defining HBase specific configurations
b.loaders().passivation(false);
b.eviction().maxEntries(10000000);

b.expiration().wakeUpInterval(100, TimeUnit.SECONDS).reaperEnabled(false);
b.jmxStatistics().enabled(true);
return b;
}

public static void main(String args[]) throws IOException, InterruptedException, CacheLoaderException{

Scanner scn=new Scanner(System.in);
Configuration ca=getConfigurationBuilder().build();
DefaultCacheManager cm=new DefaultCacheManager(ca);

Cache c=cm.getCache("example")// gets a new cache instance with name "example"

.getAdvancedCache().withFlags(Flag.IGNORE_RETURN_VALUES);
long beforeTime=System.currentTimeMillis();

for(int i=0;i<100;i++){ c.put("e_"+i,100+""); } System.out.println("Total time "+(System.currentTimeMillis()-beforeTime)); c.stop(); } }

HBase and Sharding

The basic unit of Sharding(scalability and load balancing) is region,each region consist of a sequence of row keys, the regions are created and merged depending upon the configurable parameter

The table is initially created with a single region since the Hbase itself can not know in advance where to create the split points when the size of region grows more than what is configured it is auto-split  into different regions from middle of the row key.(Is is again Configurable)

Whenever data in the HBase table is updated it is first written in the WAL(Write Ahead Log)File and stored in memstore(memory store) once the data in memory exceeds the limit the data is flushed into the HFile on disk,As the store file increases the region server compacts them into combined larger files.As the store file accumulates or after the compaction finishes a Region-Split request is queued to decide if Splitting the Region is required ,Since the HFiles are immutable the newly created region if created will not rewrite the data Instead, they will create  small sym-link like files, named Reference files, which point to either top or bottom part of the parent store file according to the split point.

Thus Splitting is very fast as the newly created region will read from the original storage files until a compaction rewrites them all into separate one asynchronously

For further Reading I will suggest to go through the below link

http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/

Horizontal v/s Vertical Scaling

Horizontal Scaling means you are adding more nodes(m/c ‘s )to your cluster in order to support more load and increase performance while Vertical Scaling is increasing power of a single m/c in order to increase performance or to maintain it with increasing load by increasing (RAM,CPU)

In DB world Horizontal scaling refers to Data Sharding where the data is split into small partitions with each partition(Shard) lying in different nodes in a cluster whereas in  Vertical scaling data resides on a single node and scaling is achieved by multiple cores

Hbase v/s cloumn-oriented Databases

HBase is  not a a cloumn-oriented database in a typical RDBMS sense,but utilizes an on-disk column storage format
HBase excels at providing a key-based access to a specific cell of data,or sequential range of cells while the column-oriented databases excel at providing real time analytical access to data

Storage Models

Column-Oriented

  • Data in columnar model is kept in column
  • Data in columns are almost homogeneous thus fits for compression
  • Aggregation functions are very fast since entire column can be fetched very quickly
  • Inserts,updates are significantly slower and because of this reason the column-oriented databases suits best for OLAP scenario

Key-Value

  • Data is stored in a set of distributed maps
  • No bias towards aggregate or row-based processing performance and therefore no bias towards either OLAP or OLTP applicability.

A HBase  table can be represented using A Data Structure as

'rowkey1' => {
    'c:col1' => 'value1',
    'c:col2' => 'value2',
},
'rowkey2' => {
    'c:col1' => 'value10',
    'c:col3' => 'value3'
}

What is Database Sharding?

Sharding is the partitioning mechanism  of dividing a very large DB into small,faster, manageable parts called Shards  such that all these shards are independent of each other and shares nothing and thus can be distributed across different servers while enjoying all the benefits of horizontal scaling .

Sharding is just another name for “horizontal partitioning” of a database
Horizontal partitioning is a design principle whereby rows of a database table are held separately, rather than splitting by columns.Where each partition consist of some number of rows and forms a part of shard.

With time as the DB grows the time taken to query it increases exponentially sharding helps in scaling the DB horizontally to achieve the performance benefits.