In this post I will share the steps and design principles I followed to handle time-series data in HBase.
Real Problem
HBase keeps row keys in lexicographically sorted order, so when we insert time-series data with the timestamp as the key, all the new keys fall into the same region, and a region is served by only one Region Server. That server becomes a hot spot while the rest of the cluster sits idle, so we are not using the full capacity of the cluster (no load distribution takes place). A region split does not help for long either, because newer keys always sort after older ones and keep landing in whichever region holds the latest key range.
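To make the problem concrete, here is a small Python sketch of how sorted keys map to regions. The region boundaries in REGION_START_KEYS are made up for illustration, and plain strings stand in for HBase's byte-array row keys; this is only a model, not how a real cluster is queried.

```python
from bisect import bisect_right

# Hypothetical region boundaries: each entry is the first row key of a region.
# Real HBase regions use byte-array start keys; strings stand in for them here.
REGION_START_KEYS = ["", "1436000000", "1436200000", "1436400000"]

def region_for(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1

# One minute of per-second timestamps used directly as row keys.
keys = [str(ts) for ts in range(1436261508, 1436261568)]
regions_hit = {region_for(k) for k in keys}
print(regions_hit)  # {2}: every write goes to the same region
```

All 60 keys resolve to one region index, which is the hot spot described above.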
To avoid this problem I made use of the design principles of OpenTSDB.
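Let's create 5 buckets, so that every key falls into one of them. The bucket number is the remainder of the original key divided by the number of buckets, and it goes in front of the original key as a prefix. The + below is concatenation, not addition: adding the remainder to the key would leave the keys in almost the same sorted order, and they would still land in one region.
NewKey = (OriginalKey % Buckets) + OriginalKey
For example, say we want to store two data points from 7-July, around 3 pm, one second apart:
1436261508
1436261509
Now let's use our formula
Original Key = 1436261508
Buckets = 5
NewKey = (1436261508 % 5) + 1436261508 = 3 + 1436261508 = 31436261508
When Original Key = 1436261509
NewKey = (1436261509 % 5) + 1436261509 = 4 + 1436261509 = 41436261509
The two consecutive timestamps now start with different prefixes, so they sort far apart from each other.
When a user asks for data from T1 to T2, the rows in that range are spread over all the buckets, so a single scan is not enough. We run one scan per bucket b (from 0 to Buckets - 1) and merge the results:
start row = b + T1
stop row = b + T2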
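Here is a minimal Python sketch of the bucket idea. It assumes the bucket number is placed in front of the key as a string prefix (not added numerically), that all timestamps have the same number of digits, and that a range read issues one scan per bucket. The function names salted_key and scan_ranges are my own for illustration and are not part of HBase or OpenTSDB.

```python
NUM_BUCKETS = 5  # same bucket count as in the example

def salted_key(ts: int, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the timestamp with its bucket number (ts % buckets)."""
    return f"{ts % buckets}{ts}"

def scan_ranges(t1: int, t2: int, buckets: int = NUM_BUCKETS):
    """Return one (start_row, stop_row) pair per bucket covering t1..t2 inclusive.

    The stop row is built from t2 + 1 because HBase scans treat the
    stop row as exclusive by default.
    """
    return [(f"{b}{t1}", f"{b}{t2 + 1}") for b in range(buckets)]

ranges = scan_ranges(1436261508, 1436261509)
for start_row, stop_row in ranges:
    print(start_row, stop_row)
```

Each pair would become one scan against the table, and the client merges the five result sets to get the rows back in time order.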
This way, even while storing time-series data, the keys never all end up on a single Region/Region Server.
But we still need to be able to get our original key back, so to preserve it we keep the original key in the row key as a suffix, after some separator.
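A rough sketch of building such a key and reading the original key back out of it, in Python. The "_" separator and the names encode and decode are assumptions for illustration; any separator that cannot appear in the key parts would do.

```python
from collections import Counter

NUM_BUCKETS = 5
SEPARATOR = "_"  # hypothetical choice of separator

def encode(ts: int) -> str:
    """Build the row key: bucket, separator, then the original key as suffix."""
    return f"{ts % NUM_BUCKETS}{SEPARATOR}{ts}"

def decode(row_key: str) -> int:
    """Recover the original key from the suffix of the row key."""
    _bucket, original = row_key.split(SEPARATOR, 1)
    return int(original)

# 60 consecutive seconds of data, counted per bucket prefix.
keys = [encode(ts) for ts in range(1436261508, 1436261568)]
per_bucket = Counter(k.split(SEPARATOR, 1)[0] for k in keys)
print(per_bucket)
```

With consecutive timestamps the rows spread evenly, 12 per bucket here, instead of piling up at the end of the key space.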
This solves the hot Region Server problem, but it leads to small rows (rows that are not very wide).
To deal with this and get wide rows, we can round the key to minute granularity and then store the timestamp in seconds as the column qualifier.
Minute (row key) -> Sec:Header (column qualifier) -> Value
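The layout above could look roughly like this in Python. This is a sketch under a few assumptions: the row key is the timestamp truncated to the start of its minute, the qualifier holds the second within that minute followed by the header, and "temp" is a made-up header name. The nested dict only models rows and columns; it is not an HBase client call.

```python
from collections import defaultdict

def to_cell(ts: int, header: str, value):
    """Map one data point to (row_key, column_qualifier, value)."""
    second = ts % 60       # second within the minute
    minute = ts - second   # timestamp truncated to the start of the minute
    return str(minute), f"{second}:{header}", value

# Model of a table: row key -> {column qualifier -> value}.
table = defaultdict(dict)
for ts, value in [(1436261508, 21.5), (1436261509, 21.7)]:
    row_key, qualifier, cell_value = to_cell(ts, "temp", value)
    table[row_key][qualifier] = cell_value

print(dict(table))
```

Both data points land in the same row as two columns, so one row can hold up to 60 seconds of readings per header. The minute row key can still be given a bucket prefix in the same way as before.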