Spark updateStateByKey Explained!

Spark Streaming is one of the great components of the Spark engine: it applies computations on the fly to the micro-batches that are created based on the batch interval, and it also helps us combine our streaming results with an archived result store.

Spark Streaming introduces the concept of a DStream (discretized stream). A DStream is nothing but a stream of RDDs, where each RDD holds the data for one batch interval, i.e. 1 batch = 1 RDD. One of the powerful functions exposed by a DStream is updateStateByKey, which helps us maintain state across batches.

The function specified in updateStateByKey is applied to every existing key in every batch, regardless of whether that key has new data or not. If the update function returns None, the key and its state are eliminated.
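As a sketch of that elimination rule (assuming a simple per-key integer count; the object and function names here are illustrative, not from the original post), an update function can return None to drop a key that received no new data:

```scala
object DropStale {
  // Returning None for a key with no new data removes that key from the
  // state, instead of carrying it forward indefinitely.
  def updateDropping(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    if (newValues.isEmpty) None // no new data in this batch: eliminate the key
    else Some(newValues.sum + running.getOrElse(0))
}
```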

The function updates the previous state of each key with the new values, if any are present. For example:


Batch 1 = 1,1,1 arrive in the streaming engine; after reducing these by key we get (1,3) as output
Batch 2 = 2,3 arrive; the output becomes (1,3)(2,1)(3,1)
Batch 3 = 1 arrives; the output becomes (1,4)(2,1)(3,1)
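The batch trace above can be simulated without Spark, using a plain Map as the state store. This is a minimal sketch (the object name and the groupBy standing in for reduceByKey are assumptions for illustration):

```scala
object StateTrace {
  // Same idea as the updateFunction below: sum the new values for a key
  // onto its running count; keep the old state when no new data arrived.
  def update(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    if (newValues.isEmpty) running
    else Some(newValues.sum + running.getOrElse(0))

  // One micro-batch: count the values in the batch (stand-in for
  // reduceByKey), then apply update to every key, old or new.
  def step(state: Map[Int, Int], batch: Seq[Int]): Map[Int, Int] = {
    val counts = batch.groupBy(identity).map { case (k, vs) => k -> vs.size }
    val keys = state.keySet ++ counts.keySet
    keys.flatMap { k =>
      update(counts.get(k).toSeq, state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val b1 = step(Map.empty, Seq(1, 1, 1)) // Map(1 -> 3)
    val b2 = step(b1, Seq(2, 3))           // adds (2,1) and (3,1)
    val b3 = step(b2, Seq(1))              // 1's count grows to 4
    println(b3)
  }
}
```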

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  // newValues holds this key's reduced values from the current batch
  // (empty if the key received no new data); runningCount is the state so far.
  if (newValues.isEmpty)
    runningCount // no new data for this key: keep the old state
  else
    Some(newValues.sum + runningCount.getOrElse(0)) // add new values to the running count
}

// call it as
val reducedRDD = keyValuelines.reduceByKey((x, y) => x + y)
val updatedRdd = reducedRDD.updateStateByKey(updateFunction)
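Wired into a full job, the call site above needs a StreamingContext with a checkpoint directory, since updateStateByKey persists state between batches. A minimal sketch, where the socket source, host/port, batch interval, and checkpoint path are all assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  // Sum this batch's new values for a key onto its running count.
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
    Some(newValues.sum + runningCount.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval (assumed)

    // updateStateByKey requires a checkpoint directory to persist state.
    ssc.checkpoint("/tmp/spark-checkpoint") // hypothetical path

    // Hypothetical source: text lines from a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val keyValuelines = lines.flatMap(_.split(" ")).map(word => (word, 1))

    val reducedRDD = keyValuelines.reduceByKey(_ + _)
    val updatedRdd = reducedRDD.updateStateByKey(updateFunction)
    updatedRdd.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Without the ssc.checkpoint call, updateStateByKey fails at startup, because Spark has nowhere to store the growing state between batches.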
