Get Rapid with RAPIDS

RAPIDS is a collection of open source libraries for writing, deploying and managing end-to-end data pipelines on GPUs. It uses NVIDIA CUDA® to optimize compute resources, but exposes parallelism through well-known Python interfaces.

The focus of this post is not the details of RAPIDS but the steps to get started with it without much difficulty. The RAPIDS team has done a great job of compiling the Startup Guide. But if you are someone like me, who is very new to the world of GPUs but has decent experience designing data pipelines, this post will help you get up and running on the AWS platform.

These are the prerequisites mentioned on the Startup Guide.

Container Host Prerequisites

  • NVIDIA Pascal™ GPU architecture or better
  • CUDA 9.2 or 10.0 compatible nvidia driver
  • Ubuntu 16.04 or 18.04
  • Docker CE v18+
  • nvidia-docker v2+

Well, I was not aware of most of these requirements, or what they mean, when I first started working with RAPIDS. To make it easy, I have compiled the steps below.
Note: You need an AWS account for this. GCP and other cloud providers also offer the machines RAPIDS requires, but for the sake of this post I have chosen AWS.

Step 1) In the AWS console, launch an instance with the AMI Deep Learning Base AMI (Amazon Linux) Version 16.2 (ami-038f5aa6f8673b785).

Step 2) Once you have selected the AMI, make sure to choose GPU instances in the filter-by option; otherwise you won’t be able to run the RAPIDS docker image, as the NVIDIA CUDA® framework requires a GPU.

Step 3) Once the instance is up and running:

      •  Install docker.
      •  Once docker is installed, download the RAPIDS image from the docker repository. The image is available in several other places too; follow the Startup Guide for more details. Run the command below to download the RAPIDS image.
      •  Once the image is downloaded, run the command below to start the RAPIDS container.

        Note the ports mentioned in this command: jupyter must be started on one of them, otherwise you won’t be able to access the notebook and will get an error.

        The Startup Guide says that the above command will start jupyter, but that was not the case for me; I had to start jupyter separately. If this happens to you too, use the command below to start jupyter.


And yay! You now have a jupyter notebook running that uses RAPIDS to perform ETL and many other transformations. Follow the Startup Guide for all ETL operations, as the intent of this post was just to get RAPIDS up and running using the docker image.

For reference here is the cheat sheet for RAPIDS

Logistic Regression – PART 1

Logistic regression is one of the most commonly used binary classification techniques. Given m training examples, you want to classify them into 2 groups.
Let’s consider a problem statement where, given a set of images, we want to know whether each one is a dog (probability 1) or not (probability 0).

In a computer, these images are represented by pixel intensities across the color channels. This pixel vector is the feature vector (a vector that represents the important characteristics of an object) for the image.

A given image (I) will have a feature vector (n) of dimension

Rows = number of pixel values across the Red, Green and Blue bands
Columns = 1


Each image in our data set has corresponding pixel values for the three colors Red, Green and Blue. We can stack them together to produce one big column vector containing all the values.

Each value in this vector is independent of the others but is part of a single observation, i.e. in our case part of one single image. In this way we will have as many vectors as images, each of dimension [number of pixel values X 1]. If we represent the number of pixel values as nx, the dimension of each vector becomes nx X 1.
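
As a rough illustration of the stacking described above, here is a minimal Python sketch. The tiny 2 x 2 image and the function name image_to_feature_vector are made up for this example; a real image would have far more pixels.

```python
def image_to_feature_vector(red, green, blue):
    """Stack the three color channels of one image into a single column of values.

    Each channel is a list of rows of pixel values (hypothetical toy data).
    """
    vector = []
    for channel in (red, green, blue):
        for row in channel:
            vector.extend(row)
    return vector


# A made-up 2 x 2 image: one small matrix per color channel.
red = [[255, 0], [12, 34]]
green = [[10, 20], [30, 40]]
blue = [[1, 2], [3, 4]]

x = image_to_feature_vector(red, green, blue)
nx = len(x)  # 2 * 2 pixels * 3 channels = 12 values
```

With M images we would end up with M such vectors, one per image, each holding nx values.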

To determine whether a given image is of a dog or not, the only way is to look at the characteristics of every input image in the training set. Since these images are nothing but vectors of pixels, we need to use these pixels to determine whether the given image is of a dog!


Imagining Logistic regression as the simplest Neural Network

We can imagine logistic regression as a one-layer neural network also known as a shallow neural network.
We will have inputs (x1, x2, ……, xnx), where nx = the number of values in the feature vector of one image; we have M such feature vectors, where M = the number of training examples. Every input value has a weight associated with it.

The output is ŷ, the probability that the feature vector belongs to a picture of a dog, i.e. the probability of the picture being of a dog.

ŷ = P(Y = 1 | X)

Some Function = b0 + w1*x1 + w2*x2 + ….. + wnx*xnx

Need for the constant b0 in Logistic Regression

With regression we want to approximate a function that defines the relationship between X and Y (input -> output). For this to work we need a bias term so that we can predict accurate values: without it, if all your input values are zero then the value of the function would also have to be zero. Adding a bias weight that does not depend on any of the features allows the hyperplane described by your learned weights to more easily fit data that doesn’t pass through the origin.
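
To make the role of the bias concrete, here is a small sketch of the linear part of the model. The function name linear_score and the numbers are assumptions chosen purely for illustration.

```python
def linear_score(weights, features, bias=0.0):
    """Compute bias + w1*x1 + w2*x2 + ... for one feature vector."""
    return bias + sum(w * x for w, x in zip(weights, features))


weights = [0.5, -0.25, 2.0]   # made-up weights
all_zero = [0.0, 0.0, 0.0]    # an input where every feature is zero

without_bias = linear_score(weights, all_zero)          # forced to zero
with_bias = linear_score(weights, all_zero, bias=1.5)   # equals the bias
```

Whatever the weights are, an all-zero input can only produce zero unless the bias term is there to shift the result.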

w ∈ R^nx, where R^nx is the set of real-valued vectors of dimension nx

b ∈ R, where R is the set of real numbers (b is the constant b0 from above)

The question is how to predict ŷ using w and b, given an input feature vector x.

Since ŷ is a probability, the output of logistic regression must lie between 0 and 1. The output function is therefore a sigmoid function, defined as

ŷ = σ(w^T x + b)
z = w^T x + b
ŷ = σ(z)

σ(z) = 1 / (1 + e^(-z))

Case 1 = When z is a large negative number, e^(-z) will be some big number, giving a value for σ(z) ~ 0
Case 2 = When z is a large positive number, e^(-z) will be ~ 0, giving a value for σ(z) ~ 1
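
The two cases can be checked numerically with a short sketch. The function names sigmoid and predict are my own, and the branch on the sign of z is just one common way to avoid overflow for very negative inputs, not something the formula itself requires.

```python
import math


def sigmoid(z):
    """Return 1 / (1 + e^(-z)), computed in a way that avoids overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, features, bias):
    """Compute y-hat = sigmoid(w^T x + b) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


print(sigmoid(-20))  # very close to 0
print(sigmoid(0))    # 0.5
print(sigmoid(20))   # very close to 1
```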

So, in logistic regression, the task is to learn w,b so that ŷ becomes a good estimate of y.

In the next blog post, we will discuss how to learn the values for w,b

Handling dangling Elasticsearch watcher index.

A few weeks back, our Elasticsearch cluster stopped executing any watchers. From the initial analysis it looked like there was a problem with the AWS SMTP service, which we use for sending mail alerts to our LDAP accounts. We went through more logs and spent some time understanding the sent mail statistics on AWS (thanks to AWS for providing an intuitive UI for insight into emails that are being rejected). After that we were sure there was no problem with sending email, but that something was wrong on the current master. Analyzing the log line below, it was clear that there was an issue with the .watcher index.


  • Delete the local directory: The log line gives the name of the node that is holding a stale copy of the index, along with the directory name. In our case the node was es-master-1 and the directory was 23nm9NSrSkeZaK4Dtyughg under the master’s data folder.
  • Restart Watcher Service: Once the stale index directory is deleted, restart the watcher service.

    POST _xpack/watcher/_restart

Lessons learned the hard way with Elastic Upgrade

In this blog post I want to share my experience and the lessons learned while upgrading our Elasticsearch cluster from 5.4 to 5.6, with TBs of data and without any downtime.
To start, I would first like to introduce the playground and the rules that we had to respect while performing the cluster-wide upgrade.

  • Indexing and searching should not be impacted. At the time of the upgrade we were handling an indexing rate of ~20k/sec and a search rate of ~2500/sec
  • Watcher/alerts should always be running, as they provide us valuable, actionable insight into the TBs of data we host
  • Security of the cluster should not be compromised
  • Plugins/visualizations and dashboards are all expected to function as they were before upgrade

As it was a minor/rolling upgrade, we upgraded one node at a time. Things were simple as we moved from one data node to the next, keeping a check on the points below.

  • Stop cluster-wide allocation of shards. When we stop one of the nodes to upgrade its ES version, the master will mark some primary and replica shards as missing, which will lead to reassignment of shards. We don’t want this to happen, as the node is going to rejoin the cluster shortly.

    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "none"
      }
    }
  • Stopping a node will put the cluster in the yellow state. Wait for the cluster to turn green before moving on to the next node. Each time you bring a node back after upgrading its ES version, set the allocation back to all.

    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "all"
      }
    }
  • There is always a possibility that, even after setting the allocation back to all, ES will take a lot of time to allocate shards and turn the status green. If that is the case, look for the setting

    . You may need to alter this setting to enable ES to assign shards
  • One of the pain points of a rolling upgrade with ES is that, until the upgrade is over, you will have nodes running multiple versions of the X-Pack plugin, which makes it impossible to track progress from Kibana. Your best bet is to rely on the REST API endpoints. For instance, assume a cluster with 2 client nodes on which Kibana is running, and the other 5 nodes acting as data and master nodes. If we upgrade datanode1, its X-Pack plugin version will differ from that of the client nodes, as neither the client nodes nor Kibana have been upgraded yet. This prevents datanode1 from sending monitoring data points to Kibana, making it impossible to visualize progress in the Kibana UI
  • One natural instinct is to upgrade Kibana first and then the rest of the nodes to avoid any monitoring problem, but that is a vicious circle: Kibana 5.6 will not start until all the nodes are at 5.6. Kibana 5.4 can work with some nodes at 5.4 and some at 5.6, but it won’t be able to display monitoring for the nodes that are at 5.6
  • Another pain point is watcher. Watcher runs on the master node. Always, and I repeat always, remember to upgrade the master nodes last. If you upgrade the master nodes first while some of your data nodes are still running 5.4, your watchers will fail. They will even keep failing until you have upgraded Kibana.
  • As the watchers use .monitoring indexes, all the watchers will fail even after you have upgraded all your client, data and master nodes, until Kibana is also upgraded.
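
The allocation toggle and the node ordering from the list above can be sketched in Python as follows. This only builds and prints the request bodies; it does not talk to a cluster. The helper name allocation_settings and the data node names are hypothetical (es-master-1 is the node name mentioned earlier).

```python
import json


def allocation_settings(enabled):
    """Build the body for PUT _cluster/settings.

    enabled=False -> "none" (before stopping a node)
    enabled=True  -> "all"  (after the upgraded node has rejoined)
    """
    value = "all" if enabled else "none"
    return {"persistent": {"cluster.routing.allocation.enable": value}}


# Hypothetical node list; data nodes go first and master nodes last,
# following the ordering advice in the list above.
nodes = [("es-data-1", "data"), ("es-master-1", "master"), ("es-data-2", "data")]
upgrade_order = sorted(nodes, key=lambda node: node[1] == "master")

for name, role in upgrade_order:
    print(name, "-> before stop:", json.dumps(allocation_settings(False)))
    print(name, "-> after rejoin:", json.dumps(allocation_settings(True)))
```

Because Python's sort is stable, the data nodes keep their original relative order and only the master nodes are pushed to the end.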

Demystifying Elasticsearch refresh!

By default Elasticsearch is configured to refresh the index every 1 second. This means it can take up to 1 second for changes made to a document to become visible to search. But what if we have a requirement to trigger a process only once the changes are available in search results?

We want the index/update or insert request to wait until the changes made to the documents are available for search before it returns.

The refresh parameter is available on these APIs to control when the index is refreshed and the changes are made available to the user.

  • Setting it to true (refresh=true) causes the relevant primary and replica shards, not the complete index, to be refreshed immediately.
  • Setting it to wait_for (refresh=wait_for) causes the request to wait until the index is refreshed by Elasticsearch based on index.refresh_interval, i.e. 1 sec by default. Once the index is refreshed, the request returns.
  • Setting it to false (refresh=false) has no impact on refresh, and the request returns immediately. It simply means the data will become available in the near future.

Note: ElasticSearch will refresh only those shards that have changed, not the entire index
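
As a small illustration of how the parameter is attached to a request, here is a Python sketch that only builds the request path. The helper name index_request_path and the index, type and id values are hypothetical; nothing is sent to a cluster.

```python
from urllib.parse import urlencode


def index_request_path(index, doc_type, doc_id, refresh=None):
    """Build the path for an index request, optionally with ?refresh=..."""
    allowed = (None, "true", "false", "wait_for")
    if refresh not in allowed:
        raise ValueError("refresh must be one of 'true', 'false' or 'wait_for'")
    path = "/{}/{}/{}".format(index, doc_type, doc_id)
    if refresh is not None:
        path += "?" + urlencode({"refresh": refresh})
    return path


# Hypothetical index, type and id.
print(index_request_path("orders", "doc", "1", refresh="wait_for"))
# prints /orders/doc/1?refresh=wait_for
```

Leaving refresh unset gives the plain path, which behaves like refresh=false.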

But there is a catch in these simple parameters: there are cases that will cause a refresh to happen irrespective of the value of the refresh parameter you have set

  • If there are already index.max_refresh_listeners (defaults to 1000) requests waiting for a refresh on a shard, a further refresh=wait_for request will cause that shard to be refreshed immediately.
  • By default GET is realtime, i.e. if a GET request is made for a document that has been changed but not yet refreshed, it issues a refresh of the relevant shard, making all changes since the last refresh available.