Sunday, January 15, 2017

The quick and dirty on getting a Hadoop cluster up and running

The last time I tested out a Hadoop cluster was about four years ago. At the time, the setup was manual and I used two machines to build a two-node cluster, ran MapReduce jobs and tested the system out. Fast forward to 2017: there are a lot more animals in the Hadoop circus, and Cloudera has made things very convenient by offering various options to test with - a quickstart VM, Docker images, etc. Still, getting the system up and running and executing the first sqoop job took some effort. I have documented what I did, including the tweaks, so anyone running into the same issues can get help. Here are the steps:

  1. Get a system ready with at least 16GB of RAM. I had a Linux Mint box on hand that I used. Mint is generally similar to Ubuntu, but you do have to watch out for version-specific instructions.
  2. The next step is to install Docker. For Linux Mint 18, the steps here from Simon Hardy came in really handy.
  3. Cloudera provides a Docker quickstart container - do not use that. There is a lot of documentation and there are many links pointing to it, so it's quite easy to go down that path. A better option is to use Cloudera clusterdock. Clusterdock deploys a multi-node cluster on the same Docker host (by default it deploys a two-node cluster). The clusterdock documentation is very useful, but there are a few catches that I will note here. There are a few other links on clusterdock here:
    1. The Cloudera online tutorial is based on the quickstart Docker container or the quickstart VM. It has several dependencies, including a MySQL database, the flume files used in the demo, etc. I would suggest keeping that container around for a bit and copying the data as needed to the clusterdock nodes. The clusterdock setup is much more stable in Cloudera Manager, and the slight inconvenience (occasional hardware freezes) may be worth it. After launching the quickstart container, you can simply tar the /opt/examples folder, dump the MySQL retail_db database, transfer both to the host machine using the docker cp command, and then kill the quickstart container (see the first sketch after this list).
    2. The clusterdock.sh script on the Cloudera website lacks 'sudo' in a couple of places - for example in the ssh function. Be aware of that; it's easy to spot if it causes a problem.
    3. I wanted to run a sqoop command, and for that I needed a MySQL database to connect to. I had one installed on the host machine, but connecting the clusterdock containers to the host MySQL looked nearly impossible, or at least very time consuming - you run into a Docker networking issue. The easy way out is to run a MySQL Docker container and put it on the same user-defined network that the clusterdock nodes use. Note that you have to force the IP and the network on the MySQL container to do that, and also map MySQL's port 3306 so it is open for access (see the first sketch after this list). Otherwise you will waste a lot of time!
  4. The next step is to ssh into the master node and run a sqoop job. At this point you will run into a lot of permissions issues if you are not careful about where you store the target imported files; sqoop generally reports these permissions errors as exception stack traces. Best is to google the errors and fix any paths you pass to the sqoop command.
  5. You also need to copy the MySQL JDBC jar into the sqoop lib folder. The easiest way is to get it onto the host box and then use docker cp to move it to the desired location (see the second sketch after this list).
  6. That should do it: you will be past the sqoop step, and from there you can run a query in Hue.
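
To make the MySQL pieces concrete, here is a rough sketch of how I pulled the tutorial data out of the quickstart container and stood up a MySQL container on the clusterdock network. The container names, the network name, the pinned IP and the passwords are from my setup (or plain placeholders), so substitute your own:

# Assumptions: the quickstart container is named "quickstart", the clusterdock
# network shows up as "cluster" in `docker network ls`, and 192.168.123.100 is
# a free address on that network - adjust all of these for your setup.

# Copy the tutorial data out of the quickstart container before killing it
docker cp quickstart:/opt/examples ./examples
docker exec quickstart mysqldump -uroot -pcloudera retail_db > retail_db.sql

# Run a MySQL container on the same user-defined network as the clusterdock
# nodes, pin its IP, and map port 3306 so it is open for access
docker run -d --name sqoop-mysql \
  --network cluster --ip 192.168.123.100 \
  -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=cloudera \
  mysql:5.7

# Once MySQL has finished initializing, load the retail_db dump
docker exec -i sqoop-mysql mysql -uroot -pcloudera -e "CREATE DATABASE retail_db"
docker exec -i sqoop-mysql mysql -uroot -pcloudera retail_db < retail_db.sql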
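
And here is a sketch of getting the JDBC driver in place and running the import itself. The node container name, the connector jar version, the table and the HDFS target directory are just examples from my run, not fixed values:

# Assumptions: the clusterdock primary node container is named "node-1.cluster"
# and the MySQL container from the previous sketch is reachable at
# 192.168.123.100 - use the names and paths from your own environment.

# Push the MySQL JDBC driver into the node's sqoop lib directory
docker cp mysql-connector-java-5.1.40-bin.jar node-1.cluster:/var/lib/sqoop/

# From a shell on the node, run the import as a user that can write to the
# HDFS target directory (e.g. run it as hdfs, or fix the directory ownership)
sudo -u hdfs sqoop import \
  --connect jdbc:mysql://192.168.123.100:3306/retail_db \
  --username root --password cloudera \
  --table categories \
  --target-dir /user/hdfs/categories \
  -m 1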

Saturday, January 14, 2017

Analyzing Gapminder Data


Founded in Stockholm by Ola Rosling, Anna Rosling Rönnlund and Hans Rosling, Gapminder is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals. It seeks to increase the use and understanding of statistics about social, economic, and environmental development at local, national, and global levels. Since its inception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, for a total of 215 areas.

Gapminder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau's International Database, the United Nations Statistics Division, and the World Bank.

Decision tree analysis was performed (using Python) to test nonlinear relationships among a series of explanatory variables (X[0] through X[12] below) and a binary categorical response variable (above-average suicides per 100,000 in 2005). Note that the Python decision tree implementation used here does not support pruning. The code is included below the analysis.




Index     Variable: description
X[0]      income per person: 2010 Gross Domestic Product per capita in constant 2000 US$
X[1]      alcohol consumption: 2008 alcohol consumption per adult, age 15+ (litres)
X[2]      armed forces rate: armed forces personnel (% of total labor force)
X[3]      breast cancer per 100th: 2002 breast cancer new cases per 100,000 females
X[4]      co2 emissions: 2006 cumulative CO2 emissions (metric tons)
X[5]      female employment rate: 2007 female employees age 15+ (% of population)
X[6]      HIV rate: 2009 estimated HIV prevalence, ages 15-49 (%)
X[7]      internet use rate: 2010 Internet users (per 100 people)
X[8]      oil per person: 2010 oil consumption per capita (tonnes per year per person)
X[9]      polity score: 2009 democracy score (Polity)
X[10]     residential electricity consumption per person: 2008 residential electricity consumption per person (kWh)
X[11]     employment rate: 2007 total employees age 15+ (% of population)
X[12]     urban rate: 2008 urban population (% of total)
response  suicide per 100th: Is the 2005 suicide rate (age adjusted, per 100,000) above average? (True or False)


Alcohol consumption was the first variable to split the sample into two subgroups. The threshold was 16.19 litres: countries above it were likely to have an above-average suicide rate. The next split was on residential electricity consumption; consumption below 230 kWh was associated with a higher likelihood of an above-average suicide rate. This was followed by the armed forces rate - if less than 41% of the labor force was part of the armed forces, the likelihood was higher. Alcohol consumption and employment rate came next: higher alcohol consumption (above 14 litres) combined with a low employment rate (below 49%) was associated with an above-average suicide rate. Countries with lower alcohol consumption (below 14 litres) and low female employment rates (less than 56%) were likely to have a below-average suicide rate; however, when this was combined with low internet use (below 85 users per 100 people), the likelihood shifted back to an above-average suicide rate.

Overall, the model classified 64% of the sample correctly.


Thursday, June 6, 2013

Dumping Tomcat packets using tshark

Tshark (the command-line version of Wireshark) is a wonderful tool for dumping packets, and recently I used it on my Mac since I couldn't easily get Tomcat to log the HTTP requests coming in on port 8080. Having used it in the past for lots of other reasons, I felt compelled to find a generic solution to this kind of problem, where otherwise you have to rely on application-level logging to determine why something works or doesn't.

Here is the command I used (lo0 is the loopback interface, since I was running the client and server on the same machine):

tshark -f "tcp port 8080" -i lo0 -V

Here is a very good page on tshark that I am sure I will come back to again and again to get more juice out of this tool.
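
A couple of variations I keep reaching for are writing the capture to a file and replaying it later; the file name below is just an example, and the -d option forces the dissector so traffic on port 8080 is decoded as HTTP:

tshark -f "tcp port 8080" -i lo0 -w tomcat-8080.pcap
tshark -r tomcat-8080.pcap -V
tshark -r tomcat-8080.pcap -d tcp.port==8080,http -V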

Getting Postman to test Restlet

Postman is a wonderful Chrome application for testing REST method calls. However, it was a bit hard to get it to work against my Restlet server. Here are the gotchas I faced; I haven't yet figured them all out completely, but I like to record them so I can come back much later and they are still here ;)

1. Postman does not add the Content-Type header by itself. If you select an HTTP method that allows a body (e.g. POST), it lets you create the body and select the format (JSON/XML etc.), but you must remember to add the Content-Type header yourself.
2. If your application requires authentication, you can add that too. Postman supports basic, digest as well as OAuth (which I would love to test out next).
3. The biggest problem I ran into was when I sent an XML or JSON body and the Restlet server replied with 415 - Unsupported Media Type; the request never even reached my application code. If you write a client using the Restlet framework and choose MediaType.APPLICATION_XML, and the server-side method is annotated with @Post("txt:xml"), it works. However, setting the Content-Type header in Postman to application/xml does not work. To debug this further, I installed Wireshark and dumped the packet contents. I was surprised to find that the Restlet client was actually sending a Content-Type header of text/plain - this had to be some issue on my end. Interestingly, once I made the corresponding changes in Postman, its requests started to work as well. These are the two headers I inserted; note that a Content-Type of text/plain alone still does not work - you must indicate the charset as well.

Content-Type: text/plain; charset=UTF-8
Accept: application/xml
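
For reference, here is a rough curl equivalent of the request that finally worked; the URL (including the port) and the body are placeholders for my test resource:

curl -X POST "http://localhost:8182/myresource" \
  -H "Content-Type: text/plain; charset=UTF-8" \
  -H "Accept: application/xml" \
  --data "some test payload"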


Friday, May 24, 2013

Comparing Oracle NoSQL and MongoDB features

I recently had a chance to go through Oracle's documentation on its NoSQL solution in an effort to evaluate it against MongoDB. The comments below are based on Oracle's 11g Release 2 documentation (4/23/2013), while MongoDB is at release 2.4.3.

Oracle's deployment architecture for its NoSQL solution bears a strong resemblance to MongoDB's, with some restrictions that I feel will go away in subsequent releases. The whole system is deployed using a pre-decided number of replica sets, where each replica set may have a master and several slaves. The durability and consistency model also seems similar to what MongoDB offers, although the explanation and the controls seemed a lot more complex. One of the restrictions is that the user must decide up front how many partitions the whole system will have, and these partitions are then allocated amongst the shards. Mongo's concept of "chunks" seemed similar but easier to use and understand.

One of the biggest issues is security: the system has almost no security at the user-access level, and there is no command-line shell to interact with it. The only API that can realistically be used is Java. This is clearly not developer friendly right now.

Perhaps most confusing was the concept of JSON schemas used to save and retrieve values in the key-value database. Every time you save or retrieve a value you have to specify a schema to serialize and deserialize the data, and these schemas may have different versions that you need to track. There can be multiple schemas in concurrent use (e.g. each table could be using its own schema). What was confusing was why Oracle took this approach and, even if they had their reasons, why it was not hidden under the hood so users don't have to deal with it. The boilerplate code that has to be written to constantly handle this simply looks unreadable, and no developer would find it fun to use this system.

I also noticed the absence of ad-hoc indexing in the system, something I have begun to appreciate in Mongo now.

Another odd feature was that an update made to a set of records sharing the same "major" key can be made as a transaction. This is because records with the same major key always go to the same partition, which ends up on a single replica set and hence on one physical node acting as the master. This is one more thing the developer must carefully consider before the schema is designed.

I found Mongo to be much more mature in terms of a truly flexible schema - the user sees records as full JSON documents, where each record in the same table can potentially have different fields. In Oracle, if a field is in the schema but missing from a record, it must have a default value (very much like a relational system).

What seemed well thought through were the deployment architecture and the durability and consistency model, along with how they are controlled by the user on a per-operation basis. I am not aware of any big deployments using Oracle NoSQL yet, so it would be good to hear if there are any. I am also expecting a big revamp in the next release from Oracle so that it is easier to use from a development standpoint.