**** Very Sick Company ****

First rule of Very Sick Company: never talk about book club.

Getting F*ing Drill on Ambari

September 12, 2019



Drill is a fun query layer that is VERY easy to use but also NOT the easiest thing to set up in an Ambari environment.

In fact it took me quite a few tries and some very small changes to get to the end game.  The following worked in my specific environment, the HDP 2.6.5 sandbox.

I hope it works for you, so give ‘er a go…

SSH into your Ambari sandbox and switch to root:

su root

First you (obviously) need to get Apache Drill…

wget http://archive.apache.org/dist/drill/drill-1.12.0/apache-drill-1.12.0.tar.gz

… this is NOT the latest version, but it’s the only version I got working.  Feel free to try out others; you can see the full list here:

http://archive.apache.org/dist/drill/

You know how these things go; nothing ever works with anything else.  You may have to experiment…

Once you’ve got this downloaded simply unpack it …

tar -xvf apache-drill-1.12.0.tar.gz

Then cd into the newly created directory and start the Drillbit, giving it its own HTTP port …

bin/drillbit.sh start -Ddrill.exec.http.port=8086

… here’s the rub though: this port number (8086) is the one that worked for me, not the one originally suggested to me.  Try it first, but if things don’t work, don’t rule out the possibility that the port number is your only problem.  Google it.

If all of this actually works, you can open a browser at your sandbox’s local address with that new port number, for example:

http://127.0.0.1:8086

This is the pretty cool part, because you’ll get a fairly decent-looking UI for experimenting with Drill.

Drill is fast and simple: plain SQL queries that you can run against any data source, and even ACROSS different types of data sources.  For example, you can perform JOINs between a file sitting in HDFS and an HBase table, or any other data source you may have.
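
If you want to poke at it from code rather than the web UI, Drill also exposes a REST endpoint on that same port.  Here’s a minimal Python sketch, assuming the /query.json endpoint behaves the way the Drill docs describe for this version, that requests is pip-installed, and that the dfs path is just a placeholder for whatever file you actually have in HDFS:

import requests

# Hedged example: ask Drill to run a SQL query over a CSV in HDFS via its REST API.
# The port matches the drillbit.sh flag above; the file path is a made-up placeholder.
payload = {
    "queryType": "SQL",
    "query": "SELECT * FROM dfs.`/user/maria_dev/titanic/train.csv` LIMIT 5",
}
resp = requests.post("http://127.0.0.1:8086/query.json", json=payload)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)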

Lemme know if you get this working with different parameter settings/methods/etc… I’d love to add them to this post.


MongoDB on Ambari

August 29, 2019



Mongo does NOT come with Ambari.  And yes, it is a pain in the ass to install.  Just trust me though, it DOES work.  It might make you wish you’d gone into the trades instead of IT, but it will work.

I’m using Oracle VirtualBox with the HDP 2.6.5 sandbox.  This is an old version, but I’ve found it to be a hell of a lot less of a memory pig than any of the newer ones.  Feel free to experiment, especially if you have a better computer than the pile of crap I’m using.

So now just do what I say and don’t ask any questions my darlings…

Log into the sandbox with your SSH client.  Make sure you’re root, because you’ll be installing stuff:

su root

Then cd into the following directory (swap the 2.65 for whichever stack version directory you’re actually running)…

cd /var/lib/ambari-server/resources/stacks/HDP/2.65/services

Now you can grab the MongoDB adapter that this guy so kindly built for us…

git clone https://github.com/nikunjness/mongo-ambari.git

Now restart your Ambari service:

sudo service ambari restart

… in fact you should tattoo the above command on the back of your wrist, because if anything is EVER going wrong with your Ambari service, just try this first.  It’s the fancy equivalent of unplugging your computer and then plugging it back in again.
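
Once Ambari is back up you still need to add the new MongoDB service through Ambari’s Add Service wizard (that’s the usual flow for a custom stack service) and start it.  After that, here’s a hedged little smoke test from Python, assuming pymongo is pip-installed and Mongo is listening on its stock port of 27017:

from pymongo import MongoClient

# Assumes the MongoDB service installed through Ambari is up on the default port;
# tweak host/port if your setup differs.
client = MongoClient('localhost', 27017, serverSelectionTimeoutMS=5000)
print(client.server_info()['version'])  # throws if Mongo isn't actually reachable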


You Have to Get Ambari Installed Locally or Just Kill Yourself

August 22, 2019



When I started in Big Data I figured that using AWS was a great idea.  Hell, it was free.

And then you go to bed, only to remember in the morning that you left 6 clusters running because you’re an idiot.  Turns out that those mistakes will suck the juice out of your credit card faster than an Omega Compact CNC80.

Like in days past (PHP, MySQL, etc…) you need to do this stuff on your crappy laptop before you can step into the big leagues.

Lucky for us all it’s possible to do now, and only a little bit of a pain in the ass.

If you’re on a Mac then you’re on your own.  I can’t afford a precious little mac so get a $300 Dell and do the following ….

First you need a little Oracle tech: their VirtualBox, which you can grab for free from https://www.virtualbox.org/wiki/Downloads.  The only trick is that it’s huge.

Click the big green button and go away on vacation, then once you’re back, install whatever it is that you got.

On its own, this VirtualBox is virtually useless until you add the HortonWorks Sandbox (HDP).  This is also free, however it’s a monster.  So grab it here:

https://hortonworks.com/downloads/

… then go on vacation for a few weeks while it downloads.  Once it’s done, run VirtualBox, select File/Import Appliance and pick the sandbox you downloaded.

One huge caveat: there are many versions of HortonWorks.  I would suggest getting a handful of different versions from the archives.  My eventual go-to version of choice was 2.6.5.  It seems a tad lighter than the newer ones, but feel free to experiment.  I’ve got at least 4 versions loaded to go, and you may have better luck with the 3.* series than I did.

Once it’s loaded up you just need to Start your Sandbox and follow the instructions to get a visual Hadoop UI.

You will definitely also require a command line into it, so make sure to get Putty or some other SSH client.

The UI opening screen will tell you what address to SSH into; take note of this and log in to it.  It’s probably something like…

Host: maria_dev@127.0.0.1

Port: 2222
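
If you’d rather script the login than click through Putty, here’s a minimal paramiko sketch against that same host and port (it leans on the default maria_dev credentials mentioned below, so adjust if you’ve changed them):

import paramiko

# Connect to the sandbox's forwarded SSH port with the default maria_dev account
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('127.0.0.1', port=2222, username='maria_dev', password='maria_dev')
stdin, stdout, stderr = client.exec_command('hostname -f')
print(stdout.read().decode().strip())
client.close()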

The default user is maria_dev (same password).  But at some point you’ll need to log in as an admin, so just get it over with now:

su root

The default admin login is ‘admin’ and ‘hadoop’.  Change this in your SSH session first:

ambari-admin-password-reset

You have to restart everything all the time.  It is time consuming but not difficult.

ambari-agent restart

For the record, the above command is the most useful thing in the whole repertoire.  Whenever ANYTHING goes wrong with Ambari, pop this into your console and see if it fixes it.


HBase and Pig and Titanic

August 14, 2019


Since NoSQL is the future of humanity and will save the Universe, I’ve thrown together this quick tutorial on how to use it in a (semi) practical sense.

I’ve used Ambari, locally, to run this experiment.  Although I can’t give a full tutorial on Ambari or Hortonworks, I will provide the following links.  You’ll need to download two files (one giant), and there’s plenty of great documentation for installing and using them:

Hortonworks Data Platform

https://hortonworks.com/downloads/

Oracle VirtualBox (the latest version is 6.0, however I had some problems with this and have reverted down to 4.5)

https://www.virtualbox.org/wiki/Downloads

For the sake of simplicity I’m using the Titanic data set (the train.csv file) which you can get from Kaggle:

https://www.kaggle.com/c/titanic

The first thing you’ll want to do is upload this dataset into HDFS in Ambari (go to the HDFS Files View).  I’ve put mine in a ‘titanic’ directory.  You can do this with the command line too; I just found it easier to use the dashboard for such a relatively small file.

You’ll need to SSH into your local Ambari, being on Windows I’m using Putty.

Once you have a nice connection, you can start checking out your HBase situation.  To get to the shell just type:

hbase shell

At the HBase prompt you can try a couple of things.  First just type …

list

… to see a list of current tables.  Ambari automatically installs a few examples for you, but we’ll need to make a new one for our Titanic data.  So just type …

create 'titanic', 'passengers'

… which creates a new table called ‘titanic’ with a column family called ‘passengers’.  If you’re not sure what a column family is, you might want to do a bit of research on HBase and NoSQL in general.  It’s not very difficult, but some background will help when you take a look at the final product.

Now for fun, type…

scan 'titanic'

… which should show you a new table with zero rows.

Now type …

exit

… to exit from the HBase shell and get back into your normal Linux prompt.  You’re going to need to get a Pig script into this location. The Pig file is as follows.

-- Load the raw Titanic CSV from HDFS, splitting on commas
A = LOAD '/user/maria_dev/titanic/train.csv'
USING PigStorage(',')
AS (PassengerId:int, Survived:boolean, Pclass:int, Name:chararray, Sex:chararray, Age:int, SibSp:int, Parch:int, Ticket:chararray, Fare:float, Cabin:chararray, Embarked:chararray);
-- Drop the header row (its Name field is the literal string 'Name')
users = FILTER A by $3 != 'Name';
DUMP A;
DESCRIBE users;
DUMP users;
-- Store into the 'titanic' HBase table: the first field (PassengerId) becomes the row key,
-- the rest map onto the 'passengers' column family
STORE users INTO 'hbase://titanic'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage (
'passengers:Survived, passengers:Pclass, passengers:Name, passengers:Sex, passengers:Age, passengers:SibSp, passengers:Parch, passengers:Ticket, passengers:Fare, passengers:Cabin, passengers:Embarked');

For convenience’s sake I’ve uploaded it to my server, so you can get the file into your Ambari by typing the following (you lazy bugger) …

wget http://www.matthewhughes.ca/titanic.pig

Now you SHOULD be ready to go!  Simply type …

pig titanic.pig

… and watch the magic happen.  It can take a while, so go get a coffee.

Once it’s done (successfully, we hope), go back into the HBase shell and scan your titanic table (as per the instructions above).  Your Titanic data is now in hBase!  (or HBase, or hbase, or HBaSe, who knows…)
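
If you’d rather poke at the new table from Python than from the shell, here’s a minimal sketch using happybase.  Big caveat: this goes through the HBase Thrift gateway, so it assumes you’ve started that first (hbase thrift start) and pip-installed happybase; neither of those is part of the original walkthrough.

import happybase

# Connect through the HBase Thrift gateway on the sandbox (default port 9090)
connection = happybase.Connection('127.0.0.1')
table = connection.table('titanic')

# Row keys are the PassengerId values the Pig script loaded
print(table.row(b'1'))

# Or peek at a handful of rows from the 'passengers' column family
for key, data in table.scan(limit=3):
    print(key, data)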


Some Useful (and Simple) PySpark Functions

August 2, 2019



I’ve been to Spark and back.  But I did leave some of my soul.

According to Apache, Spark was developed to “write applications quickly in Java, Scala, Python, R, and SQL”.

And I’m sure it’s true.  Or at least I’m sure their intentions were noble.

I’m not talking about Scala yet, or Java; those are whole other languages.  I’m talking about Spark with Python, or PySpark, as the Ogilvy-inspired geniuses at Apache marketing call it.

The learning curve is not easy my pretties, but luckily for you, I’ve managed to sort out some of the basic ecosystem and how it all operates.  Brevity is my goal.

This doesn’t include MLlib, or GraphX, or streaming; just the basics.

Show pairwise frequency of categorical data

train.crosstab('matchType', 'headshotKills').show()

This outputs something like this:

+-----------------------+----+----+---+---+---+---+---+---+
|matchType_headshotKills|   0|   1|  2|  3|  4|  5|  6|  8|
+-----------------------+----+----+---+---+---+---+---+---+
|                duo-fpp|3762| 608|127| 31|  7|  6|  0|  1|
|               solo-fpp|1955| 331| 77| 28|  6|  2|  2|  0|
|         normal-duo-fpp|  19|   4|  1|  0|  0|  0|  0|  0|
|               crashtpp|   1|   0|  0|  0|  0|  0|  0|  0|
|              squad-fpp|6547|1032|216| 56| 14|  4|  1|  0|
|               crashfpp|  35|   1|  0|  0|  0|  0|  0|  0|
|       normal-squad-fpp|  50|   9|  1|  1|  4|  1|  2|  0|
|        normal-solo-fpp|   4|   1|  3|  0|  1|  0|  0|  0|
|                  squad|2397| 345| 70| 24|  5|  2|  1|  0|
|               flarefpp|   5|   0|  0|  0|  0|  0|  0|  0|
|                   solo| 644|  98| 13|  4|  4|  0|  0|  0|
|             normal-duo|   0|   0|  0|  0|  1|  0|  0|  0|
|                    duo|1198| 159| 55|  9|  3|  2|  0|  0|
|               flaretpp|   6|   1|  1|  0|  0|  0|  0|  0|
|           normal-squad|   1|   1|  0|  0|  0|  0|  0|  0|
+-----------------------+----+----+---+---+---+---+---+---+

Returns a dataframe with all duplicate rows removed

train.select('matchType','headshotKills').dropDuplicates().show()

Drop any NA rows

train.dropna().count()

Fill NAs with a constant value

train.fillna(-1)

A very simple filter

train2 = train.filter(train.headshotKills > 1)

Get the Mean of a Category

train.groupby('matchType').agg({'kills': 'mean'}).show()

Get a count of distinct categories in a Column

train.groupby('matchType').count().show()

Get a 20% sample of a dataframe

t1 = train.sample(False, 0.2, 42)

Create a tuple set from Columns.  Note that dataframes do NOT support mapping functionality, so you have to explicitly convert to an RDD first (that’s the .rdd call below)

train.select('matchType').rdd.map(lambda x:(x,1)).take(5)

Order by a Column

train.orderBy(train.matchType.desc()).show(5)

Add a new Column based on the calculation of another Column

train.withColumn('boosts_new', train.boosts / 2.0).select('boosts','boosts_new').show(50)

Drop a Column

train.drop('boosts').columns

Using SQL

train.registerTempTable('train_table')
sqlContext.sql('select matchType from train_table').show(5)

That’s it for now!


How to Start a New PySpark Job

July 24, 2019



I’ve been to Spark and back.  But I did leave some of my soul.

According to Apache, Spark was developed to “write applications quickly in Java, Scala, Python, R, and SQL”.

And I’m sure it’s true.  Or at least I’m sure their intentions were noble.

I’m not talking about Scala yet, or Java; those are whole other languages.  I’m talking about Spark with Python, or PySpark, as the Ogilvy-inspired geniuses at Apache marketing call it.

The learning curve is not easy my pretties, but luckily for you, I’ve managed to sort out some of the basic ecosystem and how it all operates.  Brevity is my goal.

This doesn’t include MLlib, or GraphX, or streaming; just the basics.

Import some data

train = sqlContext.read.option("header", "true")\
.option("inferSchema", "true")\
.format("csv")\
.load("train_V2.csv")\
.limit(20000)
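
One assumption worth flagging: these snippets use sqlContext, which already exists if you’re sitting in the sandbox’s pyspark shell or a Zeppelin notebook.  If you’re running a standalone script instead, here’s a minimal Spark 2.x sketch for getting an equivalent handle (the app name is arbitrary):

from pyspark.sql import SparkSession

# Build (or reuse) a session; SparkSession has .read and .sql,
# so it can stand in for sqlContext in the examples here.
spark = SparkSession.builder.appName('pubg-explore').getOrCreate()
sqlContext = spark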

Show the head of a dataframe

train.head(5)

List the columns and their value types

train.printSchema()

Show a number of rows in a better format

train.show(2, truncate=True)

Count the number of rows

train.count()

List column names

train.columns

Show count, mean, stddev, min, max, etc…

train.describe().show()

Show count, mean, stddev, etc… of just one column

train.describe('kills').show()

Show only certain columns

train.select('kills','headshotKills').show(5)

Get the distinct values of a column

train.select('boosts').distinct().count()
train.select('boosts').distinct().show()

That’s it for now…


Quick Correlation Plot with Seaborn

July 6, 2019


Correlation is the simplest way to start comparing features to see which data points may line up with other data points.

It’s fairly easy to get a quick visualization with the Pandas corr() function and a fancy Seaborn plot.

The only prerequisite is that you need to make sure all the data points in your set are numerical, either by default, or design, or elimination.  Once this has been accomplished, simply call the corr() function on your data set:

corr = df.corr()
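
If your dataframe still has non-numeric columns hanging around, one quick way to keep just the numeric ones before calling corr() (a small aside, assuming a reasonably recent pandas):

# Keep only the numeric columns so corr() has something to work with
numeric_df = df.select_dtypes(include='number')
corr = numeric_df.corr()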

Then you can plot it.  Feel free to change the aesthetic defaults I’ve included here:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(9,7))
sns.heatmap(
  corr,
  xticklabels=corr.columns.values,
  yticklabels=corr.columns.values,
  linecolor='white',
  linewidths=0.1,
  cmap="RdBu"
)
plt.show()

And you’ll end up with a fancy-looking heatmap of all your correlations.


Class Imbalance

June 26, 2019


This is an important concept when performing any kind of predictive analysis.  All it means is that the variable you are attempting to predict needs a reasonably even balance between its classes; in the binary case, between the 0s and the 1s.

So if you’re attempting to predict, let’s say, cancer, your data must have a fair balance between positive cancer results and negative results.  If your data has 10 positive results and a million negatives, you will probably not be able to form a useful algorithm.

Luckily, I found this little function that will go through your data and print out the class balance for you.

def print_dx_perc(data_frame, col):
   # Count how many rows fall into each class of the target column
   dx_vals = data_frame[col].value_counts()
   dx_vals = dx_vals.reset_index()
   f = lambda x, y: 100 * (x / sum(y))
   for i in range(0, len(dx_vals)):
      print('{0} accounts for {1:.2f}% of the diagnosis class'.format(
         dx_vals['index'].iloc[i], f(dx_vals[col].iloc[i], dx_vals[col])))

print_dx_perc(breast_cancer, 'diagnosis')
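
The post never says where breast_cancer comes from; presumably it’s the usual Wisconsin breast cancer CSV.  If you just want to try the function out, here’s one hedged way to fake it with scikit-learn’s built-in copy of that dataset (the ‘diagnosis’ column name is grafted on to match the call above):

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Build a dataframe with a 'diagnosis' column (0 = malignant, 1 = benign in sklearn's encoding)
raw = load_breast_cancer()
breast_cancer = pd.DataFrame(raw.data, columns=raw.feature_names)
breast_cancer['diagnosis'] = raw.target

print_dx_perc(breast_cancer, 'diagnosis')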

Avocados! … and Plotly and DASH

June 5, 2019



Hello my pretties…

I discovered Plotly and DASH, so here is my first attempt.

https://mattocado.herokuapp.com/

As a colleague pointed out, ‘holy over-plotting Batman!’, which is totally correct.  I figured I would throw the kitchen sink at it and see what would happen.  Not sure it’s of any practical use, but at least I know how it works now.

https://www.kaggle.com/mattdata72/avocados-with-plotly-and-dash

Also thanks to that same colleague for introducing me to the whole architecture in the first place.  It’s very cool stuff, and hopefully I will post some technical details soon!


World War 2 Data Set

May 24, 2019



Since I’m a total WW2 nerd, and obviously a data one too, I found this great dataset that lists the weather conditions recorded during the conflict and where they were observed.

I’m not sure what grander plans I have for it yet, so I started with the obvious and built this script which cleans up the VERY old dataset:

https://www.kaggle.com/mattdata72/matt-ww2-cleanup

Have fun, and I hope someone finds a use for it!