**** Very Sick Company ****

First rule of Very Sick Company: never talk about book club.

Python List Comprehension with Plotly

May 10, 2019


This blog will assume you have the following skill-levels:

Python: Medium

Plotly: Basic

I always seem to be finding bizarre corners of Python that stretch my intellectual abilities way past their natural sense of safety.


This time, I’m going to attempt to explain TWO concepts mashed into one very complicated (but powerful) tool.  Amazingly, I couldn’t find this approach in any tutorial online.  Maybe I didn’t look hard enough, but either way, here’s my attempt.

First let’s have a quick introduction to Python List Comprehension.  The name is daunting, but the concept itself is less so.

List Comprehension is meant to replace a for-loop when creating new lists.  For example, a regular, straightforward way to break up a string into its constituent characters might go something like this:

myLetters = []
for letter in 'giraffe':
    myLetters.append(letter)
print(myLetters)

This would create a new list (myLetters) containing this:

['g', 'i', 'r', 'a', 'f', 'f', 'e']

List Comprehension can do this in a slightly ‘simpler’ way.  Example:

myLetters = [letter for letter in 'giraffe']
print(myLetters)
['g', 'i', 'r', 'a', 'f', 'f', 'e']
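
As a quick aside, you can also bolt a condition onto the end to filter while you build:

myVowels = [letter for letter in 'giraffe' if letter in 'aeiou']
print(myVowels)
['i', 'a', 'e']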

It is philosophically similar to Python’s lambda operator.  You don’t HAVE to use it, but it can come in very handy. And here’s an example of when that’s true.

Plotly.  It’s amazing, but it can also be bloody difficult to maneuver.  Especially when you’re as stupid as me. And by the very fact that you’re reading MY blog, I have to assume you are too.

We’re going to assume that I’m analyzing a set of data that contains a series of temperature measurements.  The set includes the day of the week that each temperature was taken as well as the time of that day.

We want to plot the data with 7 different lines, each one representing the day of the week.  The x axis will be TIME, and the y axis will be TEMPERATURE.

Each day is represented many times in the data set, so we really want to split the data up by DAYS so that we can draw each line appropriately.

data = [{
    'x': df[df['DAY']==day]['TIME'],
    'y': df[df['DAY']==day]['AVERAGE_TEMP']
} for day in df['DAY'].unique()]

Let’s rip apart this bizarre tiny thing and make some sense of it…

First off, let’s break out the List Comprehension and ignore the guts for now:

data = [day for day in df['DAY'].unique()]

In our example, this will simply create a list of all the unique day-names:

['TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY', 'SATURDAY', 'SUNDAY', 'MONDAY']

Now for the middle stuff.  I haven’t specified WHAT is in these data objects, but that doesn’t matter for getting the concept:

'x': df[df['DAY']==day]['TIME'],
'y': df[df['DAY']==day]['AVERAGE_TEMP']

This is the dictionary that gets built for each unique ‘day’ value in the List Comprehension: we filter the DataFrame down to that day’s rows, then grab the times for x and the temperatures for y.  Feed the resulting list to Plotly and you get 7 coloured lines, each one representing a day of the week.
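
Here’s the whole thing stitched together as a runnable sketch.  (The DataFrame and its TIME/DAY/AVERAGE_TEMP columns are fabricated here just to match the example; swap in your own data.)

import numpy as np
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import plot

# Fabricate a toy data set: 7 days x 24 hourly temperature readings.
days = ['MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY',
        'FRIDAY', 'SATURDAY', 'SUNDAY']
df = pd.DataFrame([
    {'DAY': day, 'TIME': hour,
     'AVERAGE_TEMP': 15 + 10 * np.sin(hour / 24 * np.pi) + np.random.randn()}
    for day in days for hour in range(24)
])

# One trace per unique day -- the same List Comprehension, with a
# Scatter object in the middle instead of a bare dictionary.
data = [go.Scatter(x=df[df['DAY'] == day]['TIME'],
                   y=df[df['DAY'] == day]['AVERAGE_TEMP'],
                   mode='lines',
                   name=day)
        for day in df['DAY'].unique()]

plot(go.Figure(data=data))   # opens the chart in your browser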


DZone Love

April 24, 2019


Wow, I’m chuffed.  After about 20 rejection letters, DZone has just published FIVE of my articles:

Cleaning Data 102 – Pesky Texty

Where to Start with a New Data Problem

Cleaning Data 101 – Imputing NULLS

Coefficiently Confused

Cleaning Data – Supervised Styling

I may not be a Voltaire yet, but it’s a pretty nice honor.

DZone is a very useful (and obviously highly discriminating) site for anyone working with or learning Data Science.  I highly suggest checking it out.


5 Ways that Writing a Tech Blog can Make you Smarter

April 4, 2019



Anyone reading this article has probably changed careers several times.  Sometimes because of evolutions in technology, sometimes because of industry changes, and sometimes just out of boredom or curiosity.  Plenty of you don’t even know you’ve changed; remember ActionScript 2? You’re probably using ActionScript 3 now. That was a career change. A significant one. But you probably still call yourself a Flash developer.

Most people on this planet don’t have jobs that need neural updating every 5 minutes.  If you work as a production supervisor at a shipyard, or as a GP in a family practice (or any of the other professions I wish to god I’d chosen instead), your job description and responsibilities will probably evolve at a generational rate.

When you work in tech you don’t live that dream.

Unfortunately I’ve never had a learning strategy.  You could probably fit the entirety of my university notes on a single-sided 4×6 index card.  Professionally it evolved somewhat: I would simply commit to a job I knew nothing about, then sweat myself to the finish line (a USB-powered defibrillator humming next to me).

None of this is healthy.

Recently, I’ve found something much better.  It’s called blogging. You might have heard of it.  I can guarantee that this simple exercise can make you smarter.


1) Teaching Stuff is Hard

Explaining a concept to another human being is difficult.  It takes patience and a willingness to answer stupid questions with a smile.  Often those stupid questions turn out to be deep and brilliant insights.

Writing with an audience in mind does three things: it surfaces subtle details, highlights the things you don’t completely understand yet, and reinforces the concepts you do.

2) Built-in Editors and Fact Checkers

Some call them trolls, but if someone is willing to read your ramblings and then contradict or correct them, don’t take it as an affront.  This is free editing, my pretties. Do you know how much $$ a professional editor costs? Me neither, but it’s probably more than free. Trolls and nerdlings love to correct things; it makes them feel smart.  But at the end of the day you’re the one who will benefit.

3) Cheat Sheets in Your Own Words

I don’t trust myself as far as I can throw me (I’m pretty chubby and weak), but there are times in life when you have to trust your past self, because it’s spent a considerable amount of time trying to help your future self.

That is way too cerebral.

Reading an explanation of something in your own words can reinforce a concept in seconds.  Whether it’s well written or smeared on a wall in offal, they are your words and I guarantee that you will understand them.  In fact, one of my most useful resources these days is my own blog. Not because I’m amazing and self centred (although I am), but because I can immediately recover a concept by re-reading my own explanation.

4) Finding Colleagues

I’m not great at making friends.  I have, and have had, great friends, people who would move bodies for me.  Human ones.  But these are friends I’ve made during life, not over a coffee discussing logistic regression.  You need some nerdling friends too. You are one of them now, and some of them are pretty cool.  If you have a blog you have an instant ice breaker: you’re both in the same predicament.

It also gives you street cred.  No matter how basic or advanced your writing is, the fact that you ARE writing provides a whiff of validity; you must know what you’re talking about.

These new friends will inevitably give you tips, because nerdlings love nothing more than showing someone how smart they are.  And free tips should never be sniffed at.

5) Grammar

This may be the most important one.  Whether you’re new to English or have been speaking it your whole life, practicing writing is a must.  

People judge, so working on your basic grammar is a big deal.  I’ve laid sweeping judgements upon people for simple grammatical errors like ‘it’s’ instead of ‘its’ or ‘wear’ instead of ‘where’.  That judgement isn’t fair, I admit, but it’s the reality.

We all have bad grammar habits, and these become much clearer when you re-read your posts.  Small errors pop out like red flags. This drives me interminably crazy, because I’m constantly finding and correcting small mistakes.  But at the end of the day you’re going to be better off for it.

So go.  Now. Write.  No one will make fun of you.  If anything, they’ll be extremely jealous that they don’t have the guts to do the same thing.  You are under zero journalistic obligation to get everything right or perfect. You have the technological advantage of an amendable medium: you can go back and update or fix stuff whenever you want. If you wait for the moment when your article is 100% perfect, you will end up with exactly zero blog posts.  There’s nothing wrong with modifying something when new information comes in.

Posting your writing publicly will give you a sense of progress and make you realize that you are actually learning something!


Tension with TensorFlow

March 27, 2019


I’ve just started dabbling in TensorFlow, which is an open source library for building Neural Networks (as well as other high-powered computational stuff).

Neural Networks are something I learned about years ago, but of course back then it was mostly mathematical theory – we didn’t really have the tools to see them work in real time.

The funnest site I’ve found in a long time belongs to TensorFlow.  It’s got this amazing Neural Network Playground where you can screw around with neurons and watch them try to figure out how to classify patterns of data points.

You don’t need to know anything to get started.  Just click the big round play button at the top, and without even choosing any particular settings you can watch it go.

I strongly suggest checking this out.  I’ve wasted literally HOURS playing with it…


Vector Based Languages

March 8, 2019


After working in data science for a while, there’s one concept I began to take for granted: Vectorization.

I picked up the term Vectorization from R.  It goes by other names, but I like Vectorization because it sounds cool.

In a normal programming language, if you want to multiply two arrays together element by element, it can be quite a grind.

Let’s say you want to do this in regular ‘ole Python (or C or any other ‘normal’ language); you’d have to grind through for-loops, like this:

d = [1, 2, 2, 3, 4]
e = [4, 5, 4, 6, 4]
f = []
for x in range(len(d)):
    f.append(d[x] * e[x])
print(f)
[4, 10, 8, 18, 16]

That’s all fine and good, but now imagine doing that with 2D matrices.  Or multiple arrays.  Or performing even more complex math on any of them.

In a Vector Based Language, you don’t have to go through that whole rigamarole.  Instead you can just do this:

import numpy as np

d = np.array([1, 2, 2, 3, 4])
e = np.array([4, 5, 4, 6, 4])
print(d * e)
[ 4 10  8 18 16]

Vector Based Languages let you perform mathematical functions on entire lists or matrices as though they were single objects.

d = np.array([[1, 2, 2, 3, 4],
              [3, 2, 8, 7, 12],
              [11, 21, 26, 3, 43]])
e = np.array([[4, 5, 4, 6, 4],
              [13, 21, 21, 31, 24],
              [51, 12, 22, 31, 46]])

print(d * e)
[[   4   10    8   18   16]
 [  39   42  168  217  288]
 [ 561  252  572   93 1978]]

With a vectorized language, like R or Python with NumPy, you can do these kinds of calculations simply, without worrying about the underbelly of the process.
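
To show what I mean by “more complex math”, here’s a quick sketch of some fancier, loop-free column math on that same matrix:

import numpy as np

d = np.array([[1, 2, 2, 3, 4],
              [3, 2, 8, 7, 12],
              [11, 21, 26, 3, 43]])

normalized = d / d.max(axis=0)   # divide every column by its own maximum
print(np.sqrt(normalized))       # element-wise square root, still no loops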


Thank Thor for this technology. Staring at endless nested for-loops would cause me to pull my eyeballs out.

Again, I’d completely lost my appreciation for this important construct, because once you get knee-deep in NumPy or R it just fades into the furniture.  Just wait until you get back to your C programming!  Then you’ll appreciate it…


Creepy Ways to Invoke a Function in Python – Lambda

February 21, 2019


Whenever I begin learning a new language, I immediately get super cocky and say out loud, “I know everything, I’m a genius, how hard can it be!?”

… before those beastly bus terminal cops come and ask me to leave.

But, as always, while snoozing under the bridge covered in my own soil, I perk up and realize that this particular language has handed me a challenge.


Python has a couple of very sexy ways of invoking a function.

Even the name is sexy, isn’t it? Python.

I never digress.

Sure you can just define a function…

def hello(first_name, last_name):
    print("Hello World " + first_name + " " + last_name)
    return

Then call it…

hello("Matt", "Hughes")

But that is so ’90s.


“Hey nerd, did that function come with a free Beanie Baby??”

And ya, I know Python was invented in the ’50s to defeat Hitler’s Spanish Armada, but sometimes old things can still seem new.

Upon first exploring the lambda operator my brain contracted and shat out the word “No” several times.

(Shat is not a curse word.)

The lambda operator is a quick and useful way of declaring, using and throwing away a small function…

f = lambda x, y: print("Hello World " + x + " " + y)
f("Matt", "Hughes")

The general expression of a lambda function is this…

lambda argument_list: expression

The lambda operator is useful when you just want a simple calculation done on a set of data ONCE. You don’t have to name it, and you can use it with functions like map or filter, which makes it an extremely powerful tool…

myList = [0, 1, 2, 3, 4, 5, 6, 7]
result = map(lambda x: x * 2, myList)
print(list(result))
[0, 2, 4, 6, 8, 10, 12, 14]

The above code block maps your little lambda function over the list, multiplying everything in it by 2.  (Note that in Python 3, map returns an iterator, hence the list() call before printing.)
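
The same trick works with filter, which keeps only the items your lambda approves of:

myList = [0, 1, 2, 3, 4, 5, 6, 7]
evens = filter(lambda x: x % 2 == 0, myList)
print(list(evens))
[0, 2, 4, 6]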

Some developers avoid this technique, mainly for readability reasons.  They’ll claim that code like this is more difficult to study than code with obvious, declared functions.

On the other hand, many will contend that the lambda function is MORE readable and constructive than its generic counterpart.  In my opinion it comes down to the situation.

A lambda call could theoretically stretch out to hundreds or even thousands of characters.  I’m not sure you’ll find anyone claiming that this makes code easier to follow.  Quite the contrary, and it’s probably a good spot to construct a function in its originally intended shape.  (Or an entire class, but let’s not get into that just yet.)

However you end up using it, lambda can be a fantastic, fast (and fun) alternative.


Confusion Matrix – Confused Yet?

February 5, 2019


Bahaha.  I love the name of this thing.  I’m sure the stats world is pulling a fast one.

This is actually not super complicated, but for some reason I can never remember which is a Type 1 Error and which is a Type 2 Error.  I suspect it’s because of all the fentanyl my mother did while she was pregnant.

Just joking.  She’s a lovely woman and never went any further than good old fashioned heroin.

If you retained anything from Uni-stats, you might remember that there are 4 types of outcomes when doing an experiment.  Two of these are correct outcomes, two are errors.

True Positive: This is a correct outcome.  It predicts that something is TRUE when in fact it actually is TRUE.  i.e. “A cancer test comes back positive and the cancer is there”.

True Negative:  This is also a correct outcome.  It’s when you predict something is FALSE when in fact it is actually FALSE.  i.e. “A cancer test comes back negative, and there is no cancer there”.

False Positive:  This is an error.  We predict something is TRUE when in fact it is really FALSE.  i.e. “A cancer test comes back positive, when in fact the person doesn’t have cancer.”  Also called a Type 1 Error.

False Negative: This is also an error.  It’s when we predict something is FALSE when it actually is TRUE.  i.e. “A cancer test comes back negative, but the person actually does have cancer.”  Also called a Type 2 Error.

A confusion matrix is simply a way to visualize the results of an experiment.  Example:

Out of 165 test results we’ve had 150 successful ones.  50 of them were predicted to be NO and were in fact NO. 100 of them were predicted to be YES and were in fact YES.  

We’ve also had 15 Errors; 10 were Type 1 Errors and 5 were Type 2 Errors.
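
Laid out as an actual matrix (rows are what really happened, columns are what we predicted), that example looks like this:

              Predicted NO    Predicted YES
Actual NO     TN = 50         FP = 10
Actual YES    FN = 5          TP = 100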

From these numbers you can start to calculate a whole slew of statistics.  I won’t go over ALL of them, (you can look that up), but here are a couple:

Accuracy: The number of correct results divided by the total number of results.  In this example 150/165 = 0.91 accuracy.

Misclassification Rate: Essentially this is just the opposite of accuracy.  15/165 = 0.09 misclassification.  (You could also just note that this is 1 minus the accuracy.)
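
If you’d rather make Python do the arithmetic, here’s a sketch using scikit-learn.  I’m rebuilding the example above as raw YES/NO labels; the counts are the only thing that matters:

import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild the example as raw labels: 50 TN, 10 FP, 5 FN, 100 TP.
y_true = np.array([0] * 60 + [1] * 105)   # 60 actual NO, 105 actual YES
y_pred = np.array([0] * 50 + [1] * 10 + [0] * 5 + [1] * 100)

cm = confusion_matrix(y_true, y_pred)
print(cm)
[[ 50  10]
 [  5 100]]

accuracy = np.trace(cm) / cm.sum()   # correct results / total results
print(accuracy)                      # 0.909...
print(1 - accuracy)                  # misclassification rate, 0.0909...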

As I said, there are a slew of other stats you can come up with from a Confusion Matrix, and they are all as simple to calculate as our example.


Data Visualization

January 30, 2019


There’s a part of me that detests data visualization.

There’s nothing wrong with it.  In fact it’s a part of this job and profoundly important to communication.

It just gets annoying dressing up my beautiful data so that some fool understands it; or should I say – LOOKS at it.

Here you go sir, here’s some important data that could significantly affect the future of your company (hand puppet on), and lookit sweety, it’s dressed up with the Diamond Steel Sunset palette.  Now you’re interested right?

Who wipes your poopy for you?


If you’ve ever traveled across America (and you bloody should, it’s bloody amazing) and are somewhat literate, you must have come across that USA Today rag.  The one that gets lovingly comped right next to the waffle maker first thing in the morning.

USA Today are the kings of deep data visualization featuring anything from Illegal Arms Sales in Africa to What Nation Eats the Most Pasta.

I love these visualizations.  I’m not being sarcastic; they are genuinely very fun to pore over when you’re nursing a hangover with a hot cup of brewed swill.

However they do provide a lesson; you can make data look like anything if you douse it in enough symmetry, carbohydrates or glam.

Python (segue!) is packed with enough visualization tools to make any product manager swoon.  So many, in fact, that I’m somewhat overwhelmed. Here’s a quick list of some I’ve been messing with.

Matplotlib


As close to native as it gets; it ships with pretty much every scientific Python distro, even the one you downloaded in 2001 during the Cuban missile crisis.  It’s definitely not the prettiest, but it works great when you (a real engineer) need some quick answers.

Pandas


Not quite native, but easily installed (plus, you should probably have Pandas in any data problem anyway).  Aesthetically it’s a step up from matplotlib and is pretty easy to implement, as the sketch below shows.  There are a tonne of tutorials and examples online.
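
For instance (a minimal sketch with made-up numbers), you can plot straight off a DataFrame:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'HEIGHT': [300, 200, 126, 567, 420, 189]})
df.plot(kind='bar')   # pandas wires the plot up for you
plt.show()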

Seaborn


Definitely the cutest of them all so far, and ridiculously simple to use.  In fact I strongly suggest checking out the Seaborn Examples Gallery.  It’s rammed with visualization examples and all the code you’ll need to make them happen.
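
To give you a taste, here’s a tiny sketch using one of seaborn’s demo data sets (fetched automatically on first use):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')                      # built-in demo data set
sns.scatterplot(x='total_bill', y='tip', data=tips)  # one line, done
plt.show()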

R


Don’t forget R.  I kind of have, and I’m still not a massive fan, but the language does support all of your bog-standard plots.  I find it trickier to use than the Python alternatives, but I also thought goats were female sheep.

Tableau

I’m not going to tackle Tableau right now; that would just continue my medicated rant…  We’ll save that for another time my pretties.


Data Normalization 101 – MinMaxScaler

January 6, 2019


As a part of any data cleanup process you’re probably going to want to normalize your data.

Like in golf: if you’re playing against Tiger Woods, you’ll want to give him a handicap, just so he won’t totally blow you away and the game might still be fun.

Same with data: sometimes you want all things equal.

Assume you’re trying to compare the height of a building and its internal temperature.  The two different units are fine, except that when you’re trying to teach a machine about the data, you may want the units to share some sort of equal measure.  (300 meters vs. 26 degrees makes sense to us, but the raw magnitudes are clearly incongruous.)

There are many different ways of doing this.  Personally I like the MinMaxScaler class in the sklearn package.  First off, it gives you pretty straightforward results (everything ends up being between 0 and 1), and secondly, it’s the only one I know how to use so far.

So take it or leave it my pretties…

Let’s make a numpy array.  Here are 6 tuples of fictional height/temperature data:

import numpy as np

buildingData = np.array([[300,24],[200,21],[126,18],[567,27],[420,19],[189,30]])
print(buildingData)

=

[[300  24]
 [200  21]
 [126  18]
 [567  27]
 [420  19]
 [189  30]]

As I said, we’ll be using sklearn to do this stuff, so first you’ll need to import the MinMaxScaler class:

from sklearn.preprocessing import MinMaxScaler

Then we need to find the largest and smallest values in each column of the data set:

scaler_model = MinMaxScaler()
scaler_model.fit(buildingData)

Then scale the data appropriately:

scaled_data = scaler_model.transform(buildingData)

This takes the highest and lowest values in each column, turns them into 1 and 0 respectively, then squeezes every other value into a relative number in between: each value becomes (x - column_min) / (column_max - column_min).  So we get:

print(scaled_data)

=

[[ 0.39455782  0.5 ]
 [ 0.16780045  0.25 ]
 [ 0.          0. ]
 [ 1.          0.75 ]
 [ 0.66666667  0.08333333]
 [ 0.14285714  1. ]]

Voila: all of your data is (proportionally speaking) the same as it was before, but reducing it to relative values between 0 and 1 lets a machine process it appropriately.
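
One last shortcut (a quick sketch): fit() and transform() can be collapsed into a single fit_transform() call, which does both steps in one shot:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

buildingData = np.array([[300, 24], [200, 21], [126, 18],
                         [567, 27], [420, 19], [189, 30]])

scaled_data = MinMaxScaler().fit_transform(buildingData)   # fit + transform in one go
print(scaled_data)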


Kaggle

December 26, 2018


This site is amazing.  I might be out of the loop because I’ve only just recently discovered it.

Inevitably some antiseptic nerdling will scream about how behind the times I am and how she can’t believe I haven’t seen the new Han Solo movie yet.


Well, I also just discovered the dark web, but I’m scared of that.

Kaggle is the center of the Universe when it comes to learning Data Science.  First off, it’s got a Datasets section packed with stuff you can practice on (including the Titanic set we looked at earlier), as well as related problems you can try to solve.

The exciting part, however, is the Competitions section.  Here you can pit your brains against other nerdlings to try and solve more complex problems.  There are even companies willing to pay some decent $$ for help with their data.


Who wants to have a go??  I’ll crush your little brains my pretties!