The Formula Isn’t Wrong
“Put a bird on it.” Like an episode of Portlandia that never ends, tech companies think you want them to “put AI/ML on it”. Scripts, automated action, data analytics, multivariate analysis, and Machine Learning all have a place delivering value.
I can’t help but notice how the “Machine Learning” label is being slapped on products that I know aren’t using it the way they claim to.
So, you might be saying, “Joel… what’s the difference and why do I care if I get the outcome I want?”
Well, you should care, because not all “math” is created equal.
Statistics bring meaning from (or contextualize) data.
Suppose I’d like to improve my marathon time and get under 4 hours for those 26.2 miles. I’m analyzing my training runs to learn how to improve.
Here are my numbers:
- My mean pace over the last year is 10.5 minutes/mile
- That’s 26.2 miles in 4:35:06
- Here, it seems like I have a lot of work to do…
- If I use the mode, my pace over the last year is 9.75 minutes/mile
- That’s much better and has me completing the marathon in 4:15:27
- If I change the formula, I drop about 20 minutes
- If I use the median, my pace over the last year is 10.15 minutes/mile
- That’s worse and has me completing the marathon in 4:25:56
- Well, that’s not better, I’ll stick with Mode
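The conversion from pace to finish time is easy to check with a few lines of Python. This is a minimal sketch: the function name `finish_time` is my own, and rounding to the nearest second is an assumption.

```python
MARATHON_MILES = 26.2

def finish_time(pace_min_per_mile: float, miles: float = MARATHON_MILES) -> str:
    """Convert a pace in minutes per mile into an h:mm:ss finish time."""
    total_seconds = round(pace_min_per_mile * miles * 60)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

# The three paces from the list above.
for pace in (10.5, 9.75, 10.15):
    print(f"{pace} min/mile -> {finish_time(pace)}")

# Pace needed to finish in exactly 4 hours (240 minutes).
target_pace = 240 / MARATHON_MILES
print(f"sub-4 needs a pace under {target_pace:.2f} min/mile")
```

Whichever formula I pick, the goal stays the same: a sub-4-hour marathon needs a pace under roughly 9.16 minutes/mile.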
How can I get three different numbers representing how well I’m running? Which one is right? Because I don’t want to win on paper — I want to improve my numbers!
If you’ve used Excel formulas, you’ve had moments when you look at the calculated answer and think, “Well that just doesn’t make any sense.” It’s never because the formula failed to do math correctly, it’s because you picked the wrong cells to be calculated or you picked the wrong formula.
I repeat: It’s never the formula that’s wrong, it’s that you picked the wrong formula or the wrong inputs.
When should you use mean, median, or mode? I’m not sure I remember junior high math and each one gives me slightly or extremely different answers.
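As a refresher, here is a small sketch using Python’s standard `statistics` module. The paces below are hypothetical numbers I made up so the three measures disagree; they are not my actual training log.

```python
from statistics import mean, median, mode

# Hypothetical training paces in minutes per mile (illustrative only).
paces = [9.75, 9.75, 9.75, 10.0, 10.3, 11.0, 11.5, 12.0]

summary = {
    "mean": mean(paces),      # arithmetic average, pulled up by the slow runs
    "median": median(paces),  # middle value once the paces are sorted
    "mode": mode(paces),      # the pace that shows up most often
}

for name, value in summary.items():
    print(f"{name}: {value:.2f} min/mile")
```

Same inputs, three formulas, three answers. None of them is wrong; each one answers a different question about the same runs.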
I have to come back to, “what was my goal again?” Oh, right – run a marathon under 4 hours.
Financial experts are wizards with these statistics because they are experts in their context, they pick the right formulas, and they choose the right inputs. These three factors will be helpful as we move up the pyramid, so put them in your pocket. We’ll look at them again.
But I still haven’t unlocked the key to better running, so I’m going to upgrade my math.
The difference day to day in my running time is probably influenced by a number of factors. You might say there are multiple variables affecting my running time. Multivariate analysis lets me weigh the effect of these variables to find out which influences a faster pace and which slows me down.
Because I’m an obsessively detailed runner, I’ve kept logs, journals, and data on every run over the last year. I know the elevation changes of every run, what I ate the day before, my weight, the humidity, temperature, the clothing I wore, even whether I was getting along with my wife. There’s probably something in all of those variables that has affected my pace.
Multivariate analysis collects all these data points together and tells me how strongly each one relates to a faster pace (correlation). It’ll even hint at which ones might actually make me faster (causation), though it can’t prove that on its own. The results can be useful to tell me what to change, start, or stop doing:
- 15% of the time the temperature is above 70F and my pace is faster.
- 10% of the time I am fighting with my wife and my pace is faster.
- 20% of the time the humidity is low and my pace is faster.
Clearly, to improve my running, I should turn the temperature up in the basement where my treadmill is, purposely pick fights with my wife, and CLEARLY I need to only run when… hold on.
My treadmill’s in my house. The humidity is constant inside and outdoor humidity has no effect. That’s an immaterial correlation. Multivariate analysis does this a lot.
It doesn’t care what variables you feed it, it’ll crunch the numbers.
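To see how happily the math crunches an irrelevant input, here is a minimal sketch. It uses a plain Pearson correlation, the simplest building block rather than a full multivariate model, and every number in it is hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical log of six treadmill runs (a lower pace number means a faster run).
pace = [10.6, 10.4, 10.1, 9.9, 9.8, 9.7]
basement_temp_f = [64, 66, 69, 71, 72, 74]
outdoor_humidity = [80, 72, 65, 58, 50, 45]  # cannot affect an indoor treadmill

print("temperature vs pace:     ", round(pearson(basement_temp_f, pace), 2))
print("outdoor humidity vs pace:", round(pearson(outdoor_humidity, pace), 2))
```

Both inputs come back strongly correlated with pace. The formula has no way of knowing that one of them is physically irrelevant; only someone who knows the treadmill is in the basement can say that.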
Remember the three things I needed? Expert context, the right formulas, and the right inputs. Well, now I know I’ve got an erroneous input.
But did I get enough valuable inputs to make up for that erroneous input?
It seems reasonable that I should’ve accounted for my macronutrients, meal timing, and water ingestion. How many variables is enough?
If I turn up the heat and pick fights with my wife, will I run faster? Or are those not the key variables I should consider? What if I run slower when I’m sweaty and argumentative?
You’ll have encountered this confusion when Buzzfeed produces a story on ‘5 Red Wine Health Benefits’, detailing how scientists have discovered that red wine makes you taller, smarter, younger, richer, and wittier (or was that tequila?). Two days later, WebMD produces a story that says, ‘Beware Red Wine: a new study’, pointing to the same study to say the opposite thing.
Neither of these click-bait stories gets it right.
Nearly every study produced concludes with, “we don’t know all the factors that should have been tested, but testing these factors produced this answer.”
And that’s the problem with multivariate analysis: you never know which variables you *should* have included.
You’ve got expert context.
You’ve probably got the right formulas.
But you don’t know the right inputs.
And since I don’t know the right inputs, it’s time for a running coach.
But before that, let’s have Machine Learning take a crack at solving what I need to do to run my marathon in under 4 hours.
What is Machine Learning?
Machine Learning is made up of algorithms (machines) that get better through experience (learning). Machine learning just sounds snappier than “algorithmic experience.”
Sometimes that learning has no human interaction (unsupervised); other learning depends on a person to point the algorithms in the right direction (supervised).
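Here is a toy illustration of “getting better through experience” in the supervised sense. It is a sketch under assumptions: the labeled examples are hypothetical, and gradient descent on a straight line is just one simple learning method among many.

```python
# Hypothetical labeled examples: (weekly mileage in tens of miles, pace in min/mile).
examples = [(2.0, 11.0), (3.0, 10.5), (4.0, 10.0), (5.0, 9.5)]

def mean_squared_error(w, b, data):
    """Average squared gap between the line's predictions and the labels."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, steps, learning_rate=0.01):
    """Fit pace = w * mileage + b by gradient descent on squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = 2 / n * sum((w * x + b - y) * x for x, y in data)
        grad_b = 2 / n * sum((w * x + b - y) for x, y in data)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# More passes over the examples ("experience") should mean a smaller error.
for steps in (0, 10, 1000):
    w, b = train(examples, steps)
    print(steps, "steps -> error", round(mean_squared_error(w, b, examples), 4))
```

A person supplied the labels (the pace for each mileage), which is what makes this supervised. The algorithm’s only job is to keep adjusting itself until its guesses stop being embarrassing.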
And these algorithms are petty, insecure, and competitive. They keep score, brag about their achievements, and try to predict the future… at least that’s how I imagine numbers are in my mind…
Back to the marathon prep.
My Machine Learning coach has been watching Olympic marathon videos and analyzing their movements. It’s been watching what the NYC Marathon winners eat, drink, and wear. It’s tracked weather, altitude, pace, and even how close the final group was as they approached the finish line.
My diligent Machine Learning coach has taken all the data I’ve created, added it to the data it has learned, and found what matches and what doesn’t.
This coach tells me things about my running that I didn’t know I was doing — things I didn’t write down.
- “Your stomach is acidic every Saturday morning for your long runs, and you struggle between miles 15-17.”
- “Your left shoulder is sore and gets worse the closer you get to a race.”
- “You drive a car, not a truck or a van, and it’s a manual transmission.”
My Machine Learning coach is really Sherlock Holmes, pulled from the book and living in my computer.
And ML-Holmes continues his predictions, “You’ll improve your pace if you stop buying tomato sauce and olives, put your race timer on your right wrist or clip it to your belt, and have someone drive you to your races. Keep everything else the same and check back in a week.”
Sounds reasonable. Why not?
Expert context, correct formulas, right inputs
Like Sherlock Holmes, the Machine Learning wasn’t satisfied with the inputs that I’d given it. The ML has seen other examples and, dissatisfied with its answers (remember, these machines learn from experience), decided to collect my grocery list and my Apple Watch’s raw accelerometer data to find its own clues.
It knew from experience that many runners ‘carb up’, and loading up with carbs would likely mean eating pasta, and based on my antacid purchases, there might be an issue.
It suspected (predicted), then confirmed, that my gait breaks down, starting with repeatedly looking at my watch early in the race. This led to soreness and loss of upper-body form, but it noted I only had lower-body gait issues when my runs began somewhere that I had to drive to.
So, like Sherlock Holmes, Machine Learning created predictions about what data it could find, drew conclusions from those data, found the data, and verified its predictions.
“Most people look, Watson, few people really see.”
Machine Learning doesn’t have the ability to go and collect these data without being designed to do so in the context of its domain. Strong ML systems are only built by data scientists with deep insight into the problem being addressed, the context, and common-sense judgement.
Data Scientists are craftsmen, skilled at their trade and the tools they’ve acquired over the years, and immersed in the problem’s context.
With experience in the context, they also know the right algorithms (tools) to employ in competition with each other. Some variables will change in importance, but with the right machine variety, the engine is ready for any reasonable change.
But both conditions matter: the tools and the context.
Plumbers and carpenters both carry toolboxes, but their experience in different contexts is what determines their trade and whether they are of value to you when you need a leaky pipe fixed.
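One common way to read “algorithms in competition with each other” is model selection: several candidates make predictions, and the one that scores best on data it has not seen wins. The sketch below assumes that reading. The paces and the two toy predictors are hypothetical and are not a description of any particular product.

```python
from statistics import mean

# Hypothetical weekly long-run paces (min/mile); the last three are held out for scoring.
history = [10.9, 10.8, 10.6, 10.5, 10.3, 10.2, 10.0, 9.9]
known, holdout = history[:5], history[5:]

def predict_mean(past, weeks_ahead):
    """Candidate 1: assume the future looks like the historical average."""
    return mean(past)

def predict_trend(past, weeks_ahead):
    """Candidate 2: extend the average week-to-week change."""
    slope = (past[-1] - past[0]) / (len(past) - 1)
    return past[-1] + slope * weeks_ahead

def holdout_error(predictor):
    """Mean absolute error of a candidate on the held-out weeks."""
    errors = [abs(predictor(known, i + 1) - actual) for i, actual in enumerate(holdout)]
    return mean(errors)

candidates = {"mean": predict_mean, "trend": predict_trend}
scores = {name: holdout_error(fn) for name, fn in candidates.items()}
winner = min(scores, key=scores.get)
print(scores, "-> winner:", winner)
```

Knowing which candidates belong in the competition at all is the craftsman’s part. The scoring loop is just the toolbox.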
Why it Matters to You
This is why Lucidum’s Machine Learning shines. A data scientist with a toolbox full of algorithms but no experience in IT or security can’t help you solve your unknown assets problem. You’ve got Machine Learning, but you don’t have visibility. Without this intelligent design, you’re just throwing numbers at algorithms and hoping the result has value.
Artificial Intelligence: What do you mean by that?
You have to start with Alan Turing, a mathematician who worked to crack the Enigma code the Nazis used during World War II. At that time, ‘computer’ was a job description for the people who fed inputs into electro-mechanical devices. These electro-mechanical machines were called ‘bombes’. A little more significant when one of those crashes.
Turing posed the question, “What if machines could think?” (note: Turing, A.M. (1950). Computing Machinery and Intelligence. Mind 59: 433-460). This is how scientists compete. Not by building something no one else has but by thinking of something no one else has. And no fair trying, “what if fish could knit?”, that one’s been taken.
Like the smartest kid in school who has no friends, Turing immediately created impossible rules that had to be met before anyone could answer his question. Those rules we know as the ‘Turing Test’ and they demand that the machine think and act humanly and rationally.
If you’ve seen Data (for the Star Trek fans) or C-3PO (for the Star Wars fans) or heard of either (for those with families and better things to do) then you know what filmmakers think Turing meant. You also know that’s not just around the corner.
Because it’s humiliating to be pantsed by someone from 70 years ago, technologists decided to create new labels, describing Turing’s outcome as “Strong AI” so they could apply the AI label to things that… well… that were not AI.
For a number of years, teams touted their “Weak AI” until marketers got ahold of them and said, “No one’s excited about the adjective ‘weak’.” Then it was renamed “narrow AI”, which resulted in congratulations and much back-patting all around, because that’s also how European clothing fits and Europeans are cool.
Tell me what you want, what you really, really want
These degrees of applied mathematics each have a role where they shine. Remember Mean, Median and Mode?
Here’s what we know:
Statistics are descriptive and tell me what’s already happened.
Multivariate analysis can tell me how other factors influence what’s happened.
Machine learning ups the game with some predictive power and the ability to compete internally to improve outcome predictions and create new data that are amazing.
But these tools are only effective with expert context and the right inputs. What you need depends upon your goals.
Do you want to see everything in your enterprise? Then it’s not enough to run stats on what exists. Multivariate analysis won’t get you there.
You must have the invisible made visible, manageable, and actionable.
That’s what Lucidum’s patent-pending Machine Learning delivers, out of the box, without you being an expert. Limitless cyber asset visibility.
Oh and improving my pace? I switched to weight-lifting. Already have a journal going.