Monday, 31 August 2020

Mutant algorithms

 The word "algorithm" has caused a storm in recent news in the UK. Due to COVID-19 school children were not able to sit their exams. This left 16 and 18 year olds waiting to see how they would be assessed, and had obvious implications for their academic or career futures.  As you may know, the grades were awarded based on an "algorithm", which our Prime Minister later described as mutant. According to the BBC news, he said "'Mutant algorithm' caused exam chaos." This begs the question, what does our PM think a mutant algorithm is?

The news in the UK has talked generally about the algorithm's inputs being course work and teacher's estimated grades. These are "mutated" (or adjusted) by the algorithm to take into account a school's performance over the last three years. This means schools whose pupils sometimes struggle are more likely to be down-graded. The precise details are buried in a 319 page report. Feel free to read it all and report back. TL;DR; Private and public schools tend to get higher grades than government run schools, so poorer pupils tended to get down-graded and richer pupils did not. Some form of mutation, or even perversion, perhaps of justice, but not in the algorithm. 

Now, some algorithms do use mutation. In fact genetic algorithms rely on mutation to seek out new solutions to problems. This is guided by a fitness function, to check the "mutant algorithm" is doing what's required. You can test such algorithms to see what they do, and keep an eye on them as they run to check they are heading the right way. You frequently spend a long time tuning parameters to get better results. This, on the face of it, has nothing whatsoever to do with the "mutant algorithm" our PM was talking about. 

There has also be a hint of slur on the programmers who wrote the algorithm, suggesting the idea was good and proper but the naughty programmers took it upon themselves to do something completely different that got out of hand, like a Marvel movie. Think Magneto (naughty programmers) versus Charles Francis Xavier (sensible people like, our PM? Go figure). I am sick of programmer bashing and the general misunderstanding of algorithms.

Where a genetic algorithm uses mutation, or a Monte Carlo simulation uses random numbers as input, it is still possible to test the algorithm is doing what you require. Programmers should never abdicate responsibility for what they have built. However, it is highly irresponsible of the news to allow propaganda and misrepresentations to flourish like this. 

A while ago, the Imperial College model for COVID-19 was open-sourced. At the time many people raised bug reports against it. One rumour suggested that running it twice with the same seed for the pseudo-random numbers would produce different results. Now, that might be described as a "mutant algorithm", but we'd usually describe this as buggy code. I don't believe our PM has the technical know-how to spot buggy code, but I'm willing to help him out if he wants. I'm also willing to be interviewed by the BBC to explain some of these technical issues in more detail, if they are interested. Or I could find other technical people who could equally well help out.

DM me.

https://twitter.com/fbuontempo


 



Sunday, 29 March 2020

Can a decision tree tell us about wine categories?

I previously wrote an overview showing how decision trees work: http://buontempoconsulting.blogspot.com/2019/07/decision-trees-for-feature-selection.html

This time, let's build a decision tree with some data. There are many freely available data sets used to explore machine learning, such as the Iris dataset, in the UCI repository.

So let's try another one. The so-called wine dataset. This has three types of wine, with 13 attributes. Though many blogs list the attributes, I have been unable to find out what these three mystery types of wine are. They are three different Italian cultivars, but I have no idea what.

Rather than concentrating on building a decision tree to accurately categorise the wine, giving us a way to predict the type of another wine based on some or all of the 13 attributes, let's build a tree and see what it says.

These data sets are so common, they can be loaded directly from many machine learning packages, such as the python module sklearn. This also has a DecisionTreeClassifier.

So,

from sklearn.datasets import load_wine
X = data.data
y = data.target
estimator = DecisionTreeClassifier(max_depth=2)
estimator.fit(X, y)

We asked for a maximum depth of 2, otherwise it makes a tree as deep (or high) as required to end up with leaves that are "pure" (or as pure as possible). In this case each is the same category of wine. Limiting the depth means it won't get as deep, or wide. But the first few layers will still show us which attributes are used to split up the data.

I say, "show", but we need to see the tree it's made. There are various ways to do this, but I'll use this:

from sklearn import tree
from IPython.display import SVG
from graphviz import Source
from IPython.display import display

graph = Source(tree.export_graphviz(estimator, out_file=None
   , feature_names=labels, class_names=['0', '1', '2']
   , filled = True))
display(SVG(graph.pipe(format='svg')))

Unfortunately, I've had to stick with class names, i.e. wine categories, of 0, 1 and 2, because I have no idea what they really are.

This generates the following picture:


The first line tells you the attribute and the cut off point chosen. For example, any wine with proline less than or equal to 755 goes down the left branch. The gini index is the measure used to decide which attribute or feature to split on. If you look up the decision tree classifier, you'll find other measures to try. The samples tell you how many at that node. We start with 178 wines, with 71 in class 1, with fewer in the other classes, so it reports class 1 at the first node.

For proline less than 755, we have 111 samples, still mostly in class 1. For proline greater than 755 we have 67 samples, mostly in class 0. These 67 samples can then be split on flavanoids. Anything less than 2.165 is class 2, according to this tree. Anything greater is class 0. We do have some class 2 wine on the left-most branch as well, however, I had a brief wander round the internet to read about flavanoids in wine.
Wikipedia says

In white wines the number of flavonoids is reduced due to the lesser contact with the skins that they receive during winemaking.

Is class 2 white wine? Who knows. It could be. The decision tree made this stand out far more clearly than looking directly at the input data.


I've put the code in a gist if you want to play around with it:
https://gist.github.com/doctorlove/bf6e42658d5806a61669a844b885983b

I think I've included everything here though.



I was planning on giving this as a lightning talk at the ACCU conference, but since it was cancelled this year, because of COVID-19, I wrote this short blog instead. If you can figure out what the types of wine are, get in touch.