This time, let's build a decision tree with some data. There are many freely available data sets used to explore machine learning, such as the Iris dataset, in the UCI repository.
So let's try another one. The so-called wine dataset. This has three types of wine, with 13 attributes. Though many blogs list the attributes, I have been unable to find out what these three mystery types of wine are. They are three different Italian cultivars, but I have no idea what.
Rather than concentrating on building a decision tree to accurately categorise the wine, giving us a way to predict the type of another wine based on some or all of the 13 attributes, let's build a tree and see what it says.
These data sets are so common, they can be loaded directly from many machine learning packages, such as the python module sklearn. This also has a DecisionTreeClassifier.
from sklearn.datasets import load_wine
X = data.data
y = data.target
estimator = DecisionTreeClassifier(max_depth=2)
We asked for a maximum depth of 2, otherwise it makes a tree as deep (or high) as required to end up with leaves that are "pure" (or as pure as possible). In this case each is the same category of wine. Limiting the depth means it won't get as deep, or wide. But the first few layers will still show us which attributes are used to split up the data.
I say, "show", but we need to see the tree it's made. There are various ways to do this, but I'll use this:
from sklearn import tree
from IPython.display import SVG
from graphviz import Source
from IPython.display import display
graph = Source(tree.export_graphviz(estimator, out_file=None
, feature_names=labels, class_names=['0', '1', '2']
, filled = True))
Unfortunately, I've had to stick with class names, i.e. wine categories, of 0, 1 and 2, because I have no idea what they really are.
This generates the following picture:
The first line tells you the attribute and the cut off point chosen. For example, any wine with proline less than or equal to 755 goes down the left branch. The gini index is the measure used to decide which attribute or feature to split on. If you look up the decision tree classifier, you'll find other measures to try. The samples tell you how many at that node. We start with 178 wines, with 71 in class 1, with fewer in the other classes, so it reports class 1 at the first node.
For proline less than 755, we have 111 samples, still mostly in class 1. For proline greater than 755 we have 67 samples, mostly in class 0. These 67 samples can then be split on flavanoids. Anything less than 2.165 is class 2, according to this tree. Anything greater is class 0. We do have some class 2 wine on the left-most branch as well, however, I had a brief wander round the internet to read about flavanoids in wine.
In white wines the number of flavonoids is reduced due to the lesser contact with the skins that they receive during winemaking.
Is class 2 white wine? Who knows. It could be. The decision tree made this stand out far more clearly than looking directly at the input data.
I've put the code in a gist if you want to play around with it:
I think I've included everything here though.
I was planning on giving this as a lightning talk at the ACCU conference, but since it was cancelled this year, because of COVID-19, I wrote this short blog instead. If you can figure out what the types of wine are, get in touch.