Monday 31 December 2018

Does machine learning really involve data?

Many definitions of machine learning start by proclaiming that it uses data to learn. I want to challenge this, or at least remind us where the term originally came from and consider why the meaning has shifted.

For a long time machine learning seemed to be a new technology, but I notice we're starting to say AI and machine learning interchangeably. Job postings often sneak the word scientist in there too. What is a data scientist? What do any of these words mean?

Current trends often come with an air of mystery. I suspect a lot of data science roles involve data entry, in order to clean input data. Not as appealing as the headline role suggests. Several day-to-day techniques described as machine learning could also be described as statistics. In fact, look at the table of contents of a statistics book, such as An Introduction to Statistical Learning. Here is a small selection of the topics:

  • accuracy
  • k-means clustering
  • making predictions
  • cross-validation
  • support vector machines, SVM
  • principal component analysis, PCA


Most, if not all, of these topics are covered in an average machine learning course and included in ML software packages. Yet statistics doesn't sound as exciting as machine learning, to many people.

Wikipedia defines statistics as "a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation." No mention of learning, though each of these activities forms an essential part of data science. The article goes on to discuss descriptive and inferential statistics. Inference involves making predictions: many people use the term machine learning to mean the very same. Can you spot patterns in purchases automatically and suggest other items a customer might be interested in? Can you detect unusual or anomalous behaviour, indicating fraud or similar? Again, these are now labelled as AI or machine learning, but usually rely on well-established statistical techniques. Admittedly, today's faster machines mean number crunching can happen quickly. This has contributed to the resurgence of machine learning.
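
As a small illustration of how far plain statistics can go, here is a minimal Python sketch that flags unusual values using nothing more than a mean and a standard deviation. The function name find_outliers, the threshold of two standard deviations and the purchase amounts are all made up for this example; it is not how any particular fraud detection system works.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    if len(values) < 2:
        return []  # stdev needs at least two data points
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]


# Hypothetical purchase amounts: one is very different from the rest.
purchases = [10, 11, 9, 10, 12, 10, 11, 95]
print(find_outliers(purchases))
```

No learning, in any grand sense, is involved: it is a descriptive statistic and a cut-off.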

Many problem solving algorithms are not about numbers. Some techniques, such as evolutionary computing, including genetic algorithms, don't fit comfortably into a data-driven view of learning. Do these methods count as machine learning? I'll leave that for you to think about. My book explores genetic algorithms and several other areas that do not need numbers to learn.

Arthur Samuel coined the phrase "machine learning", by which he meant something along the lines of a "field of study that gives computers the ability to learn without being explicitly programmed." The abstract of his 1959 paper, "Some studies in machine learning using the game of checkers", states:

Two machine-learning procedures have been investigated in some detail using the game of checkers. Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program. Furthermore, it can learn to do this in a remarkably short period of time (8 or 10 hours of machine-playing time) when given only the rules of the game, a sense of direction, and a redundant and incomplete list of parameters which are thought to have something to do with the game, but whose correct signs and relative weights are unknown and unspecified. The principles of machine learning verified by these experiments are, of course, applicable to many other situations.

AI and machine learning are both very old terms. I think they encompass a much broader field than data analysis. As a final thought, Turing designed an algorithm to play chess. In effect, he was trying to make an artificial brain, before the term AI was invented or computers, in their modern sense, existed.

I think machine learning is much broader than investigating data. Its history involves attempting to get computers to learn, and specifically to learn to play games. Let the games continue.


Read my book and see what you think.


Saturday 1 December 2018

I wrote a book

I've written a book pulling together some of my previous talks showing how to code your way out of a paper bag using a variety of machine learning techniques and models, including genetic algorithms.
It's on pre-order at Amazon and you can download free excerpts from the publisher's website.

The sales figures show I've sold over 1,000 copies already. I'm going through the copy edits at the moment. I can't wait to see the actual paper book.

Thank you to everyone at ACCU who helped and encouraged me while I wrote this.

In 2019 I will be giving some talks at conferences, and hopefully some meetups, based on ideas from some of the chapters.

Watch this space.

Tuesday 29 May 2018

Gitlab certificates

On Ubuntu, cloning a repo from a machine you don't have a certificate for will give the error:

fatal: unable to access 'https://servername': server certificate verification failed. CAfile: /etc/ssl/certs/your_filename CRLfile: None

You can work around this by telling git not to verify the certificate, e.g.

git config --system http.sslverify false


which is asking for trouble. However, you can install the certificate, so you don't need to keep doing this.

Following the answer at https://stackoverflow.com/questions/21181231/server-certificate-verification-failed-cafile-etc-ssl-certs-ca-certificates-c looks to have worked, trying things one step at a time:

hostname=gitlab.city.ac.uk
port=443
trust_cert_file_location=`curl-config --ca`
sudo bash -c "echo -n | openssl s_client -showcerts -connect $hostname:$port \
    2>/dev/null  | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p'  \
    >> $trust_cert_file_location"

I did try this first, so errors don’t end up in /dev/null:

openssl s_client -showcerts -connect $hostname:$port


Also, I first got the error sed: unrecognised option '--ca'
It took a moment to realise the --ca came from curl-config, which I needed to install.
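
To see what the sed filter in that pipeline keeps, here is a minimal Python sketch of the same idea: print only the lines from a BEGIN CERTIFICATE marker through to the next END CERTIFICATE marker. The function name extract_certificates and the sample text are made up for illustration; the sample only imitates the shape of openssl s_client output and contains no real certificate.

```python
def extract_certificates(lines):
    """Keep lines from each BEGIN CERTIFICATE marker to the next END CERTIFICATE marker."""
    kept = []
    inside = False
    for line in lines:
        if not inside and "-BEGIN CERTIFICATE-" in line:
            inside = True
            kept.append(line)
        elif inside:
            kept.append(line)
            if "-END CERTIFICATE-" in line:
                inside = False
    return kept


# Made-up stand-in for the command's output; the certificate body is a placeholder.
sample_output = """CONNECTED(00000003)
Certificate chain
 0 s:CN = example
-----BEGIN CERTIFICATE-----
PLACEHOLDERBASE64
-----END CERTIFICATE-----
Server certificate
"""

for line in extract_certificates(sample_output.splitlines()):
    print(line)
```

Everything outside the markers is dropped, which is why running the openssl command on its own first is useful: that is where any error messages appear.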

Thursday 24 May 2018

Windows batch files

I've been writing a batch file to run some mathematical models over a set of inputs.
The models are software reliability growth models, described here.

We are using
  • du: Duane
  • go: Goel and Okumoto
  • jm: Jelinski and Moranda
  • kl: Keiller and Littlewood
  • lm: Littlewood model
  • lnhpp: Littlewood non-homogeneous Poisson process
  • lv: Littlewood and Verrall
  • mo: Musa and Okumoto
Littlewood appears many times: he founded the group where I currently work. 

So, far too much background. I have one executable for each model, after making a makefile; yet another story. And a folder of input files, named as f3[some dataset]du.dat, f3[some dataset]go.dat, ... f3[some dataset]mo.dat. I also have some corresponding output files someone else produced a while ago, so in theory I can check I get the same numbers. I don't, but that's going to be yet another story.

You can also use the original file and generated file to recalibrate, giving yet another file. Which I have previously generated results from. Which also don't match. 

I wanted to be able to run this on Ubuntu and Windows, and managed to make a bash script easily enough. Then I tried to make a Windows batch file to do the same thing. I'll just put my final result here, and point out the things I tripped up on several times.


ECHO OFF
setlocal EnableDelayedExpansion
setlocal 


for %%m in (du go jm kl lm lnhpp lv mo) do (
  echo %%m
  for %%f in (*_%%m.dat) do (
    echo %%~nf
    set var=%%~nf
    echo var: !var!
    set var=!var:~2!
    echo var now: !var!

    swrelpred\%%m.exe %%~nf.dat "f4!var!"
    swrelpred\%%mcal.exe %%~nf.dat "f4!var!" "f9!var!"
  )
)


1. First, turn the echo off because there's way too much noise otherwise.
2. Next, enable delayed expansion, otherwise variables in blocks get expanded when the block is parsed and therefore don't change in the loop: "Delayed expansion causes variables delimited by exclamation marks (!) to be evaluated on execution", from Stack Exchange's Super User site.
3. Corollary: Use ! around the variables in the block, not %, for delayed expansion.
4. But we're getting ahead of ourselves. The setlocal at the top means I don't set the variables back at my prompt. Without this, as I changed my script to fix mistakes it did something different between two runs, since a variable I had previously set might end up being empty when I broke stuff.
5. "Echo is off" spewed to the prompt means I was trying to echo empty variables, so the var: etc tells me which line something is coming from.
6. !var:~2! gives me everything after the first two characters, so I could drop the f3 at the start of the filename and make f4 and f9 files to try a diff on afterwards. Again pling for delayed expansion.




I suspect I could improve this, but it's six important things to remember another time.

Writing this in Python might have been easier. Or perhaps I should learn PowerShell one day.
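
For comparison, here is a rough Python sketch of the same loop. It follows the batch file above: the model names, the swrelpred folder, the *_model.dat pattern and the f4/f9 output names come from there. The function name build_commands and the choice to only list the commands rather than run them are my own assumptions for the sketch, so treat it as a starting point, not a tested replacement.

```python
from pathlib import Path

MODELS = ["du", "go", "jm", "kl", "lm", "lnhpp", "lv", "mo"]


def build_commands(input_dir=".", exe_dir="swrelpred"):
    """List the commands the batch file would run, without running them."""
    commands = []
    for model in MODELS:
        for path in sorted(Path(input_dir).glob(f"*_{model}.dat")):
            rest = path.stem[2:]  # drop the first two characters, like !var:~2!
            exe = Path(exe_dir) / f"{model}.exe"
            cal_exe = Path(exe_dir) / f"{model}cal.exe"
            commands.append([str(exe), path.name, f"f4{rest}"])
            commands.append([str(cal_exe), path.name, f"f4{rest}", f"f9{rest}"])
    return commands


if __name__ == "__main__":
    # Dry run: print each command. Passing one to subprocess.run(cmd, check=True)
    # would execute it instead.
    for cmd in build_commands():
        print(" ".join(cmd))
```

There is no delayed expansion to worry about here: rest is an ordinary variable that is reassigned on each pass through the loop.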