The machine learning theme continues to be popular at the F#unctional Londoners meetup group. Last night Matt Moloney gave a great hands-on session on k-means clustering. Matt has worked on large machine learning systems at eBay. More recently he has been working on the Tsunami IDE, an extensible REPL environment for the desktop and cloud.

Tsunami provides a lightweight environment focused on interactive development, which makes it well suited to machine learning. And with F# 3 Type Providers you get typed access to a diverse set of data sources, from CSV files all the way up to Hadoop. Interestingly, Tsunami can be embedded into Excel and used as a replacement for VBA.

Greg Young describes Tsunami as a REPL on steroids.

k-means clustering has a number of interesting application areas, from search to pharmaceuticals. For the session Matt provided an F# script to analyse the canonical iris data set (flowers). The script also produces a variety of charts for visualizing the data, including animated GIFs showing the centroid positions at each iteration.

The FSharp.Data CSV Type Provider, available on NuGet, gives typed access to CSV files and was used to extract the values from the iris data file:

    type Iris = CsvProvider<irisDataFile>
    let iris = Iris.Load(irisDataFile)
    let irisData = iris.Data |> Seq.toArray

    /// classifications
    let y = irisData |> Array.map (fun row -> row.Class)

    /// feature vectors
    let X =
        irisData
        |> Array.map (fun row ->
            [| row.``Sepal Length``
               row.``Sepal Width``
               row.``Petal Length``
               row.``Petal Width`` |])

Computing k-means centroids:

    let K = 3 // The Iris dataset is known to only have 3 clusters
    let seed = [|X.[0]; X.[1]; X.[2]|] // pick bad centroids on purpose
    let centroidResults = KMeans.computeCentroids seed X |> Seq.take iterationLimit

I was particularly impressed by the conciseness of Matt’s implementation of the algorithm:

    (* K-Means Algorithm *)

    /// Group all the vectors by the nearest center.
    let classify centroids vectors =
        vectors |> Array.groupBy (fun v -> centroids |> Array.minBy (distance v))

    /// Repeatedly classify the vectors, starting with the seed centroids
    let computeCentroids seed vectors =
        seed |> Seq.iterate (fun centers -> classify centers vectors |> Array.map (snd >> average))
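The snippet above relies on `distance`, `average` and `Seq.iterate`, none of which are shown in this post. What follows is a minimal, self-contained sketch of how they might look, not Matt's actual definitions: it assumes Euclidean distance and a component-wise mean, and it defines a hypothetical `iterate` helper with `Seq.unfold`, since `Seq.iterate` is not part of FSharp.Core. The four sample points are made up for illustration and are not the iris data.

```fsharp
// Assumed helpers: the original post does not show these definitions.

/// Euclidean distance between two feature vectors of equal length.
let distance (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> (x - y) * (x - y)) a b |> Array.sum |> sqrt

/// Component-wise mean of a non-empty group of vectors.
let average (vectors: float[][]) =
    let n = float vectors.Length
    Array.init vectors.[0].Length (fun i ->
        (vectors |> Array.sumBy (fun v -> v.[i])) / n)

/// Lazy infinite sequence: x, f x, f (f x), ...
/// (hypothetical stand-in for the Seq.iterate used in the session script)
let iterate f x = Seq.unfold (fun state -> Some (state, f state)) x

/// Group all the vectors by the nearest centroid.
let classify (centroids: float[][]) (vectors: float[][]) =
    vectors |> Array.groupBy (fun v -> centroids |> Array.minBy (distance v))

/// Repeatedly classify the vectors and move each centroid to the mean of its group.
let computeCentroids seed vectors =
    seed |> iterate (fun centers -> classify centers vectors |> Array.map (snd >> average))

// Made-up data with two obvious clusters.
let points =
    [| [| 1.0; 1.0 |]; [| 1.0; 2.0 |]; [| 9.0; 9.0 |]; [| 9.0; 10.0 |] |]

// Seed with two neighbouring points (deliberately poor), then take the 5th iteration.
let result =
    computeCentroids [| points.[0]; points.[1] |] points
    |> Seq.item 5
// result = [| [| 1.0; 1.5 |]; [| 9.0; 9.5 |] |]
```

One thing to be aware of with this formulation: `Array.groupBy` only returns groups that contain at least one vector, so a centroid that attracts no points silently disappears and the number of clusters can shrink between iterations.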

Thanks again to Matt for giving a really interesting session.

If you’re interested in learning more, Matt’s also giving an in-depth session on machine learning at the Progressive F# Tutorials in London at the end of October.

And if you’re in New York next week you can catch Rachel Reese giving an introduction to data science, followed by a machine learning introduction with Mathias Brandewinder and me, at the Progressive F# Tutorials NYC.