A Dataset for Teaching Clustering – The Fruit Dataset

Clustering, a type of unsupervised learning, is one of the main categories of machine learning. It has the potential to enable human-like cognition in A.I., but its results are not necessarily intuitive for humans, or similar to the way humans tend to group objects and create new categories.

Arguably, this is because, although clustering and the discovery of ‘natural categories’ are basic human cognitive activities, machine learning algorithms create clusters in very different ways from humans, and the results do not necessarily behave the way human-created clusters do. The problem is made worse by the fact that visualizing multi-dimensional clusters is very difficult, so we often need to more or less ‘trust’ the results of the algorithm (a common issue across machine learning techniques).

Helping people build better intuitions about how machine-created clusters look and act could improve the situation. With this in mind, I’ve created a very simple practice dataset, the fruit dataset, to illustrate clustering results in a way that lets people more easily compare their human-centric clustering expectations to the results produced by different machine learning algorithms.

This small dataset is created from a number of images of apples and pears. The collection of images is intended to provoke some questions about what counts as a good clustering result, what strategies would lead to such a result, and how context-dependent and generalizable such results are (or should be).

You can download the fruit dataset here, and the metadata for the dataset here.

Each image has a small number of other measures associated with it. Categorical, ordinal, binary, integer and continuous variables allow for the application and comparison of different distance metrics, clustering algorithms and clustering quality metrics. Some of these variables are intended to be more superficial while others are intended to be more closely connected to what we would consider the ‘natural kinds’ present in the dataset.
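To make the point about mixed variable types concrete, here is a minimal sketch of a Gower-style distance, one common way to compare records that mix categorical, binary and numeric values. The variable names (`colour`, `has_stem`, `height_px`) and their values are hypothetical and are not taken from the fruit dataset's metadata; treat this as an illustration under those assumptions rather than a recommended metric for this dataset.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-variable dissimilarity between two records (dicts).

    Numeric variables contribute |a - b| / range; every other variable
    contributes 0 if the values match and 1 if they differ.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            span = numeric_ranges[key]
            # Guard against a zero range (all values identical).
            total += abs(a[key] - b[key]) / span if span else 0.0
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical records; these columns are assumptions for illustration only.
fruit_a = {"colour": "red", "has_stem": True, "height_px": 120}
fruit_b = {"colour": "green", "has_stem": True, "height_px": 180}

# Assumed range (max - min) of each numeric variable across the dataset.
ranges = {"height_px": 100}

d = gower_distance(fruit_a, fruit_b, ranges)
print(round(d, 3))
```

Swapping in a different per-variable rule (for example, treating an ordinal variable as numeric rather than categorical) changes which fruits count as ‘close’, which is exactly the kind of choice the dataset is meant to expose.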

Most importantly, any results can be compared with an ‘eyeball analysis’ of the resulting clusters: by looking at the images grouped in each cluster, we can see how the results compare to our human-centric perspective on what a good clustering result would look like. In doing this comparison, we can form our own opinions about how successful, or useful, different clustering strategies are.
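As a rough sketch of how such an eyeball analysis might be set up, the snippet below groups image file names by the cluster label an algorithm assigned to them, so each group can then be viewed side by side. The file names and labels are made up for illustration, and `group_by_cluster` is a hypothetical helper, not part of the dataset or of any library.

```python
from collections import defaultdict


def group_by_cluster(image_names, labels):
    """Map each cluster label to the list of images assigned to it."""
    groups = defaultdict(list)
    for name, label in zip(image_names, labels):
        groups[label].append(name)
    return dict(groups)


# Hypothetical file names and cluster labels (assumptions for illustration).
names = ["img_01.jpg", "img_02.jpg", "img_03.jpg", "img_04.jpg"]
labels = [0, 1, 0, 1]

clusters = group_by_cluster(names, labels)
for label, members in sorted(clusters.items()):
    print(label, members)
```

The labels could come from any clustering algorithm; the grouping step is the same, which makes it easy to compare several strategies against the same set of images.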

Post Author: Jen Schellinck