PREDICTING ALGAE BLOOMS

(based on a Case Study by L.Torgo)

This notebook provides a concrete and thorough illustration of the data science process using a realistic dataset: algae_blooms.csv, which is also available at the UCI Machine Learning Repository.

The problem is to predict the occurrence of harmful algae in water samples.

We also use it to highlight various aspects of data exploration, data cleaning, and R syntax.


OUTLINE


Back to top

PROBLEM DESCRIPTION

The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.

The dataset which is used to train the learning model consists of:

  • chemical properties of various water samples of European rivers
  • the quantity of seven algae in each of the samples, and
  • the characteristics of the collection process for each sample.

What is the data science motivation for such a model? After all, we can analyze water samples to determine if various harmful algae are present or absent.

The answer is simple: chemical monitoring is cheap and easy to automate, whereas biological analysis of samples is expensive and slow.

Another answer is that analyzing the samples for harmful content does not provide a better understanding of algae drivers: it just tells us which samples contain algaes.


Back to top

LOADING THE DATA

Before we can take a look at the data and begin the process in earnest, we need to load it in the in the R workspace.

The dataset is stored in the CSV file algae_blooms.csv (in the Data directory).

In [1]:
# The data can also be loaded as part of Torgo's DMwR package, 
# which could be loaded directly as below (by uncommenting the 
# line of code), but that is not how we will access it.

# library(DMwR)

As the data resides in a CSV file, we use the read.csv R function to load it and store it to an R data frame (Torgo stores it to a tibble object).

(Note the sep and header options -- what do you think their function is?)

In [2]:
algae_blooms<-read.csv("Data/algae_blooms.csv", sep=",", header=TRUE)

We can get an idea of the data frame's structure by calling the str function.

In [3]:
str(algae_blooms)
'data.frame':	340 obs. of  18 variables:
 $ season: Factor w/ 4 levels "autumn","spring",..: 4 2 1 2 1 4 3 1 4 4 ...
 $ size  : Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ speed : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 1 1 1 3 1 ...
 $ mxPH  : num  8 8.35 8.1 8.07 8.06 8.25 8.15 8.05 8.7 7.93 ...
 $ mnO2  : num  9.8 8 11.4 4.8 9 13.1 10.3 10.6 3.4 9.9 ...
 $ Cl    : num  60.8 57.8 40 77.4 55.4 ...
 $ NO3   : num  6.24 1.29 5.33 2.3 10.42 ...
 $ NH4   : num  578 370 346.7 98.2 233.7 ...
 $ oPO4  : num  105 428.8 125.7 61.2 58.2 ...
 $ PO4   : num  170 558.8 187.1 138.7 97.6 ...
 $ Chla  : num  50 1.3 15.6 1.4 10.5 ...
 $ a1    : num  0 1.4 3.3 3.1 9.2 15.1 2.4 18.2 25.4 17 ...
 $ a2    : num  0 7.6 53.6 41 2.9 14.6 1.2 1.6 5.4 0 ...
 $ a3    : num  0 4.8 1.9 18.9 7.5 1.4 3.2 0 2.5 0 ...
 $ a4    : num  0 1.9 0 0 0 0 3.9 0 0 2.9 ...
 $ a5    : num  34.2 6.7 0 1.4 7.5 22.5 5.8 5.5 0 0 ...
 $ a6    : num  8.3 0 0 0 4.1 12.6 6.8 8.7 0 0 ...
 $ a7    : num  0 2.1 9.7 1.4 1 2.9 0 0 0 1.7 ...

Evidently, algae_blooms is a data frame with 340 observations of 18 variables each.

Notes:

  • 3 of the fields are categorical (season, size, speed, which refer to the data collection process)
  • of the numerical fields, 8 have names that sound vaguely "chemical"
  • presumably, the remaining fields refer to the algae blooms

We can get a better feel for the data frame by observing it in its natural habitat, so to speak, using the head() or tail() functions.

In [4]:
head(algae_blooms)
seasonsizespeedmxPHmnO2ClNO3NH4oPO4PO4Chlaa1a2a3a4a5a6a7
winter small medium 8.00 9.8 60.800 6.238 578.000105.000170.00050.0 0.0 0.0 0.0 0.0 34.2 8.3 0.0
spring small medium 8.35 8.0 57.750 1.288 370.000428.750558.750 1.3 1.4 7.6 4.8 1.9 6.7 0.0 2.1
autumn small medium 8.10 11.4 40.020 5.330 346.667125.667187.05715.6 3.3 53.6 1.9 0.0 0.0 0.0 9.7
spring small medium 8.07 4.8 77.364 2.302 98.182 61.182138.700 1.4 3.1 41.0 18.9 0.0 1.4 0.0 1.4
autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.58010.5 9.2 2.9 7.5 0.0 7.5 4.1 1.0
winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.66728.4 15.1 14.6 1.4 0.0 22.5 12.6 2.9

Back to top

SUMMARY AND VISUALIZATION

As it happens, we are not given an awful lot of information about the dataset's domain (I, for one, remain woefully ill-prepared to deal with matters of a chemical nature, to my eternal shame).

Data exploration, in the form of summaries and visualization, can provide a better handle on the problem at hand.

A call to the summary function provides frequency counts for categorical variables, and 6-pt summaries for numerical variables. As a bonus, the number of missing values is also tabulated.

(the default setting only lists a limited number of categorical levels -- the summary documentation will explain how to increase the number of levels that are displayed.)

IMPORTANT NOTE: I may have given you the impression (just now) that exploration is only really necessary when domain expertise escapes us. Domain expertise can help analysts frame the problem and the analysis results in the appropriate manner, but it often also gives them a false sense of security. Errors can creep anywhere -- data exploration at an early stage may save you a lot of embarassing back-tracking at a later stage.

In [5]:
summary(algae_blooms)
    season       size        speed          mxPH            mnO2       
 autumn:80   large : 83   high  :142   Min.   :5.600   Min.   : 1.500  
 spring:84   medium:136   low   : 58   1st Qu.:7.750   1st Qu.: 7.925  
 summer:86   small :121   medium:140   Median :8.045   Median : 9.700  
 winter:90                             Mean   :7.997   Mean   : 9.157  
                                       3rd Qu.:8.393   3rd Qu.:10.800  
                                       Max.   :9.700   Max.   :13.400  
                                       NA's   :2       NA's   :2       
       Cl               NO3              NH4                oPO4        
 Min.   :  0.222   Min.   : 0.000   Min.   :    5.00   Min.   :   1.00  
 1st Qu.: 10.994   1st Qu.: 1.147   1st Qu.:   37.86   1st Qu.:  13.00  
 Median : 32.470   Median : 2.356   Median :  107.36   Median :  37.24  
 Mean   : 42.517   Mean   : 3.121   Mean   :  471.73   Mean   :  73.09  
 3rd Qu.: 57.750   3rd Qu.: 4.147   3rd Qu.:  244.90   3rd Qu.:  88.11  
 Max.   :391.500   Max.   :45.650   Max.   :24064.00   Max.   :1435.00  
 NA's   :16        NA's   :2        NA's   :2          NA's   :2        
      PO4              Chla               a1              a2        
 Min.   :   1.0   Min.   :  0.200   Min.   : 0.00   Min.   : 0.000  
 1st Qu.:  40.0   1st Qu.:  2.133   1st Qu.: 1.50   1st Qu.: 0.000  
 Median : 101.5   Median :  5.111   Median : 7.10   Median : 2.800  
 Mean   : 136.7   Mean   : 12.796   Mean   :16.70   Mean   : 7.201  
 3rd Qu.: 200.2   3rd Qu.: 17.200   3rd Qu.:25.18   3rd Qu.:10.150  
 Max.   :1690.0   Max.   :110.456   Max.   :89.80   Max.   :72.600  
 NA's   :7        NA's   :23                                        
       a3               a4              a5               a6        
 Min.   : 0.000   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 1.400   Median : 0.00   Median : 2.200   Median : 0.000  
 Mean   : 3.904   Mean   : 1.81   Mean   : 5.516   Mean   : 6.411  
 3rd Qu.: 4.625   3rd Qu.: 2.30   3rd Qu.: 8.000   3rd Qu.: 7.025  
 Max.   :42.800   Max.   :44.60   Max.   :61.100   Max.   :77.600  
                                                                   
       a7        
 Min.   : 0.000  
 1st Qu.: 0.000  
 Median : 0.000  
 Mean   : 2.206  
 3rd Qu.: 2.200  
 Max.   :31.600  
                 

Notes:

  • The chemical variables all have missing values, ranging from only 2 to 7, 16, and 23.
  • The observations seem fairly uniformly distributed in terms of the seasons, but large rivers and low speed rivers are not represented as often as their counterparts.
  • All numerical values are non-negative, which makes sense in the context of concentrations
  • I do not know what the range of the chemical values should take in a real-world context, but some of the maximum values seem ... unrealistic (NH4!!, oPO4, a7, etc.)
  • do you spot anything else?

Of course, these summaries each apply to a single variable (1-way tables). Can we spot anything else using $n$-way tables (on the categorical variables)?

In [6]:
# 2-way tables
table(algae_blooms$speed,algae_blooms$size)
table(algae_blooms$speed,algae_blooms$season)
table(algae_blooms$season,algae_blooms$size)

# 3-way table
table(algae_blooms$season,algae_blooms$size,algae_blooms$speed)
        
         large medium small
  high      13     56    73
  low       32     24     2
  medium    38     56    46
        
         autumn spring summer winter
  high       32     34     38     38
  low        16     13     12     17
  medium     32     37     36     35
        
         large medium small
  autumn    19     33    28
  spring    21     34    29
  summer    19     36    31
  winter    24     33    33
, ,  = high

        
         large medium small
  autumn     3     13    16
  spring     3     14    17
  summer     3     16    19
  winter     4     13    21

, ,  = low

        
         large medium small
  autumn     9      6     1
  spring     7      6     0
  summer     6      6     0
  winter    10      6     1

, ,  = medium

        
         large medium small
  autumn     7     14    11
  spring    11     14    12
  summer    10     14    12
  winter    10     14    11

The 6-pt summary provides some information about the underlying distribution, but not much on the parametric front. A more traditional summary can be displayed using the psych library's describe function.

In [7]:
library(psych)
describe(algae_blooms)
varsnmeansdmediantrimmedmadminmaxrangeskewkurtosisse
season* 1 340 2.547059 1.1186895 3.0000 2.558824 1.482600 1.000 4.000 3.000 -0.05447289 -1.3652641 0.06066946
size* 2 340 2.111765 0.7676208 2.0000 2.139706 1.482600 1.000 3.000 2.000 -0.19150312 -1.2876572 0.04163008
speed* 3 340 1.994118 0.9120437 2.0000 1.992647 1.482600 1.000 3.000 2.000 0.01153866 -1.8012592 0.04946251
mxPH 4 338 7.997293 0.5783188 8.0450 8.035864 0.467019 5.600 9.700 4.100 -0.84022022 1.9865897 0.03145640
mnO2 5 338 9.156716 2.3130799 9.7000 9.388199 1.927380 1.500 13.400 11.900 -0.93926049 0.5497464 0.12581496
Cl 6 324 42.517246 44.4906037 32.4700 34.997591 33.478591 0.222 391.500 391.278 2.84856749 14.0485533 2.47170021
NO3 7 338 3.120784 3.2851622 2.3555 2.699504 2.181646 0.000 45.650 45.650 6.78743778 81.1290663 0.17868927
NH4 8 338 471.734411 1739.0774580107.3570 156.746838 119.949753 5.000 24064.000 24059.000 9.11841709 105.4980942 94.59334335
oPO4 9 338 73.091882 114.1420517 37.2430 51.792802 43.460936 1.000 1435.000 1434.000 5.99067080 60.0469965 6.20850914
PO410 333 136.685700 149.4773125101.4550 115.867652 114.999352 1.000 1690.000 1689.000 4.22624415 35.1788385 8.19130627
Chla11 317 12.796196 18.0813363 5.1110 8.782608 6.094969 0.200 110.456 110.256 2.61867532 7.8575379 1.01554902
a112 340 16.701765 20.9987076 7.1000 12.593750 10.526460 0.000 89.800 89.800 1.50408832 1.4996118 1.13881481
a213 340 7.200882 10.7549412 2.8000 4.854412 4.151280 0.000 72.600 72.600 2.32801696 6.7216331 0.58326858
a314 340 3.904412 6.4205247 1.4000 2.358456 2.075640 0.000 42.800 42.800 2.41507584 6.7202844 0.34820184
a415 340 1.810000 3.8292948 0.0000 1.036765 0.000000 0.000 44.600 44.600 5.95164312 52.8944003 0.20767267
a516 340 5.515588 8.4186630 2.2000 3.692279 3.261720 0.000 61.100 61.100 2.75624792 10.4260075 0.45656610
a617 340 6.411471 12.3978237 0.0000 3.279779 0.000000 0.000 77.600 77.600 2.88057370 9.2050442 0.67236639
a718 340 2.206471 4.9472217 0.0000 1.040074 0.000000 0.000 31.600 31.600 4.09036063 18.5615144 0.26830077

Notes:

  • the categorical variables are marked by an asterisk *; the levels are coded with an integer, and treated as numerical variables for the purpose of the analysis, so the results for these fields are meaningless
  • the trimmed variable refers to the trimmed mean, the mean obtained when a certain percentage of the observations are removed from both end of the spectrum (what percentage, exactly?)
  • the mad variable refers to the median absolute deviation (from the median)

I personally find such a table hard to read and really grasp once there are more than a few variables in the dataset. Visualization comes in handy then.

Basic histograms can be constructed with the hist() function.

In [8]:
hist(algae_blooms$mnO2)

Based on this histogram, we can conclude that the underlying distribution of mnO2 has a negative skew, say, which is confirmed by the table above.

This basic histogram can be spruced up by using Hadley Wickham's (in)famous ggplot2 library.

In [9]:
library(ggplot2)
Attaching package: ‘ggplot2’

The following objects are masked from ‘package:psych’:

    %+%, alpha

In [10]:
ggplot(algae_blooms,aes(x=mnO2)) +        # plotting mnO2 from the algae_blooms dataset ...
    geom_histogram(aes(y=..density..)) +  # as a histogram, where the vertical axis is the density ... 
    geom_density(color="blue") +          # on which will be layered a blue density curve ... 
    geom_rug() +                          # and a rug (or comb) showing where the observations actually fall...
    ggtitle("Histogram of minimum value of O2 among 340 observations") + # with this title ... 
    xlab("") +                                                           # no x axis label ...   
    ylab("")                                                             # and no y axis label
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
“Removed 2 rows containing non-finite values (stat_bin).”Warning message:
“Removed 2 rows containing non-finite values (stat_density).”

While mnO2 clearly does not follow a normal distribution (no negative value, high skew), we see that such an approximation wouldn't be so bad compared to a1.

In [11]:
ggplot(algae_blooms,aes(x=a1)) + 
    geom_histogram(aes(y=..density..)) +
    geom_density(color="red") + 
    geom_rug() +
    ggtitle("Histogram of minimum value of a1 among 340 observations") + 
    xlab("") +
    ylab("")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

$qq-$plots, another traditional statistical plot, can be produced with the car libary's qqPlot() function.

In [12]:
library(car)
Attaching package: ‘car’

The following object is masked from ‘package:psych’:

    logit

In [13]:
qqPlot(algae_blooms$mnO2, main='Normal QQ plot for minimum values of O2', ylab="")

Again, we can see that the normal distribution is not a good fit for mnO2, but that it is even worse for a1 (see below).

In [14]:
qqPlot(algae_blooms$a1, main='Normal QQ plot for a1', ylab="")

Now let's take a look at some of the odd values for NH4.

In [15]:
ggplot(algae_blooms,aes(x=factor(0),y=NH4)) +   # plotting NH4 from the algae_blooms dataset ...
    geom_boxplot() +                            # as a boxplot ... 
    geom_rug() +                                # with a rug on which the true values are shown ...
    geom_hline(aes(yintercept=mean(algae_blooms$NH4, na.rm=TRUE)), linetype=2, colour="pink") + # and a horizontal line showing where the mean NH4 value falls ...
    ylab("Ammonium (NH4+)") +                   
    xlab("") +
    scale_x_discrete(breaks=NULL)
Warning message:
“Removed 2 rows containing non-finite values (stat_boxplot).”

We see that there are a string of values falling way above the boxplot. If the underlying distribution was normal, say, these would definitely be considered outliers.

Let's investigate further.

In [16]:
plot(algae_blooms$NH4, xlab="", ylab="Ammonium (NH4+)") # scatter plot of NH4 against observation number
abline(h=mean(algae_blooms$NH4, na.rm=TRUE), lty=1)     # mean value of NH4, solid line
abline(h=mean(algae_blooms$NH4, na.rm=TRUE) + sd(algae_blooms$NH4, na.rm=TRUE), lty=2) # mean + sd value of NH4, dashed line
abline(h=median(algae_blooms$NH4, na.rm=TRUE), lty=3)   # median value of NH4, tight dashed line

We can also look at the data and see which observations have values of NH4 below 3000 (roughly all values below the long dashed line above)

In [17]:
algae_blooms[-which(algae_blooms$NH4>3000),]        # keeping all values for which NH4 is NOT > 3000
nrow(algae_blooms[-which(algae_blooms$NH4>3000),])  # seeing how many are being kept
seasonsizespeedmxPHmnO2ClNO3NH4oPO4PO4Chlaa1a2a3a4a5a6a7
1winter small medium 8.00 9.8 60.800 6.238 578.000105.000 170.000 50.000 0.0 0.0 0.0 0.0 34.2 8.3 0.0
2spring small medium 8.35 8.0 57.750 1.288 370.000428.750 558.750 1.300 1.4 7.6 4.8 1.9 6.7 0.0 2.1
3autumn small medium 8.10 11.4 40.020 5.330 346.667125.667 187.057 15.600 3.3 53.6 1.9 0.0 0.0 0.0 9.7
4spring small medium 8.07 4.8 77.364 2.302 98.182 61.182 138.700 1.400 3.1 41.0 18.9 0.0 1.4 0.0 1.4
5autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.580 10.500 9.2 2.9 7.5 0.0 7.5 4.1 1.0
6winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.667 28.400 15.1 14.6 1.4 0.0 22.5 12.6 2.9
7summer small high 8.15 10.3 73.250 1.535 110.000 61.250 111.750 3.200 2.4 1.2 3.2 3.9 5.8 6.8 0.0
8autumn small high 8.05 10.6 59.067 4.990 205.667 44.667 77.434 6.900 18.2 1.6 0.0 0.0 5.5 8.7 0.0
9winter small medium 8.70 3.4 21.950 0.886 102.750 36.300 71.000 5.544 25.4 5.4 2.5 0.0 0.0 0.0 0.0
10winter small high 7.93 9.9 8.000 1.390 5.800 27.250 46.600 0.800 17.0 0.0 0.0 2.9 0.0 0.0 1.7
11spring small high 7.70 10.2 8.000 1.527 21.571 12.750 20.750 0.800 16.6 0.0 0.0 0.0 1.2 0.0 6.0
12summer small high 7.45 11.7 8.690 1.588 18.429 10.667 19.000 0.600 32.1 0.0 0.0 0.0 0.0 0.0 1.5
13winter small high 7.74 9.6 5.000 1.223 27.286 12.000 17.000 41.000 43.5 0.0 2.1 0.0 1.2 0.0 2.1
14summer small high 7.72 11.8 6.300 1.470 8.000 16.000 15.000 0.500 31.1 1.0 3.4 0.0 1.9 0.0 4.1
15winter small high 7.90 9.6 3.000 1.448 46.200 13.000 61.600 0.300 52.2 5.0 7.8 0.0 4.0 0.0 0.0
16autumn small high 7.55 11.5 4.700 1.320 14.750 4.250 98.250 1.100 69.9 0.0 1.7 0.0 0.0 0.0 0.0
17winter small high 7.78 12.0 7.000 1.420 34.333 18.667 50.000 1.100 46.2 0.0 0.0 1.2 0.0 0.0 0.0
18spring small high 7.61 9.8 7.000 1.443 31.333 20.000 57.833 0.400 31.8 0.0 3.1 4.8 7.7 1.4 7.2
19summer small high 7.35 10.4 7.000 1.718 49.000 41.500 61.500 0.800 50.6 0.0 9.9 4.3 3.6 8.2 2.2
21winter small medium 7.83 10.7 88.000 4.825 1729.000467.500 586.000 16.000 0.0 0.0 0.0 6.8 6.1 0.0 0.0
22spring small high 7.20 9.2 0.800 0.642 81.000 15.600 18.000 0.500 15.5 0.0 0.0 2.3 0.0 0.0 0.0
23autumn small high 7.75 10.3 32.920 2.942 42.000 16.000 40.000 7.600 23.2 0.0 0.0 0.0 27.6 11.1 0.0
24winter small high 7.62 8.5 11.867 1.715 208.333 3.000 27.500 1.700 74.2 0.0 0.0 3.7 0.0 0.0 0.0
25spring small high 7.84 9.4 10.975 1.510 12.500 3.000 11.500 1.500 13.0 8.6 1.2 3.5 1.2 1.6 1.9
26summer small high 7.77 10.7 12.536 3.976 58.500 9.000 44.136 3.000 4.1 0.0 0.0 0.0 9.2 10.1 0.0
27winter small high 7.09 8.4 10.500 1.572 28.000 4.000 13.600 0.500 29.7 0.0 0.0 4.9 0.0 0.0 0.0
28autumn small high 6.80 11.1 9.000 0.630 20.000 4.000 NA 2.700 30.3 1.9 0.0 0.0 2.1 1.4 2.1
29winter small high 8.00 9.8 16.000 0.730 20.000 26.000 45.000 0.800 17.1 0.0 19.6 0.0 0.0 0.0 2.5
30spring small high 7.20 11.3 9.000 0.230 120.000 12.000 19.000 0.500 33.9 1.0 14.6 0.0 0.0 0.0 0.0
31autumn small high 7.40 12.5 13.000 3.330 60.000 72.000 142.000 4.900 3.4 16.0 1.2 0.0 15.3 15.8 0.0
311autumn large medium 8.40 11.43 4.96600.969 24.111 6.000 18.167 2.133 34.7 1.4 2.3 0.0 0.0 0.0 0.0
312spring large high 8.20 9.90 6.40000.553 21.429 12.000 76.286 1.300 16.0 2.1 2.6 0.0 26.5 10.1 0.0
313autumn large high 8.00 10.98 9.70000.874 67.700 26.600 51.034 2.200 26.7 0.0 0.0 0.0 0.0 4.9 0.0
314autumn large low 8.30 8.90 42.05805.922 116.727 150.583 220.723 6.700 0.0 45.4 0.0 0.0 0.0 0.0 3.8
315spring large low 8.70 6.80 16.88902.139 30.000 37.111 85.444 23.033 2.2 12.0 1.2 0.0 0.0 0.0 1.0
316winter large medium 8.60 10.40 15.18202.502 140.909 31.909 77.700 15.318 4.8 0.0 0.0 0.0 8.7 23.3 30.8
317summer large medium 8.00 9.10 15.37502.118 43.750 48.875 86.500 8.125 0.0 0.0 1.0 0.0 9.2 7.7 10.1
318summer large medium 8.20 9.50 17.87502.363 63.750 44.000 77.000 8.463 1.5 29.5 22.1 5.4 0.0 1.5 0.0
319spring large medium 8.50 9.60 16.54503.849 103.273 34.273 63.400 14.682 18.8 17.1 2.2 0.0 4.6 3.7 0.0
320spring large medium 8.04 9.30 130.26303.776 131.008 97.500 152.966 6.150 0.0 8.7 0.0 1.4 2.2 4.1 1.9
321autumn large medium 7.95 9.10 76.88603.461 93.827 68.333 146.049 3.950 1.2 31.2 5.7 0.0 3.5 3.5 0.0
322summer small high 7.25 9.54 NA0.642 85.000 14.600 19.450 0.460 15.5 0.0 0.0 2.3 0.0 0.0 0.0
323autumn small high 7.64 10.30 34.23502.942 41.430 17.000 41.567 7.430 23.2 0.0 0.0 0.0 27.6 11.1 0.0
324winter small high 7.92 8.50 10.86701.715 199.540 3.222 27.200 1.900 74.2 0.0 0.0 3.7 0.0 0.0 0.0
325spring small high 7.62 9.40 11.05501.510 13.560 4.000 12.650 1.456 13.0 8.6 1.2 3.5 1.2 1.6 1.9
326summer medium high 7.75 10.70 15.50003.976 57.640 10.500 43.169 3.120 4.1 0.0 0.0 0.0 9.2 10.1 0.0
327winter small high 7.08 8.40 9.45001.572 26.540 4.000 13.600 0.675 29.7 0.0 0.0 4.9 0.0 0.0 0.0
328summer small high 6.92 11.10 9.10000.630 21.000 5.000 NA 2.460 30.3 1.9 0.0 0.0 2.1 1.4 2.1
329winter small high 8.10 9.80 14.34000.730 22.500 23.000 45.500 0.850 17.1 0.0 19.6 0.0 0.0 0.0 2.5
330spring small high 7.20 11.30 8.97000.230 134.500 13.000 19.000 NA 33.9 1.0 14.6 0.0 0.0 0.0 0.0
331spring large medium 8.61 10.10 3.51800.663 12.220 3.222 7.000 1.300 48.3 2.0 0.0 0.0 0.0 0.0 0.0
332summer large medium 8.22 9.50 2.30000.672 9.870 4.000 6.123 0.800 50.4 3.8 0.0 0.0 0.0 0.0 0.0
333winter large medium 8.53 10.50 3.00000.758 10.350 4.100 NA 4.000 56.8 5.0 0.0 0.0 0.0 0.0 0.0
334summer large medium 8.40 10.00 3.51000.866 29.650 5.800 15.000 2.860 17.3 6.7 19.7 0.0 0.0 0.0 0.0
335winter large high 8.10 10.90 9.05600.825 41.000 20.000 58.000 NA 16.8 19.6 4.0 0.0 0.0 0.0 0.0
336summer medium high 8.12 10.20 7.61300.699 33.560 28.034 49.658 2.200 18.1 1.7 2.0 0.0 1.7 5.9 0.0
337winter large low 8.43 10.80 35.64206.225 134.000 103.500 NA 45.375 1.1 3.9 2.1 0.0 3.9 4.6 2.3
338winter large low 8.70 11.70 21.46563.765 91.450 38.000 83.000 17.000 0.0 4.7 0.0 0.0 2.6 2.6 0.0
339summer large low 8.10 8.20 26.54002.805 42.750 48.500 88.125 13.980 0.0 12.0 1.7 0.0 2.7 0.0 0.0
340autumn large low 8.35 11.10 22.56003.140 76.200 41.000 98.665 17.456 1.7 7.0 1.2 0.0 4.8 3.1 0.0
329

Let's see what the boxplot above looks like without the suspected outliers.

In [18]:
ggplot(algae_blooms[-which(algae_blooms$NH4>3000),],aes(x=factor(0),y=NH4)) + 
    geom_boxplot() + 
    geom_rug() +
    geom_hline(aes(yintercept=mean(algae_blooms[-which(algae_blooms$NH4>3000),8], na.rm=TRUE)), linetype=2, colour="pink") +
    ylab("Ammonium (NH4+)") +
    xlab("") +
    scale_x_discrete(breaks=NULL)
Warning message:
“Removed 2 rows containing non-finite values (stat_boxplot).”

It's a bit better, to be sure (the box structure has expanded). There still seems to be a bit of very high values. Perhaps that's normal? How would we find out?

Now let's take a look at some of the algae levels.

In [19]:
ggplot(algae_blooms,aes(x=season,y=a3)) +   # plot a3 by season ...
    geom_boxplot() +                        # in a series of boxplots ...
    xlab("Season") +                        # with x axis as Seasons and y axis as a3
    ylab("Algae Level a3")

What does that tell you? Or is it hard to get a good handle on the situation because the season are out of order?

We can re-arrange the factors, but it requires a bit of fancy footwork using the forcats'library fct_relevel() function, and dplyr's mutate().

In [20]:
library(forcats) # for fct_relevel
library(dplyr)   # for mutate
Attaching package: ‘dplyr’

The following object is masked from ‘package:car’:

    recode

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

In [21]:
# note to change the mutate
algae_blooms$size <- factor(algae_blooms$size,levels = c("small", "medium", "large")) 
algae_blooms$speed <- factor(algae_blooms$speed,levels  = c("low","medium","high")) 
algae_blooms$season <- factor(algae_blooms$season,levels = c("spring","summer","autumn","winter")) 

#algae_blooms = mutate(algae_blooms, 
#                    size=fct_relevel(size,c("small","medium","large")), # factors should appear in the desired order
#                    speed=fct_relevel(speed,c("low","medium","high")),  # ditto
#                    season=fct_relevel(season,c("spring","summer","autumn","winter")) # same here
#                    )
In [22]:
ggplot(algae_blooms,aes(x=season,y=a3)) + 
    geom_boxplot() +
    xlab("Season") +
    ylab("Algae Level a3")

We only have 1 year's worth of data, so it's perhaps too early to tell, but it certainly seems as though the a3 levels decrease from spring to winter.

Violin plots are cousins to the boxplots. Can we get a bit more insight on the a3 trend?

In [23]:
ggplot(algae_blooms,aes(x=season,y=a3)) +  # plot a3 by season ...
    geom_violin() +                        # in a series of violin plots ...  
    geom_jitter() +                        # with some jitter to avoid all the points being on top of one another ...
    xlab("Season") + 
    ylab("Algae Level a3")

This plot certainly seems to suggest that a3 levels are cyclic, with a peak in the spring and low levels in the fall.

(What happens if you turn off the jitter option?)

Let's go back to NH4 for a second to see if we can spot a link with the season (as we did for a3).

We only keep the observations for which the NH4 value is greater than 3000, and we bin them with respect to the quartiles.

This will also give us the chance to demonstrate the usage of the pipeline operator %>%.


First, we filter the algae_blooms dataset to remove the 2 observations with missing values (is this always a good idea?)

In [24]:
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) # %>% 
    # filter(NH4<3000) %>% 
    # mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
seasonsizespeedmxPHmnO2ClNO3NH4oPO4PO4Chlaa1a2a3a4a5a6a7
winter small medium 8.00 9.8 60.800 6.238 578.000105.000170.00050.0 0.0 0.0 0.0 0.0 34.2 8.3 0.0
spring small medium 8.35 8.0 57.750 1.288 370.000428.750558.750 1.3 1.4 7.6 4.8 1.9 6.7 0.0 2.1
autumn small medium 8.10 11.4 40.020 5.330 346.667125.667187.05715.6 3.3 53.6 1.9 0.0 0.0 0.0 9.7
spring small medium 8.07 4.8 77.364 2.302 98.182 61.182138.700 1.4 3.1 41.0 18.9 0.0 1.4 0.0 1.4
autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.58010.5 9.2 2.9 7.5 0.0 7.5 4.1 1.0
winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.66728.4 15.1 14.6 1.4 0.0 22.5 12.6 2.9
338

Next we remove the 11 observations for which NH4 > 3000 (again, based on the mean + sd "argument")

In [25]:
f.NH4.data <- filter(algae_blooms,!is.na(NH4))  %>% 
     filter(NH4<3000) # %>% 
    # mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
seasonsizespeedmxPHmnO2ClNO3NH4oPO4PO4Chlaa1a2a3a4a5a6a7
winter small medium 8.00 9.8 60.800 6.238 578.000105.000170.00050.0 0.0 0.0 0.0 0.0 34.2 8.3 0.0
spring small medium 8.35 8.0 57.750 1.288 370.000428.750558.750 1.3 1.4 7.6 4.8 1.9 6.7 0.0 2.1
autumn small medium 8.10 11.4 40.020 5.330 346.667125.667187.05715.6 3.3 53.6 1.9 0.0 0.0 0.0 9.7
spring small medium 8.07 4.8 77.364 2.302 98.182 61.182138.700 1.4 3.1 41.0 18.9 0.0 1.4 0.0 1.4
autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.58010.5 9.2 2.9 7.5 0.0 7.5 4.1 1.0
winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.66728.4 15.1 14.6 1.4 0.0 22.5 12.6 2.9
327

Finally, we add a new variable to tell us in which quartile the NH4 value falls.

In [26]:
f.NH4.data <- filter(algae_blooms,!is.na(NH4))  %>% 
     filter(NH4<3000)  %>% 
     mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
seasonsizespeedmxPHmnO2ClNO3NH4oPO4PO4Chlaa1a2a3a4a5a6a7q.NH4
winter small medium 8.00 9.8 60.800 6.238 578.000 105.000 170.000 50.0 0.0 0.0 0.0 0.0 34.2 8.3 0.0 (226,2.17e+03]
spring small medium 8.35 8.0 57.750 1.288 370.000 428.750 558.750 1.3 1.4 7.6 4.8 1.9 6.7 0.0 2.1 (226,2.17e+03]
autumn small medium 8.10 11.4 40.020 5.330 346.667 125.667 187.057 15.6 3.3 53.6 1.9 0.0 0.0 0.0 9.7 (226,2.17e+03]
spring small medium 8.07 4.8 77.364 2.302 98.182 61.182 138.700 1.4 3.1 41.0 18.9 0.0 1.4 0.0 1.4 (36.8,103]
autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.580 10.5 9.2 2.9 7.5 0.0 7.5 4.1 1.0 (226,2.17e+03]
winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.667 28.4 15.1 14.6 1.4 0.0 22.5 12.6 2.9 (226,2.17e+03]
327

We can now use the new q.NH4 variable to make multi-variate comparisons, say between a1, NH4, and season.

In [27]:
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) + # plot the a1 levels by season ...
    geom_point() +
    facet_wrap(~q.NH4) +                             # for each of the quartiles (each plot contains ~25% of the observations)
    guides(color=FALSE) +
    ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")

That's odd... why are we seeing missing values here? Haven't we removed the NAs? Let's delve in a bit deeper.

In [28]:
f.NH4.data[which(is.na(f.NH4.data$q.NH4)),]
table(f.NH4.data$q.NH4,useNA="ifany")
seasonsizespeedmxPHmnO2ClNO3NH4oPO4PO4Chlaa1a2a3a4a5a6a7q.NH4
53springsmall medium5.6 11.8 NA 2.22 5 1 1 NA 82.7 0 0 0 0 0 0 NA
223autumnsmall high 5.9 11.9 NA 1.88 5 1 2 NA 75.2 0 0 0 0 0 0 NA
      (5,36.8]     (36.8,103]      (103,226] (226,2.17e+03]           <NA> 
            80             82             81             82              2 

Ah, it seems as though the quartiles do not include their lower bound. We can remedy the situation by including an additional parameter in the mutate() call.

In [29]:
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>% 
    filter(NH4<3000) %>% 
    mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))
table(f.NH4.data$q.NH4,useNA="ifany")
      [5,36.8]     (36.8,103]      (103,226] (226,2.17e+03] 
            82             82             81             82 
In [30]:
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
    geom_point() +
    facet_wrap(~q.NH4) +
    guides(color=FALSE) +
    ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")

The NAs have disappeared from the graph (but the observations have not!). Either way, it seems as though the a1 levels are inversely correlated with the NH4+ levels but that the season doesn't have much of an effect (although we would need more work to conclude with any degree of certainty).

In [31]:
ggplot(f.NH4.data,aes(x=a3,y=season,color=season)) +
    geom_point() +
    facet_wrap(~q.NH4) +
    guides(color=FALSE) +
    ggtitle("Algae Level a3 by Season and Ammonium Quartiles NH4")

Back to top

DATA CLEANING

We found some potential anomalies in the data when we explored the dataset (although I'm electing to keep them in the dataset as I lack the domain expertise to make a reasonable decision on that front); now let's take some time to clean the data to some extent.

Anomalies come in various flavours; we have already explored some potential outlying behaviour, in this section we'll handle missing values.

In [32]:
library(dplyr)

The function complete.cases lists the observations for which every field is present (note that it says nothing about the validity of the case).

In [33]:
table(complete.cases(algae_blooms))
FALSE  TRUE 
   34   306 

The vast majority of observations do not have missing cases, but a few still do. Let's take a look at them to see if there is anything special about them? Are the values missing completely at random?

In [34]:
str(filter(algae_blooms, !complete.cases(algae_blooms)))
summary(filter(algae_blooms, !complete.cases(algae_blooms)))
'data.frame':	34 obs. of  18 variables:
 $ season: Factor w/ 4 levels "spring","summer",..: 3 1 4 4 1 3 1 2 3 1 ...
 $ size  : Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ speed : Factor w/ 3 levels "low","medium",..: 3 3 1 3 2 2 3 3 2 2 ...
 $ mxPH  : num  6.8 8 NA 6.6 5.6 5.7 6.6 6.6 6.6 6.5 ...
 $ mnO2  : num  11.1 NA 12.6 10.8 11.8 10.8 9.5 10.8 11.3 10.4 ...
 $ Cl    : num  9 1.45 9 NA NA NA NA NA NA NA ...
 $ NO3   : num  0.63 0.81 0.23 3.25 2.22 ...
 $ NH4   : num  20 10 10 10 5 10 20 10 10 10 ...
 $ oPO4  : num  4 2.5 5 1 1 1 1 2 1 2 ...
 $ PO4   : num  NA 3 6 6.5 1 4 6 11 6 14 ...
 $ Chla  : num  2.7 0.3 1.1 NA NA NA NA NA NA NA ...
 $ a1    : num  30.3 75.8 35.5 24.3 82.7 16.8 46.8 46.9 47.1 66.9 ...
 $ a2    : num  1.9 0 0 0 0 4.6 0 0 0 0 ...
 $ a3    : num  0 0 0 0 0 3.9 0 0 0 0 ...
 $ a4    : num  0 0 0 0 0 11.5 28.8 13.4 0 0 ...
 $ a5    : num  2.1 0 0 0 0 0 0 0 0 0 ...
 $ a6    : num  1.4 0 0 0 0 0 0 0 1.2 0 ...
 $ a7    : num  2.1 0 0 0 0 0 0 0 0 0 ...
    season       size       speed         mxPH            mnO2       
 spring: 7   small :26   low   : 4   Min.   :5.600   Min.   : 5.700  
 summer: 6   medium: 1   medium:13   1st Qu.:6.600   1st Qu.: 9.275  
 autumn: 8   large : 7   high  :17   Median :7.225   Median :10.800  
 winter:13                           Mean   :7.344   Mean   :10.192  
                                     3rd Qu.:8.000   3rd Qu.:11.300  
                                     Max.   :9.700   Max.   :12.600  
                                     NA's   :2       NA's   :2       
       Cl              NO3               NH4              oPO4        
 Min.   : 0.222   Min.   : 0.2300   Min.   :  5.00   Min.   :  1.000  
 1st Qu.: 4.562   1st Qu.: 0.8025   1st Qu.: 10.00   1st Qu.:  1.000  
 Median : 9.027   Median : 1.4440   Median : 11.84   Median :  3.667  
 Mean   :19.312   Mean   : 2.1684   Mean   : 62.03   Mean   : 25.676  
 3rd Qu.:25.238   3rd Qu.: 2.5725   3rd Qu.: 46.38   3rd Qu.: 20.250  
 Max.   :71.000   Max.   :11.0200   Max.   :500.00   Max.   :295.667  
 NA's   :16       NA's   :2         NA's   :2        NA's   :2        
      PO4              Chla             a1              a2        
 Min.   :  1.00   Min.   : 0.30   Min.   : 0.00   Min.   : 0.000  
 1st Qu.:  6.00   1st Qu.: 1.78   1st Qu.:16.80   1st Qu.: 0.000  
 Median : 10.83   Median : 4.00   Median :30.30   Median : 0.000  
 Mean   : 34.58   Mean   :13.97   Mean   :36.02   Mean   : 4.503  
 3rd Qu.: 19.23   3rd Qu.:12.22   3rd Qu.:54.38   3rd Qu.: 3.400  
 Max.   :380.00   Max.   :68.05   Max.   :83.00   Max.   :36.500  
 NA's   :7        NA's   :23                                      
       a3               a4               a5               a6        
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 0.000   Median : 0.000   Median : 0.000   Median : 0.000  
 Mean   : 1.418   Mean   : 2.335   Mean   : 1.544   Mean   : 1.138  
 3rd Qu.: 1.125   3rd Qu.: 1.975   3rd Qu.: 0.900   3rd Qu.: 0.000  
 Max.   :14.600   Max.   :28.800   Max.   :21.100   Max.   :14.500  
                                                                    
       a7        
 Min.   : 0.000  
 1st Qu.: 0.000  
 Median : 0.000  
 Mean   : 1.676  
 3rd Qu.: 1.575  
 Max.   :28.000  
                 

Right off the bat, missing cases seem to be over-represented in small rivers and under-represented in low-speed rivers. But upon further investigation (that is, comparing with the original dataset), the under-representation of low-speed rivers is not problematic as it is in-line with the numbers in the larger dataset (by which I mean that low-speed rivers don't seem to have a systematic missing value problem, rather than the under-representation of low-speed rivers in the original dataset is not problematic).

I will assume for now (in the interest of efficiency) that the fact that small rivers have a lot of missing cases (mostly Cl and Chla) is also not a problem (YOU SHOULD PROBABLY VERIFY IF THAT IS INDEED THE CASE...)

And the bulk of the missing values seem to come from either Cl, Chla, or PO4. There also seems to be 2 observations for which we have very little information, but we can't use that smmary to determine if it's always the same two observations.

For instance, let's see what observations have missing NH4 values.

In [35]:
algae_blooms[which(is.na(algae_blooms$NH4)),]
seasonsizespeedmxPHmnO2ClNO3NH4oPO4PO4Chlaa1a2a3a4a5a6a7
62summersmall medium6.4 NA NA NA NA NA 14 NA 19.4 0.0 0.0 2 0 3.9 1.7
199winterlarge medium8.0 7.6 NA NA NA NA NA NA 0.0 12.5 3.7 1 0 0.0 4.9

While these observations also have missing values in other fields, there do have some non-missing fields as well.

But they're both missing 6 of the predictor variables. How useful could they be in training a predictive model? (that depends on the model, of course....)

We can write a functio nthat will compute how many missing cases there are for each observations.

In [36]:
table(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
  0   1   2   6 
306  20  12   2 
In [37]:
which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2)
  1. 62
  2. 199

Most observations have no missing cases, which is great, and there are a few with 1 or 2, but observations 62 and 199 are wild cards, with 6 missing predictors (out of 11).

Based on the small number of such wild cards, we elect to remove them from the analysis.

IMPORTANT NOTES:

  • If you decide to remove observations for any reason whatsoever, you need to document the process that lead you to eliminate it, and make that available.
  • Why do we remove the ones with 6 missing cases, but not the ones with 2 missing cases? If there'd been observations with 4 missing cases, what would you have done? What factors influence your decision?

Our new dataset still contains observations with missing cases, however.

In [38]:
algae_blooms.sna = algae_blooms[-which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2),]
dim(algae_blooms.sna)
  1. 338
  2. 18

What can we do with the other observations for which values are missing?

One possibility is to use the set of complete observations to compute a correlation matrix, and to see if any numerical field is strongly correlated with annother field. That way, if there is a missing value in the first field, the second could be used to impute it.

IMPORTANT NOTE:

  • this approach only works for variables that are linearly correlated to a single other variable. Non-linear correlations and multi-variate associations will not be uncovered.
In [39]:
symnum(cor(algae_blooms.sna[,4:18], use="complete.obs"))
     mP mO Cl NO NH o P Ch a1 a2 a3 a4 a5 a6 a7
mxPH 1                                         
mnO2    1                                      
Cl         1                                   
NO3           1                                
NH4           .  1                             
oPO4             .  1                          
PO4     .  .     .  * 1                        
Chla .                  1                      
a1         .        . .    1                   
a2                            1                
a3                               1             
a4      .        .  . .             1          
a5                                     1       
a6            .                        .  1    
a7                                           1 
attr(,"legend")
[1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1

This might be a bit hard to read. We can use the corrplot library to visualize the (linear) correlations.

In [40]:
library(corrplot)
In [41]:
corrplot(cor(algae_blooms.sna[,4:18], use="complete.obs"),type="upper",tl.pos="d")

The correlation between PO4 (which has missing cases) and oPO4 (which does not, anymore) is clear.

What is the nature of the relation? We'll use the set of complete cases to find it.

In [42]:
algae_blooms.nona <- algae_blooms.sna[-which(apply(algae_blooms.sna,1, function(x) sum(is.na(x)))>0),]
In [43]:
dim(algae_blooms.nona)
  1. 306
  2. 18
In [44]:
PO4.oPO4.model = lm(PO4 ~ oPO4, data=algae_blooms.nona)
PO4.oPO4.model
Call:
lm(formula = PO4 ~ oPO4, data = algae_blooms.nona)

Coefficients:
(Intercept)         oPO4  
     51.811        1.203  

The regression function is PO4$=51.811+1.203$oPO4 (note that I'm not really interested in the fit statistics).

In [45]:
Intercept = PO4.oPO4.model$coefficients[[1]]
Slope = PO4.oPO4.model$coefficients[[2]]

What are the observations for which PO4 is missing?

In [46]:
which(is.na(algae_blooms.sna$PO4)==TRUE)
  1. 28
  2. 221
  3. 291
  4. 326
  5. 331
  6. 335

Let's use the regression function to impute the missing PO4 values.

In [47]:
algae_blooms.sna2 <- algae_blooms.sna
algae_blooms.sna2$PO4 <- ifelse(is.na(algae_blooms.sna2$PO4),max(Intercept + Slope*algae_blooms.sna2$oPO4,0),algae_blooms.sna2$PO4)

We can clearly see that no values of PO4 are missing anymore.

In [48]:
which(is.na(algae_blooms.sna2$PO4)==TRUE)

That takes care of the missing values with strong linear correlation to another field. Where do we stand now?

In [49]:
summary(algae_blooms.sna2)
    season       size        speed          mxPH            mnO2       
 spring:84   small :120   low   : 58   Min.   :5.600   Min.   : 1.500  
 summer:85   medium:136   medium:138   1st Qu.:7.750   1st Qu.: 8.000  
 autumn:80   large : 82   high  :142   Median :8.055   Median : 9.700  
 winter:89                             Mean   :8.002   Mean   : 9.161  
                                       3rd Qu.:8.400   3rd Qu.:10.800  
                                       Max.   :9.700   Max.   :13.400  
                                       NA's   :2       NA's   :1       
       Cl               NO3              NH4                oPO4        
 Min.   :  0.222   Min.   : 0.000   Min.   :    5.00   Min.   :   1.00  
 1st Qu.: 10.994   1st Qu.: 1.147   1st Qu.:   37.86   1st Qu.:  13.00  
 Median : 32.470   Median : 2.356   Median :  107.36   Median :  37.24  
 Mean   : 42.517   Mean   : 3.121   Mean   :  471.73   Mean   :  73.09  
 3rd Qu.: 57.750   3rd Qu.: 4.147   3rd Qu.:  244.90   3rd Qu.:  88.11  
 Max.   :391.500   Max.   :45.650   Max.   :24064.00   Max.   :1435.00  
 NA's   :14                                                             
      PO4               Chla               a1              a2        
 Min.   :   1.00   Min.   :  0.200   Min.   : 0.00   Min.   : 0.000  
 1st Qu.:  40.75   1st Qu.:  2.133   1st Qu.: 1.50   1st Qu.: 0.000  
 Median : 104.86   Median :  5.111   Median : 7.10   Median : 2.800  
 Mean   : 166.18   Mean   : 12.796   Mean   :16.74   Mean   : 7.207  
 3rd Qu.: 206.12   3rd Qu.: 17.200   3rd Qu.:25.32   3rd Qu.:10.025  
 Max.   :1777.93   Max.   :110.456   Max.   :89.80   Max.   :72.600  
                   NA's   :21                                        
       a3               a4               a5               a6        
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 1.400   Median : 0.000   Median : 2.200   Median : 0.000  
 Mean   : 3.917   Mean   : 1.812   Mean   : 5.548   Mean   : 6.438  
 3rd Qu.: 4.675   3rd Qu.: 2.300   3rd Qu.: 8.000   3rd Qu.: 7.075  
 Max.   :42.800   Max.   :44.600   Max.   :61.100   Max.   :77.600  
                                                                    
       a7      
 Min.   : 0.0  
 1st Qu.: 0.0  
 Median : 0.0  
 Mean   : 2.2  
 3rd Qu.: 2.2  
 Max.   :31.6  
               

Alright, so we don't have as many missing values as before, but we still have some. And the correlation trick isn't going to work this time.

What can we do? There are many ways to tackle the problem, but we will use $k$NN imputation. The principle is simple. Using some similarity/distance metric (typically based on the Euclidean distance between points), we start by identifying the $k$ nearest (complete) neighbours of each observation with a missing case. We then compute the mean value of the missing case in the $k-$group of complete observations, and use that value as the imputed value.

IMPORTANT NOTES:

  • as we have seen when we were discussing, we often suggest scaling the data when dealing with distance metrics. I elected not to scale the data explicitely here. How much of an effect can that have?
  • we are going to be using DMwRs implementation of knnImputation() (below). How would you go about determining if the routine scales the data internally?
In [50]:
library(DMwR)
Loading required package: lattice
Loading required package: grid
In [51]:
algae_blooms.sna2 <- knnImputation(algae_blooms.sna2,k=10) # the choice of k=10 is arbitrary here.

Sure enough, there are no further observations with incomplete cases.

In [52]:
summary(algae_blooms.sna2)
table(apply(algae_blooms.sna2,1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
    season       size        speed          mxPH            mnO2       
 spring:84   small :120   low   : 58   Min.   :5.600   Min.   : 1.500  
 summer:85   medium:136   medium:138   1st Qu.:7.755   1st Qu.: 8.000  
 autumn:80   large : 82   high  :142   Median :8.045   Median : 9.750  
 winter:89                             Mean   :8.001   Mean   : 9.165  
                                       3rd Qu.:8.393   3rd Qu.:10.800  
                                       Max.   :9.700   Max.   :13.400  
       Cl               NO3              NH4                oPO4        
 Min.   :  0.222   Min.   : 0.000   Min.   :    5.00   Min.   :   1.00  
 1st Qu.: 10.581   1st Qu.: 1.147   1st Qu.:   37.86   1st Qu.:  13.00  
 Median : 31.233   Median : 2.356   Median :  107.36   Median :  37.24  
 Mean   : 41.326   Mean   : 3.121   Mean   :  471.73   Mean   :  73.09  
 3rd Qu.: 57.323   3rd Qu.: 4.147   3rd Qu.:  244.90   3rd Qu.:  88.11  
 Max.   :391.500   Max.   :45.650   Max.   :24064.00   Max.   :1435.00  
      PO4               Chla               a1              a2        
 Min.   :   1.00   Min.   :  0.200   Min.   : 0.00   Min.   : 0.000  
 1st Qu.:  40.75   1st Qu.:  2.100   1st Qu.: 1.50   1st Qu.: 0.000  
 Median : 104.86   Median :  5.042   Median : 7.10   Median : 2.800  
 Mean   : 166.18   Mean   : 12.328   Mean   :16.74   Mean   : 7.207  
 3rd Qu.: 206.12   3rd Qu.: 16.021   3rd Qu.:25.32   3rd Qu.:10.025  
 Max.   :1777.93   Max.   :110.456   Max.   :89.80   Max.   :72.600  
       a3               a4               a5               a6        
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 1.400   Median : 0.000   Median : 2.200   Median : 0.000  
 Mean   : 3.917   Mean   : 1.812   Mean   : 5.548   Mean   : 6.438  
 3rd Qu.: 4.675   3rd Qu.: 2.300   3rd Qu.: 8.000   3rd Qu.: 7.075  
 Max.   :42.800   Max.   :44.600   Max.   :61.100   Max.   :77.600  
       a7      
 Min.   : 0.0  
 1st Qu.: 0.0  
 Median : 0.0  
 Mean   : 2.2  
 3rd Qu.: 2.2  
 Max.   :31.6  
  0 
338 

Back to top

PREDICTION MODELS

For supervised learning tasks, the bias-variance trade-off means that we need to set aside a testing set on which the models (which learned on the training set) are evaluated.

There are no hard-and-fast rules to determine how to split the data - if the signal is very strong, a small training set should capture its important features, but we don't typically know how strong the signal is before we start the modeling process. On the other hand, if the training set is too large, we run the risk of overfitting the data.

We've elected to use a 65%/35% split in favour of the training set.

The training data should also be representative, to avoid injecting biases in the model (in case the data was provided according to some systematic but unknown rule). There are numerous ways to do this (see any introduction to sampling design), but we can do so using a simple random sample of 218 observations (we could also have stratified according to season, size, etc.).

To avoid issues due related to repeatability, we'll all use the same training set; I've commented out the code that would allow you to sample randomly.

In [53]:
#ind<-sample(1:dim(algae_blooms.sna2)[1],218, replace=FALSE)
ind <- 1:218
218/338
0.644970414201183
In [54]:
algae.train<-algae_blooms.sna2[ind,]
algae.test<-algae_blooms.sna2[-ind,]
In [55]:
dim(algae.train)
dim(algae.test)
  1. 218
  2. 18
  1. 120
  2. 18

GENERALIZED LINEAR MODEL

Before we get too excited about using various machine learning methods, let's see what the traditional approach yields.

We'll run a linear model to predict a2, for example, against all the predictor variables, but using only the training set as data.

In [56]:
linear.model.a2 <- lm(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,data=algae.train)

A summary of the results can be given by calling the summary method on the resulting object.

In [57]:
summary(linear.model.a2)
Call:
lm(formula = a2 ~ season + size + speed + mxPH + mnO2 + Cl + 
    NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.436  -5.281  -2.613   2.026  62.712 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.083e+01  1.257e+01  -2.452 0.015056 *  
seasonsummer -1.166e-01  2.112e+00  -0.055 0.956035    
seasonautumn  1.071e+00  2.370e+00   0.452 0.651934    
seasonwinter -1.451e+00  2.000e+00  -0.726 0.468935    
sizemedium   -2.628e+00  1.895e+00  -1.387 0.166896    
sizelarge    -3.210e+00  2.412e+00  -1.331 0.184767    
speedmedium   3.887e+00  2.485e+00   1.564 0.119325    
speedhigh    -1.104e+00  2.772e+00  -0.398 0.690751    
mxPH          4.859e+00  1.559e+00   3.117 0.002092 ** 
mnO2         -1.841e-01  3.924e-01  -0.469 0.639474    
Cl           -7.432e-03  2.006e-02  -0.371 0.711351    
NO3           2.132e-01  3.028e-01   0.704 0.482249    
NH4          -5.979e-04  5.355e-04  -1.117 0.265510    
oPO4          2.290e-03  9.876e-03   0.232 0.816875    
PO4          -1.559e-03  5.936e-03  -0.263 0.793090    
Chla          1.652e-01  4.614e-02   3.579 0.000432 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.74 on 202 degrees of freedom
Multiple R-squared:  0.206,	Adjusted R-squared:  0.147 
F-statistic: 3.493 on 15 and 202 DF,  p-value: 2.498e-05

We see that the adjusted $R^2$ coefficient is fairly small. Furthermore, if the linear model is a good fit, the residuals should have a mean of 0 and be as "small". That being said, the $F-$statistic seems to indicate that there is some (linear) dependence on the predictor variables.

We can get a better handle on the regression diagnostics by calling the plot() method on the object.

In [58]:
plot(linear.model.a2)

All in all, the linear model is nowhere near a great one, let alone a good one.

The significance of some of the coefficients is questionable, however, and we might wonder what effect their inclusion might have.

In [59]:
anova(linear.model.a2)
DfSum SqMean SqF valuePr(>F)
season 3 1.122589e+02 37.41963423.242880e-018.078029e-01
size 2 4.359809e+02 217.99045371.889160e+001.538604e-01
speed 2 1.552841e+03 776.42033226.728654e+001.482532e-03
mxPH 1 2.223541e+032223.54133421.926977e+011.828507e-05
mnO2 1 5.405028e-01 0.54050284.684133e-039.455025e-01
Cl 1 3.319361e-01 0.33193612.876642e-039.572795e-01
NO3 1 4.391250e+01 43.91249913.805568e-015.380001e-01
NH4 1 1.938211e+02 193.82110631.679703e+001.964428e-01
oPO4 1 9.138940e-02 0.09138947.920036e-049.775762e-01
PO4 1 4.816856e+00 4.81685614.174409e-028.383141e-01
Chla 1 1.478183e+031478.18338571.281031e+014.316486e-04
Residuals202 2.330881e+04 115.3901334 NA NA

An ANOVA finds that NH4 least contributes to the prediction of a2, and we might be interested in the results of a linear regression with that predictor removed.

In [60]:
linear.model.a2.mod <- update(linear.model.a2, . ~ . - NH4)
In [61]:
summary(linear.model.a2.mod)
Call:
lm(formula = a2 ~ season + size + speed + mxPH + mnO2 + Cl + 
    NO3 + oPO4 + PO4 + Chla, data = algae.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.801  -5.500  -2.647   2.504  63.259 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -30.996221  12.580354  -2.464 0.014577 *  
seasonsummer  -0.107996   2.113697  -0.051 0.959301    
seasonautumn   0.806683   2.359593   0.342 0.732799    
seasonwinter  -1.397244   2.000275  -0.699 0.485648    
sizemedium    -2.378831   1.882645  -1.264 0.207838    
sizelarge     -3.086404   2.411377  -1.280 0.202029    
speedmedium    3.637403   2.476278   1.469 0.143408    
speedhigh     -1.382060   2.762133  -0.500 0.617364    
mxPH           4.821140   1.559264   3.092 0.002268 ** 
mnO2          -0.074216   0.380118  -0.195 0.845398    
Cl            -0.001602   0.019376  -0.083 0.934181    
NO3           -0.013968   0.224437  -0.062 0.950437    
oPO4          -0.001285   0.009348  -0.137 0.890775    
PO4           -0.001518   0.005940  -0.256 0.798576    
Chla           0.165865   0.046166   3.593 0.000411 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.75 on 203 degrees of freedom
Multiple R-squared:  0.2011,	Adjusted R-squared:  0.146 
F-statistic: 3.649 on 14 and 203 DF,  p-value: 2.009e-05

The fit isn't much better, but an ANOVA on the 2 suggested models shows that we're ~88% that the models are different (see below).

In [62]:
anova(linear.model.a2,linear.model.a2.mod)
Res.DfRSSDfSum of SqFPr(>F)
202 23308.81 NA NA NA NA
203 23452.66 -1 -143.85671.246699 0.2655103

The function step uses AIC to perform a model search (using backward elimination), implemented in the method step().

In [63]:
final.linear.model.a2 <- step(linear.model.a2)
Start:  AIC=1050.52
a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + 
    PO4 + Chla

         Df Sum of Sq   RSS    AIC
- season  3    157.44 23466 1046.0
- oPO4    1      6.20 23315 1048.6
- PO4     1      7.96 23317 1048.6
- Cl      1     15.85 23325 1048.7
- mnO2    1     25.40 23334 1048.8
- NO3     1     57.19 23366 1049.0
- size    2    282.28 23591 1049.1
- NH4     1    143.86 23453 1049.9
<none>                23309 1050.5
- speed   2    967.47 24276 1055.4
- mxPH    1   1121.22 24430 1058.8
- Chla    1   1478.18 24787 1061.9

Step:  AIC=1045.98
a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + 
    Chla

        Df Sum of Sq   RSS    AIC
- oPO4   1      2.54 23469 1044.0
- PO4    1      4.10 23470 1044.0
- mnO2   1      6.61 23473 1044.0
- Cl     1     15.59 23482 1044.1
- size   2    257.60 23724 1044.4
- NO3    1     47.04 23513 1044.4
- NH4    1    114.06 23580 1045.0
<none>               23466 1046.0
- speed  2   1035.56 24502 1051.4
- mxPH   1   1052.01 24518 1053.5
- Chla   1   1477.06 24943 1057.3

Step:  AIC=1044.01
a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + PO4 + Chla

        Df Sum of Sq   RSS    AIC
- PO4    1      1.62 23470 1042.0
- mnO2   1      7.17 23476 1042.1
- Cl     1     14.19 23483 1042.1
- NO3    1     44.93 23514 1042.4
- size   2    266.73 23736 1042.5
- NH4    1    114.91 23584 1043.1
<none>               23469 1044.0
- speed  2   1050.55 24519 1049.5
- mxPH   1   1099.78 24569 1052.0
- Chla   1   1480.47 24949 1055.3

Step:  AIC=1042.02
a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- mnO2   1      6.59 23477 1040.1
- Cl     1     17.42 23488 1040.2
- size   2    265.19 23736 1040.5
- NO3    1     51.04 23521 1040.5
- NH4    1    140.72 23611 1041.3
<none>               23470 1042.0
- speed  2   1050.42 24521 1047.6
- mxPH   1   1105.21 24576 1050.0
- Chla   1   1482.34 24953 1053.4

Step:  AIC=1040.08
a2 ~ size + speed + mxPH + Cl + NO3 + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- Cl     1     13.41 23490 1038.2
- size   2    260.65 23738 1038.5
- NO3    1     44.48 23522 1038.5
- NH4    1    135.66 23613 1039.3
<none>               23477 1040.1
- speed  2   1121.64 24599 1046.3
- mxPH   1   1103.17 24580 1048.1
- Chla   1   1492.55 24970 1051.5

Step:  AIC=1038.21
a2 ~ size + speed + mxPH + NO3 + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- NO3    1     36.13 23526 1036.5
- size   2    275.91 23766 1036.8
- NH4    1    128.31 23619 1037.4
<none>               23490 1038.2
- speed  2   1172.78 24663 1044.8
- mxPH   1   1089.85 24580 1046.1
- Chla   1   1490.94 24981 1049.6

Step:  AIC=1036.54
a2 ~ size + speed + mxPH + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- size   2    244.91 23771 1034.8
- NH4    1     93.48 23620 1035.4
<none>               23526 1036.5
- speed  2   1164.36 24691 1043.1
- mxPH   1   1053.88 24580 1044.1
- Chla   1   1611.04 25138 1049.0

Step:  AIC=1034.8
a2 ~ speed + mxPH + NH4 + Chla

        Df Sum of Sq   RSS    AIC
- NH4    1     82.62 23854 1033.6
<none>               23771 1034.8
- mxPH   1    850.56 24622 1040.5
- speed  2   1085.45 24857 1040.5
- Chla   1   1540.50 25312 1046.5

Step:  AIC=1033.56
a2 ~ speed + mxPH + Chla

        Df Sum of Sq   RSS    AIC
<none>               23854 1033.6
- speed  2   1021.27 24875 1038.7
- mxPH   1    928.72 24783 1039.9
- Chla   1   1479.59 25334 1044.7

The "best" linear model for a2 is shown below.

In [64]:
summary(final.linear.model.a2)
Call:
lm(formula = a2 ~ speed + mxPH + Chla, data = algae.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.195  -6.008  -2.530   2.024  63.589 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -27.13270   11.07921  -2.449 0.015134 *  
speedmedium   4.17176    2.34330   1.780 0.076453 .  
speedhigh    -0.32929    2.41899  -0.136 0.891850    
mxPH          3.89794    1.35358   2.880 0.004387 ** 
Chla          0.15945    0.04387   3.635 0.000349 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.58 on 213 degrees of freedom
Multiple R-squared:  0.1874,	Adjusted R-squared:  0.1721 
F-statistic: 12.28 on 4 and 213 DF,  p-value: 5.289e-09

It's still not a great fit (adjusted $R^2$ is quite small), so the linear model is not ideal for a2.

In [65]:
anova(final.linear.model.a2)
DfSum SqMean SqF valuePr(>F)
speed 2 1994.841 997.4206 8.906264 0.0001928875
mxPH 1 2026.632 2026.6317 18.096394 0.0000314492
Chla 1 1479.589 1479.5892 13.211690 0.0003488051
Residuals213 23854.064 111.9909 NA NA
In [66]:
anova(linear.model.a2,final.linear.model.a2)
Res.DfRSSDfSum of SqFPr(>F)
202 23308.81 NA NA NA NA
213 23854.06 -11 -545.25730.42957610.9416056
In [67]:
plot(final.linear.model.a2)

REGRESSION TREE MODEL

An alternative to regression is the use of regression trees, an implementation of which is found in the library rpart. The syntax is similar to lm.

In [68]:
library(rpart)
In [69]:
regression.tree.a2 <-rpart(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train)

The outcome can be displayed by calling the object directly.

In [70]:
regression.tree.a2
n= 218 

node), split, n, deviance, yval
      * denotes terminal node

  1) root 218 29355.1300  7.6366970  
    2) Cl< 16.6875 83  1193.6400  1.8891570  
      4) size=small,medium 67   398.6457  0.9447761 *
      5) size=large 16   485.0194  5.8437500 *
    3) Cl>=16.6875 135 23733.9200 11.1703700  
      6) mxPH< 8.065 59  3831.8290  5.3864410  
       12) season=autumn,winter 29   561.8414  2.5172410 *
       13) season=spring,summer 30  2800.4720  8.1600000  
         26) mxPH< 7.9375 23   889.9730  5.3173910 *
         27) mxPH>=7.9375 7  1114.0000 17.5000000 *
      7) mxPH>=8.065 76 16396.0400 15.6605300  
       14) Chla>=2.65 68  9694.0890 13.8544100  
         28) Chla< 14.8875 29  2747.5810  8.7172410  
           56) NH4< 226.875 21   558.4257  5.7857140 *
           57) NH4>=226.875 8  1534.9490 16.4125000 *
         29) Chla>=14.8875 39  5612.0940 17.6743600  
           58) mnO2< 11.05 30  3139.0940 15.4233300  
            116) NH4>=158.409 8   577.1000  8.9000000 *
            117) NH4< 158.409 22  2097.7700 17.7954500  
              234) season=spring,autumn 14   674.7521 14.6642900 *
              235) season=summer,winter 8  1045.5550 23.2750000 *
           59) mnO2>=11.05 9  1814.2760 25.1777800 *
       15) Chla< 2.65 8  4594.6690 31.0125000 *

The tree structure can be hard to determine when there is a large number of nodes; let's see if we can improve on the visuals using rpart.plot.

In [71]:
library(rpart.plot)
In [72]:
prp(regression.tree.a2,extra=101,box.col="orange",split.box.col="gray")

Details on the regression tree can be obtained by calling the summary() method on the object.

In [73]:
summary(regression.tree.a2)
Call:
rpart(formula = a2 ~ season + size + speed + mxPH + mnO2 + Cl + 
    NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.train)
  n= 218 

           CP nsplit rel error    xerror      xstd
1  0.15082765      0 1.0000000 1.0114523 0.1997920
2  0.11943572      1 0.8491724 0.9748608 0.1731060
3  0.07178590      2 0.7297366 0.9198268 0.1672873
4  0.04545758      3 0.6579507 0.9334977 0.1700251
5  0.02243987      4 0.6124932 0.9742826 0.1806729
6  0.02228595      5 0.5900533 0.9925243 0.1909517
7  0.02156378      6 0.5677673 0.9722725 0.1712347
8  0.01581407      8 0.5246398 0.9988537 0.1815813
9  0.01285848      9 0.5088257 0.9850797 0.1799236
10 0.01055949     10 0.4959672 1.0042841 0.1813917
11 0.01000000     11 0.4854077 0.9953589 0.1813778

Variable importance
  Chla    NH4     Cl   mxPH   oPO4    PO4    NO3  speed   mnO2 season   size 
    19     14     14     13     11      9      6      5      4      3      2 

Node number 1: 218 observations,    complexity param=0.1508276
  mean=7.636697, MSE=134.6565 
  left son=2 (83 obs) right son=3 (135 obs)
  Primary splits:
      Cl   < 16.6875  to the left,  improve=0.15082760, (0 missing)
      mxPH < 7.94     to the left,  improve=0.14900670, (0 missing)
      oPO4 < 45.1     to the left,  improve=0.11106510, (0 missing)
      Chla < 12.21    to the left,  improve=0.09817759, (0 missing)
      NH4  < 46.35    to the left,  improve=0.09032131, (0 missing)
  Surrogate splits:
      PO4   < 70.465   to the left,  agree=0.844, adj=0.590, (0 split)
      oPO4  < 19.8635  to the left,  agree=0.835, adj=0.566, (0 split)
      NH4   < 46.35    to the left,  agree=0.807, adj=0.494, (0 split)
      Chla  < 2.225    to the left,  agree=0.807, adj=0.494, (0 split)
      speed splits as  RRL,          agree=0.775, adj=0.410, (0 split)

Node number 2: 83 observations,    complexity param=0.01055949
  mean=1.889157, MSE=14.38121 
  left son=4 (67 obs) right son=5 (16 obs)
  Primary splits:
      size  splits as  LLR,          improve=0.25968900, (0 missing)
      Chla  < 4.4835   to the left,  improve=0.16315540, (0 missing)
      mxPH  < 7.835    to the left,  improve=0.07327114, (0 missing)
      oPO4  < 18.889   to the left,  improve=0.05858864, (0 missing)
      speed splits as  LRL,          improve=0.05534763, (0 missing)
  Surrogate splits:
      speed splits as  LRL,          agree=0.843, adj=0.188, (0 split)
      mxPH  < 8.385    to the left,  agree=0.831, adj=0.125, (0 split)
      mnO2  < 8.3      to the right, agree=0.831, adj=0.125, (0 split)
      Cl    < 15.2705  to the left,  agree=0.819, adj=0.062, (0 split)

Node number 3: 135 observations,    complexity param=0.1194357
  mean=11.17037, MSE=175.8068 
  left son=6 (59 obs) right son=7 (76 obs)
  Primary splits:
      mxPH < 8.065    to the left,  improve=0.14772320, (0 missing)
      NO3  < 0.3785   to the right, improve=0.09864039, (0 missing)
      Chla < 2.85     to the right, improve=0.06998520, (0 missing)
      mnO2 < 11.05    to the left,  improve=0.06691501, (0 missing)
      NH4  < 474.1875 to the right, improve=0.05884159, (0 missing)
  Surrogate splits:
      NH4  < 272.2915 to the right, agree=0.689, adj=0.288, (0 split)
      Chla < 11.65    to the left,  agree=0.667, adj=0.237, (0 split)
      NO3  < 5.37     to the right, agree=0.659, adj=0.220, (0 split)
      mnO2 < 6.55     to the left,  agree=0.622, adj=0.136, (0 split)
      oPO4 < 279.5085 to the right, agree=0.607, adj=0.102, (0 split)

Node number 4: 67 observations
  mean=0.9447761, MSE=5.949935 

Node number 5: 16 observations
  mean=5.84375, MSE=30.31371 

Node number 6: 59 observations,    complexity param=0.02156378
  mean=5.386441, MSE=64.94626 
  left son=12 (29 obs) right son=13 (30 obs)
  Primary splits:
      season splits as  RRLL,         improve=0.12253050, (0 missing)
      NH4    < 339.9305 to the right, improve=0.08821754, (0 missing)
      mxPH   < 7.905    to the left,  improve=0.07581077, (0 missing)
      Chla   < 2.5      to the right, improve=0.05654547, (0 missing)
      oPO4   < 25.75    to the left,  improve=0.05324680, (0 missing)
  Surrogate splits:
      NO3  < 2.8605   to the right, agree=0.712, adj=0.414, (0 split)
      mnO2 < 9.75     to the right, agree=0.661, adj=0.310, (0 split)
      oPO4 < 144.1885 to the left,  agree=0.627, adj=0.241, (0 split)
      Chla < 6.85     to the right, agree=0.627, adj=0.241, (0 split)
      mxPH < 7.975    to the right, agree=0.610, adj=0.207, (0 split)

Node number 7: 76 observations,    complexity param=0.0717859
  mean=15.66053, MSE=215.7374 
  left son=14 (68 obs) right son=15 (8 obs)
  Primary splits:
      Chla < 2.65     to the right, improve=0.12852400, (0 missing)
      mnO2 < 11.05    to the left,  improve=0.10098910, (0 missing)
      NO3  < 0.6045   to the right, improve=0.08727846, (0 missing)
      Cl   < 35.705   to the right, improve=0.07266453, (0 missing)
      mxPH < 8.375    to the right, improve=0.04660101, (0 missing)
  Surrogate splits:
      Cl   < 19.11    to the right, agree=0.921, adj=0.25, (0 split)
      NO3  < 0.076    to the right, agree=0.921, adj=0.25, (0 split)
      NH4  < 2713     to the left,  agree=0.921, adj=0.25, (0 split)
      oPO4 < 8        to the right, agree=0.921, adj=0.25, (0 split)
      PO4  < 40.3335  to the right, agree=0.921, adj=0.25, (0 split)

Node number 12: 29 observations
  mean=2.517241, MSE=19.37384 

Node number 13: 30 observations,    complexity param=0.02156378
  mean=8.16, MSE=93.34907 
  left son=26 (23 obs) right son=27 (7 obs)
  Primary splits:
      mxPH < 7.9375   to the left,  improve=0.28441600, (0 missing)
      NH4  < 119.2855 to the left,  improve=0.10978520, (0 missing)
      oPO4 < 42.857   to the left,  improve=0.06038492, (0 missing)
      Cl   < 37.624   to the left,  improve=0.05321153, (0 missing)
      Chla < 3.1      to the right, improve=0.04623221, (0 missing)
  Surrogate splits:
      NO3 < 1.0425   to the right, agree=0.833, adj=0.286, (0 split)

Node number 14: 68 observations,    complexity param=0.04545758
  mean=13.85441, MSE=142.5601 
  left son=28 (29 obs) right son=29 (39 obs)
  Primary splits:
      Chla < 14.8875  to the left,  improve=0.13765220, (0 missing)
      Cl   < 68.875   to the right, improve=0.10411670, (0 missing)
      NH4  < 228.175  to the left,  improve=0.09585894, (0 missing)
      NO3  < 5.247    to the left,  improve=0.08016472, (0 missing)
      oPO4 < 45.1     to the left,  improve=0.05486673, (0 missing)
  Surrogate splits:
      size   splits as  LRR,          agree=0.706, adj=0.310, (0 split)
      mxPH   < 8.215    to the left,  agree=0.676, adj=0.241, (0 split)
      oPO4   < 180.8335 to the right, agree=0.662, adj=0.207, (0 split)
      NO3    < 3.779    to the right, agree=0.647, adj=0.172, (0 split)
      season splits as  RLRR,         agree=0.632, adj=0.138, (0 split)

Node number 15: 8 observations
  mean=31.0125, MSE=574.3336 

Node number 26: 23 observations
  mean=5.317391, MSE=38.69448 

Node number 27: 7 observations
  mean=17.5, MSE=159.1429 

Node number 28: 29 observations,    complexity param=0.02228595
  mean=8.717241, MSE=94.74419 
  left son=56 (21 obs) right son=57 (8 obs)
  Primary splits:
      NH4  < 226.875  to the left,  improve=0.23810280, (0 missing)
      Cl   < 68.875   to the right, improve=0.12545720, (0 missing)
      NO3  < 3.8045   to the right, improve=0.10340650, (0 missing)
      mnO2 < 9.6      to the right, improve=0.08307998, (0 missing)
      oPO4 < 45.1     to the left,  improve=0.07494295, (0 missing)
  Surrogate splits:
      Chla < 12.08    to the left,  agree=0.862, adj=0.5, (0 split)

Node number 29: 39 observations,    complexity param=0.02243987
  mean=17.67436, MSE=143.8999 
  left son=58 (30 obs) right son=59 (9 obs)
  Primary splits:
      mnO2 < 11.05    to the left,  improve=0.11737600, (0 missing)
      mxPH < 8.35     to the right, improve=0.11494980, (0 missing)
      Cl   < 51.6665  to the right, improve=0.11368430, (0 missing)
      NO3  < 5.0885   to the left,  improve=0.10842770, (0 missing)
      NH4  < 231.3    to the left,  improve=0.08663984, (0 missing)
  Surrogate splits:
      NH4  < 245.682  to the left,  agree=0.821, adj=0.222, (0 split)
      oPO4 < 19.4585  to the right, agree=0.821, adj=0.222, (0 split)
      PO4  < 96.283   to the right, agree=0.821, adj=0.222, (0 split)
      Chla < 16.4     to the right, agree=0.821, adj=0.222, (0 split)
      size splits as  RLL,          agree=0.795, adj=0.111, (0 split)

Node number 56: 21 observations
  mean=5.785714, MSE=26.5917 

Node number 57: 8 observations
  mean=16.4125, MSE=191.8686 

Node number 58: 30 observations,    complexity param=0.01581407
  mean=15.42333, MSE=104.6365 
  left son=116 (8 obs) right son=117 (22 obs)
  Primary splits:
      NH4  < 158.409  to the right, improve=0.14788480, (0 missing)
      Chla < 48.15    to the left,  improve=0.13800510, (0 missing)
      mnO2 < 9.95     to the right, improve=0.12649850, (0 missing)
      NO3  < 0.698    to the right, improve=0.11199640, (0 missing)
      oPO4 < 144.0415 to the left,  improve=0.08054276, (0 missing)
  Surrogate splits:
      mxPH < 8.27     to the left,  agree=0.767, adj=0.125, (0 split)
      oPO4 < 36.4     to the left,  agree=0.767, adj=0.125, (0 split)

Node number 59: 9 observations
  mean=25.17778, MSE=201.5862 

Node number 116: 8 observations
  mean=8.9, MSE=72.1375 

Node number 117: 22 observations,    complexity param=0.01285848
  mean=17.79545, MSE=95.35316 
  left son=234 (14 obs) right son=235 (8 obs)
  Primary splits:
      season splits as  LRLR,         improve=0.1799351, (0 missing)
      NO3    < 1.2775   to the right, improve=0.1189142, (0 missing)
      Chla   < 31.869   to the left,  improve=0.1160835, (0 missing)
      NH4    < 99.6665  to the left,  improve=0.1091819, (0 missing)
      Cl     < 42.818   to the left,  improve=0.1085589, (0 missing)
  Surrogate splits:
      NO3   < 3.2455   to the left,  agree=0.727, adj=0.250, (0 split)
      oPO4  < 103.293  to the left,  agree=0.727, adj=0.250, (0 split)
      size  splits as  RLL,          agree=0.682, adj=0.125, (0 split)
      speed splits as  LRL,          agree=0.682, adj=0.125, (0 split)
      mnO2  < 7.4      to the left,  agree=0.682, adj=0.125, (0 split)

Node number 234: 14 observations
  mean=14.66429, MSE=48.19658 

Node number 235: 8 observations
  mean=23.275, MSE=130.6944 


rpart() grows a tree until one of the following criterion is met:

  • decrease in deviance goes below a certain threshold (cp)
  • number of samples in a node is below some other threshold (minsplit)
  • depth of the tree crosses yet another threshold (maxdepth)

rpart implements a pruning method based on cost complexity: finding the value of cp which best balances predictive accuracy and tree size.

In [74]:
printcp(regression.tree.a2)
Regression tree:
rpart(formula = a2 ~ season + size + speed + mxPH + mnO2 + Cl + 
    NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.train)

Variables actually used in tree construction:
[1] Chla   Cl     mnO2   mxPH   NH4    season size  

Root node error: 29355/218 = 134.66

n= 218 

         CP nsplit rel error  xerror    xstd
1  0.150828      0   1.00000 1.01145 0.19979
2  0.119436      1   0.84917 0.97486 0.17311
3  0.071786      2   0.72974 0.91983 0.16729
4  0.045458      3   0.65795 0.93350 0.17003
5  0.022440      4   0.61249 0.97428 0.18067
6  0.022286      5   0.59005 0.99252 0.19095
7  0.021564      6   0.56777 0.97227 0.17123
8  0.015814      8   0.52464 0.99885 0.18158
9  0.012858      9   0.50883 0.98508 0.17992
10 0.010559     10   0.49597 1.00428 0.18139
11 0.010000     11   0.48541 0.99536 0.18138

The tree returned by rpart() is the final one (cp$=0.01$ is the default value), requires 11 tests, and has a relative error of 0.485. Internally, rpart() uses 10-fold cross-validation to estimate that the tree has an average relative error of $0.98\pm 0.18$ (these values might change when you run the model due to the stochastic nature of the internal cross-validation routines).

In this framework, the optimal tree minimizes the value of xerror. Alternatively, one could use the 1-SE rule to find the minimal xerror + xstd tree.

In [75]:
plotcp(regression.tree.a2)

The scree plot above suggests that cp$=0.08$ (that value may change when you run yours due to the stochastic nature of the internal cross-validation algorithm) is a special value for tree growth, so we could prune the tree using that specific value.

In [76]:
regression.tree.a2.mod <- prune(regression.tree.a2,cp=0.05)
In [77]:
regression.tree.a2.mod
n= 218 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 218 29355.130  7.636697  
   2) Cl< 16.6875 83  1193.640  1.889157 *
   3) Cl>=16.6875 135 23733.920 11.170370  
     6) mxPH< 8.065 59  3831.829  5.386441 *
     7) mxPH>=8.065 76 16396.040 15.660530  
      14) Chla>=2.65 68  9694.089 13.854410 *
      15) Chla< 2.65 8  4594.669 31.012500 *
In [78]:
prp(regression.tree.a2.mod,extra=101,box.col="orange",split.box.col="gray")

The entire process is automated in the wrapper method rpartXse() provided with the DMwr library. In our implementation, we (abitrarily) use se$=0.3$.

In [79]:
library(DMwR)
In [80]:
(regression.tree.a2.final <- rpartXse(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train,se=0.3))
n= 218 

node), split, n, deviance, yval
      * denotes terminal node

1) root 218 29355.130  7.636697  
  2) Cl< 16.6875 83  1193.640  1.889157 *
  3) Cl>=16.6875 135 23733.920 11.170370  
    6) mxPH< 8.065 59  3831.829  5.386441 *
    7) mxPH>=8.065 76 16396.040 15.660530 *
In [81]:
summary(regression.tree.a2.final)
Call:
rpart(formula = form, data = data, cp = cp, minsplit = minsplit)
  n= 218 

         CP nsplit rel error    xerror      xstd
1 0.1508276      0 1.0000000 1.0086512 0.2002729
2 0.1194357      1 0.8491724 0.9951432 0.1955481
3 0.0821150      2 0.7297366 0.9226403 0.1761566

Variable importance
   Cl  mxPH   NH4  Chla  oPO4   PO4 speed   NO3  mnO2 
   20    15    14    13    13    12     8     3     2 

Node number 1: 218 observations,    complexity param=0.1508276
  mean=7.636697, MSE=134.6565 
  left son=2 (83 obs) right son=3 (135 obs)
  Primary splits:
      Cl   < 16.6875  to the left,  improve=0.15082760, (0 missing)
      mxPH < 7.94     to the left,  improve=0.14900670, (0 missing)
      NO3  < 0.18     to the right, improve=0.11564070, (0 missing)
      oPO4 < 45.1     to the left,  improve=0.11106510, (0 missing)
      Chla < 12.21    to the left,  improve=0.09817759, (0 missing)
  Surrogate splits:
      PO4   < 70.465   to the left,  agree=0.844, adj=0.590, (0 split)
      oPO4  < 19.8635  to the left,  agree=0.835, adj=0.566, (0 split)
      NH4   < 46.35    to the left,  agree=0.807, adj=0.494, (0 split)
      Chla  < 2.225    to the left,  agree=0.807, adj=0.494, (0 split)
      speed splits as  RRL,          agree=0.775, adj=0.410, (0 split)

Node number 2: 83 observations
  mean=1.889157, MSE=14.38121 

Node number 3: 135 observations,    complexity param=0.1194357
  mean=11.17037, MSE=175.8068 
  left son=6 (59 obs) right son=7 (76 obs)
  Primary splits:
      mxPH < 8.065    to the left,  improve=0.14772320, (0 missing)
      NO3  < 0.319    to the right, improve=0.11867590, (0 missing)
      Chla < 2.85     to the right, improve=0.06998520, (0 missing)
      mnO2 < 11.05    to the left,  improve=0.06691501, (0 missing)
      NH4  < 474.1875 to the right, improve=0.05884159, (0 missing)
  Surrogate splits:
      NH4  < 272.2915 to the right, agree=0.689, adj=0.288, (0 split)
      Chla < 11.65    to the left,  agree=0.667, adj=0.237, (0 split)
      NO3  < 5.37     to the right, agree=0.659, adj=0.220, (0 split)
      mnO2 < 6.55     to the left,  agree=0.622, adj=0.136, (0 split)
      oPO4 < 279.5085 to the right, agree=0.607, adj=0.102, (0 split)

Node number 6: 59 observations
  mean=5.386441, MSE=64.94626 

Node number 7: 76 observations
  mean=15.66053, MSE=215.7374 

In [82]:
prp(regression.tree.a2.final,extra=101,box.col="orange",split.box.col="gray")

The resulting tree is not nearly as complex as the original tree (hence discourages overfitting) but is still more complex than the pruned tree (which should improve predicting accuracy).


Back to top

MODEL EVALUATION

At this stage, we know that the linear model is not great for a2, and we've seen how to grow a regression tree for a2 but we have not yet discussed whether this model is a good fit, to say nothing of the remaining 6 algae concentrations.

Can we get a better handle on these models' performance (i.e. comparing the model predictions to the real values of the target variable in the test data)?

Various metrics can be used to determine how the values compare:

  • mean absolute error: (MAE) $$\text{mean}\left\{\left|\text{pred}_i-\text{real}_i\right|; i=1,..., N\right\}$$
  • mean squared error: (MSE) $$\text{mean}\left\{\left(\text{pred}_i-\text{real}_i\right)^2; i=1,..., N\right\}$$
  • normalized mean squared error (NMSE): $$\frac{\text{MSE}}{\text{mean}\left\{\left(\overline{\text{real}}-\text{real}_i\right)^2; i=1,..., N\right\}}$$

As the ratio of MSE to a baseline predictor (the mean of the value of the target), NMSE is unitless. NMSE values between 0 and 1 (smaller is better) indicate that the model performs better than the baseline; greater than 1 indicate that the model's performance is sub-par.

There are other metrics, but these are the most common.

So how does model evaluation work? Typically, one computes the selected evaluation metric for a variety of models, and picks the model which provides the optimal metric value.

$k$-fold cross-validation is often used to try to mitigate the effects of overfitting for mid-size datasets (no more than 50,000 observations, say).

  1. The training data is randomly partitioned into $k$ distinct subsets of equal size (give or take an observation here and there)
  2. Each of the $k$ subsets is used as a testing set, while the remaining $k-1$ subsets are used to train a different version of the model (giving rise to $k$ models, and so to $k$ evaluation values on the $k$ testing sets)
  3. The first two steps are repeated $n$ times
  4. The final $k-$fold score is the average of the $n\times k$ evaluation values (often the standard error is also recorded)

performanceEvaluation() can be used to automate the process. Here, we'll use the linear model and 5 regressiont trees (parameterized by se), and 5 $10-$fold cross-validation to select the best predictive model for a2.

In [83]:
library(performanceEstimation)
In [84]:
kCV.results.algae.a2 <- performanceEstimation(
    PredTask(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train,"a2"),
    c(Workflow(learner="lm",post="onlyPos"),
    workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.25,0.5,0.75,1)))
     ),
    EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10)) 
)

##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####

** PREDICTIVE TASK :: a2

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v4 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v5 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************

A summary and plot of the cross-validation results for NMSE can be displayed using calls to summary() and plot().

In [85]:
summary(kCV.results.algae.a2)
== Summary of a  Cross Validation Performance Estimation Experiment ==

Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 

* Predictive Tasks ::  a2
* Workflows  ::  lm, rpartXse.v1, rpartXse.v2, rpartXse.v3, rpartXse.v4, rpartXse.v5 

-> Task:  a2
  *Workflow: lm 
             nmse
avg     0.9880781
std     0.3682616
med     0.9470239
iqr     0.2817843
min     0.4869917
max     2.5236216
invalid 0.0000000

  *Workflow: rpartXse.v1 
             nmse
avg     1.0333720
std     0.3406970
med     1.0000000
iqr     0.1842643
min     0.6171205
max     2.4535376
invalid 0.0000000

  *Workflow: rpartXse.v2 
             nmse
avg     1.0596868
std     0.3147441
med     1.0000000
iqr     0.0435684
min     0.5049684
max     2.4535376
invalid 0.0000000

  *Workflow: rpartXse.v3 
                nmse
avg     1.028517e+00
std     2.301808e-01
med     1.000000e+00
iqr     3.330669e-16
min     5.283415e-01
max     2.365684e+00
invalid 0.000000e+00

  *Workflow: rpartXse.v4 
                nmse
avg     1.012748e+00
std     7.803506e-02
med     1.000000e+00
iqr     3.330669e-16
min     8.198276e-01
max     1.413850e+00
invalid 0.000000e+00

  *Workflow: rpartXse.v5 
                nmse
avg     1.001631e+00
std     1.153258e-02
med     1.000000e+00
iqr     3.330669e-16
min     1.000000e+00
max     1.081548e+00
invalid 0.000000e+00
In [86]:
plot(kCV.results.algae.a2)

It's not necessarily clear which of the models has smaller values of NMSE, although it does seem that the latter versions of the regression tree models are not substantially better than the baseline model.

The first regression tree model sometimes produces very small NMSE values, but that's offset by some of the larger values it also produces (similarly for the linear model).

At any rate, visual evidence seems to suggest that the linear model is the best predictive model for a2 given the training data (in this version of $k$CV), which is corrobated by a call to topPerformers().

In [87]:
topPerformers(kCV.results.algae.a2)
$a2 =
WorkflowEstimate
nmselm 0.988

This might seem disheartening at first given how poorly the linear model performed, but it might be helpful to remember that there is no guarantee that a decent predictive model even exists.

Furthermore, regression trees and linear models are only two of a whole collection of possible models. How do support vector machines perform the task, for instance?


This time, however, we will learn models and perform evaluation for all target variables (a1-a7) simultaneously. This does not mean that we are looking for a single model which will optimize all learning tasks at once, but rather that we can prepare and evaluate the models for each target variable with the same bit of code.

This first require some code to create the appropriate model formulas (a1 ~ ., ... ,a7 ~ .) and the appropriate training data.

In [88]:
gg<-function(x,list.of.variables){    
    PredTask(as.formula(paste(x,"~ .")),algae.train[,c(list.of.variables,x)], x, copy=TRUE)
}
data.sources <- sapply(names(algae.train[12:18]),gg,names(algae.train[1:11]))
data.sources
$a1
Prediction Task Object:
	Task Name         :: a1 
	Task Type         :: regression 
	Target Feature    :: a1 
	Formula           :: a1 ~ .
<environment: 0xae16bd0>
	Task Data Source  :: internal  218x12 data frame.

$a2
Prediction Task Object:
	Task Name         :: a2 
	Task Type         :: regression 
	Target Feature    :: a2 
	Formula           :: a2 ~ .
<environment: 0xad7db40>
	Task Data Source  :: internal  218x12 data frame.

$a3
Prediction Task Object:
	Task Name         :: a3 
	Task Type         :: regression 
	Target Feature    :: a3 
	Formula           :: a3 ~ .
<environment: 0xa333ec8>
	Task Data Source  :: internal  218x12 data frame.

$a4
Prediction Task Object:
	Task Name         :: a4 
	Task Type         :: regression 
	Target Feature    :: a4 
	Formula           :: a4 ~ .
<environment: 0xa2d98e8>
	Task Data Source  :: internal  218x12 data frame.

$a5
Prediction Task Object:
	Task Name         :: a5 
	Task Type         :: regression 
	Target Feature    :: a5 
	Formula           :: a5 ~ .
<environment: 0xad09890>
	Task Data Source  :: internal  218x12 data frame.

$a6
Prediction Task Object:
	Task Name         :: a6 
	Task Type         :: regression 
	Target Feature    :: a6 
	Formula           :: a6 ~ .
<environment: 0xa77c288>
	Task Data Source  :: internal  218x12 data frame.

$a7
Prediction Task Object:
	Task Name         :: a7 
	Task Type         :: regression 
	Target Feature    :: a7 
	Formula           :: a7 ~ .
<environment: 0xa78f860>
	Task Data Source  :: internal  218x12 data frame.

We shall use e1071's implementation of svm(), with various values of the svm()-specific parameters cost and gamma.

In [89]:
library(e1071)
In [90]:
kCV.results.algae.all <- performanceEstimation(
    data.sources,
    c(Workflow(learner="lm",post="onlyPos"), 
      Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
      workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1)))
     ),
    EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10)) 
)

##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####

** PREDICTIVE TASK :: a1

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a2

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a3

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a4

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a5

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a6

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a7

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************

The rest of the evaluation proceeds much as before, except that we can display results for the 7 target variables simultaneously.

In [91]:
plot(kCV.results.algae.all)
In [92]:
rankWorkflows(kCV.results.algae.all,top=3)
$a1
$nmse =
WorkflowEstimate
rpartXse.v10.5936469
svm 0.6625644
rpartXse.v20.6634751
$a2
$nmse =
WorkflowEstimate
lm 0.9880781
svm 0.9937575
rpartXse.v31.0016310
$a3
$nmse =
WorkflowEstimate
svm 0.9322382
lm 0.9609420
rpartXse.v21.0000000
$a4
$nmse =
WorkflowEstimate
rpartXse.v31.085440
rpartXse.v21.320213
lm 1.470017
$a5
$nmse =
WorkflowEstimate
svm 0.9294475
rpartXse.v31.0035338
rpartXse.v21.0321593
$a6
$nmse =
WorkflowEstimate
svm 1.031361
rpartXse.v31.060985
rpartXse.v21.100113
$a7
$nmse =
WorkflowEstimate
rpartXse.v21.000000
rpartXse.v31.000000
rpartXse.v11.068913
In [93]:
topPerformers(kCV.results.algae.all)
$a1
WorkflowEstimate
nmserpartXse.v10.594
$a2
WorkflowEstimate
nmselm 0.988
$a3
WorkflowEstimate
nmsesvm 0.932
$a4
WorkflowEstimate
nmserpartXse.v31.085
$a5
WorkflowEstimate
nmsesvm 0.929
$a6
WorkflowEstimate
nmsesvm 1.031
$a7
WorkflowEstimate
nmserpartXse.v21

For a1, the models seem to perform reasonably well. But it's not so rosy for the other target variables, where the baseline model is sometimes better.

Again, this could be built-in in the data, but we might benefit from incorporating more models. As a last attempt to improve the situation, let's include random forests.

In [94]:
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin

The following object is masked from ‘package:psych’:

    outlier

In [95]:
kCV.results.algae.all.rf <- performanceEstimation(
    data.sources,
    c(Workflow(learner="lm",post="onlyPos"), 
      Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
      workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1))),
      workflowVariants(learner="randomForest",learner.pars=list(ntree=c(200,500,700)))
     ),
    EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10)) 
)

##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####

** PREDICTIVE TASK :: a1

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a2

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a3

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a4

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a5

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a6

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


** PREDICTIVE TASK :: a7

++ MODEL/WORKFLOW :: lm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: svm 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: rpartXse.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v1 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v2 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************


++ MODEL/WORKFLOW :: randomForest.v3 
Task for estimating  nmse  using
 5 x 10 - Fold Cross Validation
	 Run with seed =  1234 
Iteration :**************************************************
In [96]:
rankWorkflows(kCV.results.algae.all.rf,top=3)
$a1
$nmse =
WorkflowEstimate
randomForest.v30.5457269
randomForest.v20.5461815
randomForest.v10.5489013
$a2
$nmse =
WorkflowEstimate
randomForest.v30.8130037
randomForest.v20.8159825
randomForest.v10.8192456
$a3
$nmse =
WorkflowEstimate
randomForest.v20.9247299
randomForest.v30.9252515
randomForest.v10.9260748
$a4
$nmse =
WorkflowEstimate
randomForest.v21.005694
randomForest.v31.005728
randomForest.v11.022519
$a5
$nmse =
WorkflowEstimate
randomForest.v30.7511166
randomForest.v20.7535201
randomForest.v10.7535520
$a6
$nmse =
WorkflowEstimate
randomForest.v30.8902371
randomForest.v20.8964133
randomForest.v10.9074734
$a7
$nmse =
WorkflowEstimate
rpartXse.v21.000000
rpartXse.v31.000000
rpartXse.v11.068913

rankWorkflows() does not report on the standard error, and so we cannot tell whether the differences between the score of the best model and the other models is statistically significant.

randomForest.v3 seems to have the best ranking across all learning tasks, so we will use it as the baseline model.

In [97]:
p<-pairedComparisons(kCV.results.algae.all.rf,baseline="randomForest.v3")
p$nmse$F.test
p$nmse$BonferroniDunn.test
$chi
26.2857142857143
$FF
6.94339622641511
$critVal
0.707123052989597
$rejNull
TRUE
$critDif
3.52218015388584
$baseline
'randomForest.v3'
$rkDifs
lm
4.85714285714286
svm
2.85714285714286
rpartXse.v1
4.57142857142857
rpartXse.v2
3.57142857142857
rpartXse.v3
3.14285714285714
randomForest.v1
1.71428571428571
randomForest.v2
0.428571428571428
$signifDifs
lm
TRUE
svm
FALSE
rpartXse.v1
TRUE
rpartXse.v2
TRUE
rpartXse.v3
FALSE
randomForest.v1
FALSE
randomForest.v2
FALSE

We can reject with 95% certainty that the performance of the baseline method (randomForest.v3) is the same as that of the linear model and the first 2 regression trees, but not that it is better than svm, rpartXse.v3, and the other 2 random forests.

The information is also displayed in the Bonferroni-Dunn CD diagram below.

In [98]:
CDdiagram.BD(p)

Back to top

MODEL PREDICTIONS

Finally, we might actually be interested in generating predictions for each of the target variables in the testing set.

This requires simply that the best performers for each target be brought together in an R object.

In [99]:
best.performers <- sapply(taskNames(kCV.results.algae.all.rf), function(x) topPerformer(kCV.results.algae.all.rf,metric="nmse",task=x))
In [100]:
best.performers
$a1
Workflow Object:
	Workflow ID       ::  randomForest.v3 
	Workflow Function ::  standardWF
	     Parameter values:
		 learner.pars  -> ntree=700 
		 learner  -> randomForest 


$a2
Workflow Object:
	Workflow ID       ::  randomForest.v3 
	Workflow Function ::  standardWF
	     Parameter values:
		 learner.pars  -> ntree=700 
		 learner  -> randomForest 


$a3
Workflow Object:
	Workflow ID       ::  randomForest.v2 
	Workflow Function ::  standardWF
	     Parameter values:
		 learner.pars  -> ntree=500 
		 learner  -> randomForest 


$a4
Workflow Object:
	Workflow ID       ::  randomForest.v2 
	Workflow Function ::  standardWF
	     Parameter values:
		 learner.pars  -> ntree=500 
		 learner  -> randomForest 


$a5
Workflow Object:
	Workflow ID       ::  randomForest.v3 
	Workflow Function ::  standardWF
	     Parameter values:
		 learner.pars  -> ntree=700 
		 learner  -> randomForest 


$a6
Workflow Object:
	Workflow ID       ::  randomForest.v3 
	Workflow Function ::  standardWF
	     Parameter values:
		 learner.pars  -> ntree=700 
		 learner  -> randomForest 


$a7
Workflow Object:
	Workflow ID       ::  rpartXse.v2 
	Workflow Function ::  standardWF
	     Parameter values:
		 learner.pars  -> se=0.7 
		 learner  -> rpartXse 

The observations that form the testing set are shown below, as are the corresponding

In [101]:
rownames(algae.test)
  1. '221'
  2. '222'
  3. '223'
  4. '224'
  5. '225'
  6. '226'
  7. '227'
  8. '228'
  9. '229'
  10. '230'
  11. '231'
  12. '232'
  13. '233'
  14. '234'
  15. '235'
  16. '236'
  17. '237'
  18. '238'
  19. '239'
  20. '240'
  21. '241'
  22. '242'
  23. '243'
  24. '244'
  25. '245'
  26. '246'
  27. '247'
  28. '248'
  29. '249'
  30. '250'
  31. '251'
  32. '252'
  33. '253'
  34. '254'
  35. '255'
  36. '256'
  37. '257'
  38. '258'
  39. '259'
  40. '260'
  41. '261'
  42. '262'
  43. '263'
  44. '264'
  45. '265'
  46. '266'
  47. '267'
  48. '268'
  49. '269'
  50. '270'
  51. '271'
  52. '272'
  53. '273'
  54. '274'
  55. '275'
  56. '276'
  57. '277'
  58. '278'
  59. '279'
  60. '280'
  61. '281'
  62. '282'
  63. '283'
  64. '284'
  65. '285'
  66. '286'
  67. '287'
  68. '288'
  69. '289'
  70. '290'
  71. '291'
  72. '292'
  73. '293'
  74. '294'
  75. '295'
  76. '296'
  77. '297'
  78. '298'
  79. '299'
  80. '300'
  81. '301'
  82. '302'
  83. '303'
  84. '304'
  85. '305'
  86. '306'
  87. '307'
  88. '308'
  89. '309'
  90. '310'
  91. '311'
  92. '312'
  93. '313'
  94. '314'
  95. '315'
  96. '316'
  97. '317'
  98. '318'
  99. '319'
  100. '320'
  101. '321'
  102. '322'
  103. '323'
  104. '324'
  105. '325'
  106. '326'
  107. '327'
  108. '328'
  109. '329'
  110. '330'
  111. '331'
  112. '332'
  113. '333'
  114. '334'
  115. '335'
  116. '336'
  117. '337'
  118. '338'
  119. '339'
  120. '340'
In [102]:
test.observations <- array(dim=c(nrow(algae.test),7,2),dimnames=list(rownames(algae.test),paste("a",1:7),c("actual","predicted")))

The function runWorkflow() will compute the predicted values for each of the targets' best performers. We can then plot the predicted and actual values for each of the testing set targets.

In [103]:
for(j in 1:7){ 
    results <- runWorkflow(best.performers[[j]],
                          as.formula(paste(names(best.performers)[j],"~ .")),
                          algae.train[,c(1:11,11+j)],
                          algae.test[,c(1:11,11+j)])
    test.observations[,j,"actual"] <- results$trues
    test.observations[,j,"predicted"] <- results$preds
}
In [104]:
df.a1 <- as.data.frame(test.observations[,1,])
df.a2 <- as.data.frame(test.observations[,2,])
df.a3 <- as.data.frame(test.observations[,3,])
df.a4 <- as.data.frame(test.observations[,4,])
df.a5 <- as.data.frame(test.observations[,5,])
df.a6 <- as.data.frame(test.observations[,6,])
df.a7 <- as.data.frame(test.observations[,7,])
In [105]:
plot(df.a1,main="a1 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a2,main="a2 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a3,main="a3 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a4,main="a4 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a5,main="a5 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a6,main="a6 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a7,main="a7 - predicted vs. actual")
abline(0,1,col="red")

The models simply are not that great, but we already expected that. The average prediction for each target is shown below.

In [106]:
average.prediction <- apply(algae.train[,12:18],2, mean)
average.prediction
a1
17.4720183486239
a2
7.63669724770642
a3
4.13211009174312
a4
1.98302752293578
a5
4.96376146788991
a6
5.81192660550459
a7
2.50412844036697

Finally, you might be interested in the NMSE metrics for the predicted values and how they compare to the NMSE metrics on the training set. Is any of this surprising?

In [107]:
apply((test.observations[,,"actual"]-test.observations[,,"predicted"])^2,2,sum) / apply((scale(test.observations[,,"actual"],average.prediction,FALSE))^2,2,sum)
a 1
0.408202908875393
a 2
0.878285409214496
a 3
0.774637297894088
a 4
0.911320419406658
a 5
0.723939958895074
a 6
0.846544418623192
a 7
1

Back to top

EXERCISES:

  • once you complete the data cleaning step, run the analysis with a scaled dataset. Do the results change significantly?
  • once you complete the data cleaning step, run the analysis with a PCA-reduced dataset. Do the results change significantly?
  • why do you think we only see one svm model?
  • is there any value in using a multi-stage model (first predict absence or presence of algae, then predict the level)?