Jessica Ayers 2023-07-18

Machine Learning Reflection

What method did you find most interesting?

I found the Random Forest method to be the most interesting. It is a method that has been referenced many times in my academic career, but one that I had never explicitly learned.

About Random Forests

The Random Forest method is a tree-based ensemble method aimed at prediction rather than inference. Like many tree-based methods, prediction accuracy is the goal; a disadvantage is that, unlike regression methods, the results are not easily interpreted. Each tree is grown on a bootstrapped sample of the data, and, as an extension of bagging, only a random subset of the predictors is considered at each split. The subset changes from split to split, though its size (mtry) stays fixed. Because this decorrelates the trees, averaging them reduces variance more than bagging does, improving the overall fit of the model.
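
To make the two moving parts concrete, here is a minimal sketch using the built-in mtcars data (the data set and the mtry and ntree values are arbitrary choices for illustration, not part of the analysis below): each tree is grown on a bootstrapped sample, and only a few randomly chosen predictors are considered at each split.

library(randomForest)

set.seed(1)
rf_demo <- randomForest(mpg ~ ., data = mtcars,
                        ntree = 500,  # number of bootstrapped trees
                        mtry = 3)     # predictors tried at each split
rf_demo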

Fit the model

The packages used to fit this model are the tidyverse, caret, and randomForest (caret's train() interface calls randomForest under the hood). The mlbench package will be used to access the BostonHousing data set.

library(tidyverse)
library(caret)         # train(), trainControl(), postResample()
library(randomForest)  # fitting engine for method = "rf"
library(mlbench)       # BostonHousing data set

Training and test data sets will be created from the BostonHousing data set in the mlbench package, with the training set containing 75% of the data.

set.seed(100)
data(BostonHousing)  # make the data set from mlbench available
# Randomly select 75% of the row indices for training; the rest form the test set
trainIndex <- sample(1:nrow(BostonHousing), size = nrow(BostonHousing) * 0.75)
testIndex  <- setdiff(1:nrow(BostonHousing), trainIndex)
BHTrain <- BostonHousing[trainIndex, ]
BHTest  <- BostonHousing[testIndex, ]

The response variable is medv, the median value of owner-occupied homes in thousands of dollars. Using a random forest model, the number of predictors considered at each split (mtry) will be tuned to best predict this variable of interest. We can first fit the model, centering and scaling the predictors and using 5-fold cross-validation:

rfFit <- train(medv ~ ., data = BHTrain,
               method = "rf",                                        # randomForest engine
               preProcess = c("center", "scale"),                    # center and scale predictors
               trControl = trainControl(method = "cv", number = 5),  # 5-fold cross-validation
               tuneGrid = data.frame(mtry = 1:13))                   # try every mtry from 1 to all 13
rfFit
## Random Forest 
## 
## 379 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 303, 303, 305, 302, 303 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    1    4.248261  0.8071827  2.882535
##    2    3.358793  0.8711819  2.335104
##    3    3.119926  0.8832145  2.231008
##    4    3.001733  0.8887147  2.171516
##    5    2.955044  0.8904121  2.147157
##    6    2.939190  0.8897421  2.141188
##    7    2.983622  0.8847750  2.164571
##    8    2.985764  0.8841954  2.169933
##    9    2.992300  0.8827661  2.180265
##   10    3.050389  0.8788192  2.189623
##   11    3.050577  0.8781899  2.188772
##   12    3.078351  0.8756573  2.200749
##   13    3.092172  0.8746336  2.217023
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
rfFit$bestTune
##   mtry
## 6    6

The optimal number of predictors per split chosen by cross-validation was 6. We can then see how well this model predicts the median value of homes on the held-out test set:

rfPred <- predict(rfFit, newdata = BHTest)  # predict on the held-out 25%
postResample(rfPred, obs = BHTest$medv)     # test-set RMSE, R-squared, and MAE
##      RMSE  Rsquared       MAE 
## 4.7211604 0.7957949 2.7635342
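
As a quick check on the earlier claim that a random forest improves on bagging, one could refit the same caret model with mtry fixed at all 13 predictors, which reduces the random forest to bagging, and compare the test-set metrics. This is only a sketch; the exact numbers will depend on the seed and folds.

# Bagging = a random forest where every predictor is tried at every split
bagFit <- train(medv ~ ., data = BHTrain,
                method = "rf",
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(mtry = 13))  # mtry = all 13 predictors
bagPred <- predict(bagFit, newdata = BHTest)
postResample(bagPred, obs = BHTest$medv)

The cross-validation table above already points the same way: mtry = 13 showed a higher RMSE (3.09) than the selected mtry = 6 (2.94).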
