Machine Learning
Jessica Ayers 2023-07-18
Machine Learning Reflection
What method did you find most interesting?
I found the Random Forest method to be the most interesting. It is a method that has been referenced many times in my academic career, but one that I had never explicitly learned.
About Random Forests
The Random Forest method is a tree-based ensemble method aimed at prediction rather than inference. As with many tree-based methods, predictive accuracy is the goal; a disadvantage is that, unlike regression methods, the results are not easily interpreted. Each tree is grown on a bootstrapped sample of the data, and at each split only a randomly chosen subset of the predictors is considered. This is the extension over bagging: rather than every split considering all of the predictors, each split draws a fresh random subset, which decorrelates the trees and typically reduces the variance of the averaged predictions compared to bagging, as sketched below.
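To illustrate that difference, the minimal sketch below fits bagging (mtry set to all 13 predictors) and a random forest (a smaller mtry) directly with randomForest and compares their out-of-bag mean squared errors. The mtry value of 4 here is an arbitrary illustrative assumption, not the tuned choice found later.
library(randomForest)
library(mlbench)
data(BostonHousing)
set.seed(100)
# Bagging is the special case where every split considers all 13 predictors
bag <- randomForest(medv ~ ., data = BostonHousing, mtry = 13)
# A random forest considers a random subset at each split (4 is illustrative)
rf <- randomForest(medv ~ ., data = BostonHousing, mtry = 4)
# Out-of-bag MSE after the final tree, for a rough comparison
c(bagging = tail(bag$mse, 1), randomForest = tail(rf$mse, 1))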
Fit the model
The packages used to fit this model are the tidyverse, caret, and randomForest. The caret package supplies the train() function used below, and the mlbench package will be used to access the BostonHousing data set.
library(tidyverse)
library(caret)
library(randomForest)
library(mlbench)
Training and test data sets will be created from the BostonHousing data set from the mlbench package. The training set will have 75% of the data.
data(BostonHousing)  # load the data set from mlbench

set.seed(100)
# Randomly assign 75% of the rows to training; the rest form the test set
train <- sample(1:nrow(BostonHousing), size = floor(nrow(BostonHousing) * 0.75))
test <- setdiff(1:nrow(BostonHousing), train)
BHTrain <- BostonHousing[train, ]
BHTest <- BostonHousing[test, ]
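A quick check confirms the split: with 506 rows in total, 379 go to training and the remaining 127 to testing.
# Verify the 75/25 split (379 and 127 rows, each with 14 columns)
dim(BHTrain)
dim(BHTest)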
The response variable is medv, the median value of owner-occupied homes in thousands of dollars. Using a random forest model, we will tune mtry, the number of predictors randomly considered at each split, to find the value that predicts this variable most accurately. We first fit the model, centering and scaling the data (this has no real effect on tree-based models, but is harmless here) and using 5-fold cross-validation:
# Tune mtry over 1:13 with 5-fold cross-validation
rfFit <- train(medv ~ ., data = BHTrain,
               method = "rf",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = data.frame(mtry = 1:13))
rfFit
## Random Forest
##
## 379 samples
## 13 predictor
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 303, 303, 305, 302, 303
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 4.248261 0.8071827 2.882535
## 2 3.358793 0.8711819 2.335104
## 3 3.119926 0.8832145 2.231008
## 4 3.001733 0.8887147 2.171516
## 5 2.955044 0.8904121 2.147157
## 6 2.939190 0.8897421 2.141188
## 7 2.983622 0.8847750 2.164571
## 8 2.985764 0.8841954 2.169933
## 9 2.992300 0.8827661 2.180265
## 10 3.050389 0.8788192 2.189623
## 11 3.050577 0.8781899 2.188772
## 12 3.078351 0.8756573 2.200749
## 13 3.092172 0.8746336 2.217023
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
rfFit$bestTune
## mtry
## 6 6
The optimal number of predictors per split chosen by cross-validation was mtry = 6. We can then see how well this model predicts the median value of homes on the held-out test set:
rfPred <- predict(rfFit, newdata = BHTest)
postResample(rfPred, obs = BHTest$medv)
## RMSE Rsquared MAE
## 4.7211604 0.7957949 2.7635342
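Although random forests are not easily interpreted directly, caret offers some help: calling plot() on the fitted train object shows the cross-validated RMSE across the candidate mtry values, and varImp() ranks the predictors by importance. A brief sketch:
# Cross-validated RMSE as a function of mtry
plot(rfFit)

# Variable importance scores from the final random forest
varImp(rfFit)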