Heather Zurel 2022-09-08
The best day of my life was when my partner asked me to marry him. He was curious whether he overpaid for the diamond he bought me (he thinks he got a great deal!) - so let’s see.
I’m going to use the diamonds data set to train a regression model to predict the price of diamonds based on their physical properties, including size/carat, clarity, colour, etc.
The first step is some exploratory data analysis to see if we need to do anything to clean the data before training the model.
P.S. There are some additional analyses, plots, etc. in the diamonds.r file on GitHub that have not made it into this document.
# Setup
library('tidyverse')
library('GGally')
# Import
data(diamonds)
# Exploratory
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
ggpairs(diamonds)

So the x, y, and z values correspond to the physical dimensions of the diamonds in millimeters. These correlate strongly with the carat value, which is a measure of the weight of the diamond, so this makes sense and we really don't need to include all of them. Some ML models are sensitive to including multiple features that are strongly correlated with each other (multicollinearity), which can lead to overfitting, so we will remove the x, y, and z variables before we train the model.
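As a quick sanity check on that, here is the correlation matrix for the size-related columns (this check is not in diamonds.r; it just makes the ggpairs result explicit):
# Correlation between carat and the physical dimensions
diamonds %>%
  select(carat, x, y, z) %>%
  cor()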
Q-Q plots (or quantile-quantile plots) are a quick, visual way to check whether a single variable follows a given theoretical distribution; here I'm comparing against the normal distribution.
qq_diamonds <- qqnorm(diamonds$price, main = "Normal Q-Q Plot of Price")
qqline(diamonds$price)

# Meh
qq_log_diamonds <- qqnorm(log(diamonds$price), main = "Normal Q-Q Plot of log Price")
qqline(log(diamonds$price))

# Ooh this is a much better fit
hist_norm <- ggplot(diamonds, aes(log(price))) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = 'lightblue', bins = 50) +
  stat_function(fun = dnorm, args = list(mean = mean(log(diamonds$price)), sd = sd(log(diamonds$price))))
hist_norm

Based on the Q-Q plots and the histogram, it appears that the log of the price follows a bimodal or multimodal distribution. Let's try a couple more plots to see.
violin <- ggplot(diamonds, aes(x = color, y = log(price), fill = color))
violin + geom_violin() + facet_grid(clarity ~ cut)  # y is already log(price), so no extra log scale is needed

carat <- ggplot(data = diamonds, aes(x = carat, y = log(price), colour = color))
carat + stat_ecdf() + facet_grid(clarity ~ cut)

Yep, this definitely looks like a multimodal distribution, with peaks corresponding to the jump in price as the carat crosses a round number (0.99 to 1 carat, 1.99 to 2 carats, etc.), and smaller jumps around the half-carat marks.
One more thing I'm going to check is the variance of the numeric variables. If the variance of any variable differs from the others by an order of magnitude or more, we will standardize those variables; when one variable has a much larger variance than the others, it can be over-emphasized when training some models.
diamonds %>% summarise_if(is.numeric, list(mean = mean, var = var)) %>% t()
## [,1]
## carat_mean 7.979397e-01
## depth_mean 6.174940e+01
## table_mean 5.745718e+01
## price_mean 3.932800e+03
## x_mean 5.731157e+00
## y_mean 5.734526e+00
## z_mean 3.538734e+00
## carat_var 2.246867e-01
## depth_var 2.052404e+00
## table_var 4.992948e+00
## price_var 1.591563e+07
## x_var 1.258347e+00
## y_var 1.304472e+00
## z_var 4.980109e-01
The variance of carat is more than an order of magnitude smaller than that of table (and close to an order of magnitude smaller than that of depth), so we will go ahead and standardize table and depth. However, this should happen after the data set has been split into the training and testing data sets.
I think I have enough info for the next step…
We are going to remove some variables that are strongly correlated with each other, leaving a single variable (carat) that captures the information contained in the other three (x, y, and z).
We are going to convert the price to the log of the price. Since this data set is from 2017 and I am trying to predict the value of a diamond bought in 2021, we will also adjust for inflation (approximately 10.55%; for example, a diamond listed at $5,000 becomes 5,000 * 1.1055 = $5,527.50 before taking the log).
Another important consideration before training the models is how to handle the categorical data. Often, categorical variables are converted to "dummy variables" or one-hot encoded; this works when there is no natural ranking or order to the categories. Here, cut, clarity, and color all have a natural order. For example, a diamond with a "Good" cut is better than a diamond with a "Fair" cut. If you loaded this data from R (data(diamonds)), these variables will already be factors with the correct order. However, if you downloaded a CSV of this data set, they will need to be converted from strings to ordered factors, so I will include the transformation step here (even though it shouldn't change anything in my data set, although it appears that the 'color' variable is in the reverse order, so I'll fix that too). Note: the levels in this function are assigned from worst to best.
After investigating the table and depth fields some more: both are expressed as a percentage of the diamond's average diameter. The table % influences the light performance of the diamond (i.e. how sparkly it looks), and the depth % affects the diamond's brilliance and fire (I'm not 100% sure what this means, but we'll see if it affects price).
diamonds <- diamonds %>%
  mutate(price = price * 1.1055) %>%
  mutate(log_price = log(price)) %>%
  select(-price, -x, -y, -z) %>%
  mutate(cut = factor(cut, levels = c('Fair', 'Good', 'Very Good', 'Premium', 'Ideal'), ordered = TRUE),
         color = factor(color, levels = c('J', 'I', 'H', 'G', 'F', 'E', 'D'), ordered = TRUE),
         clarity = factor(clarity, levels = c('I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'), ordered = TRUE))
# This should show that these three variables are now ordered factors.
str(diamonds)
## tibble [53,940 × 7] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "J"<"I"<"H"<"G"<..: 6 6 6 2 1 1 2 3 6 3 ...
## $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ log_price: num [1:53940] 5.89 5.89 5.89 5.91 5.91 ...
# Here's a little trick to get R to output the order of all possible factor levels, instead of just the first few:
min(diamonds$cut)
## [1] Fair
## Levels: Fair < Good < Very Good < Premium < Ideal
min(diamonds$color)
## [1] J
## Levels: J < I < H < G < F < E < D
min(diamonds$clarity)
## [1] I1
## Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
Time to prepare the data for model training. We will split the data into the training and testing data sets.
Note: we standardize (scale) the data after splitting to avoid data leakage. If we scaled before splitting, the values in the testing data set would influence the mean and standard deviation used to scale the training data set, which can give a misleading picture of how the model performs on new data.
# Prep
library(caTools)
library(tictoc)
set.seed(42)
tic.clearlog()
split <- sample.split(diamonds$log_price, SplitRatio = 0.8)
diamonds_train <- subset(diamonds, split == TRUE)
diamonds_test <- subset(diamonds, split == FALSE)
diamonds_train <- diamonds_train %>%
  mutate_at(c('table', 'depth'), ~(scale(.) %>% as.vector))
diamonds_test <- diamonds_test %>%
  mutate_at(c('table', 'depth'), ~(scale(.) %>% as.vector))
glimpse(diamonds_test)
## Rows: 9,706
## Columns: 7
## $ carat <dbl> 0.30, 0.23, 0.30, 0.23, 0.23, 0.32, 0.32, 0.24, 0.30, 0.24, …
## $ cut <ord> Good, Very Good, Very Good, Very Good, Very Good, Good, Good…
## $ color <ord> I, H, J, F, E, H, H, F, I, E, H, F, E, H, G, G, H, G, F, G, …
## $ clarity <ord> SI2, VS1, VS2, VS1, VS1, SI2, SI2, SI1, SI1, VVS1, SI1, VVS2…
## $ depth <dbl> 1.0417048, -0.5231586, 0.2932918, -0.5911962, -1.5437217, 0.…
## $ table <dbl> -0.6422915, -0.1949679, -0.1949679, -0.1949679, 0.2523558, -…
## $ log_price <dbl> 5.961084, 5.966766, 5.978034, 5.978034, 6.096750, 6.099234, …
# And let's check out the standardized variables:
mean(diamonds_test$table)
## [1] 8.708704e-16
sd(diamonds_test$table)
## [1] 1
# The other one (depth) looks similar too; you can use the code in diamonds.r to check for yourself.
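As an aside, an even stricter way to avoid leakage is to compute the scaling parameters on the training split only and apply them to both splits. The sketch below shows what that could look like; it would replace the two mutate_at() calls above (it assumes the splits are still unscaled) and is not what diamonds.r actually does:
# Sketch: scale both splits using the training split's mean and sd
train_params <- diamonds_train %>%
  summarise(across(c(table, depth), list(mean = mean, sd = sd)))
scale_with_train <- function(df) {
  df %>%
    mutate(table = (table - train_params$table_mean) / train_params$table_sd,
           depth = (depth - train_params$depth_mean) / train_params$depth_sd)
}
diamonds_train <- scale_with_train(diamonds_train)
diamonds_test <- scale_with_train(diamonds_test)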
Now that the data set is split into the testing and training data sets, we will train a few different models, test their performance, and use the best one to make our prediction.
tic('mlm')
mlm <- lm(log_price ~ ., diamonds_train)
toc(log = TRUE, quiet = TRUE)
summary(mlm)
##
## Call:
## lm(formula = log_price ~ ., data = diamonds_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8529 -0.2258 0.0605 0.2531 1.5885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.042146 0.004654 1298.384 < 2e-16 ***
## carat 2.167970 0.003860 561.587 < 2e-16 ***
## cut.L 0.065157 0.007584 8.592 < 2e-16 ***
## cut.Q -0.009315 0.006074 -1.534 0.1251
## cut.C 0.030696 0.005215 5.886 3.98e-09 ***
## cut^4 0.006782 0.004174 1.625 0.1042
## color.L 0.510645 0.005811 87.878 < 2e-16 ***
## color.Q -0.159645 0.005279 -30.240 < 2e-16 ***
## color.C 0.004640 0.004939 0.940 0.3475
## color^4 0.039363 0.004538 8.674 < 2e-16 ***
## color^5 0.022161 0.004290 5.165 2.41e-07 ***
## color^6 0.004631 0.003903 1.187 0.2354
## clarity.L 0.768912 0.010191 75.449 < 2e-16 ***
## clarity.Q -0.366598 0.009537 -38.441 < 2e-16 ***
## clarity.C 0.216408 0.008156 26.534 < 2e-16 ***
## clarity^4 -0.063964 0.006507 -9.830 < 2e-16 ***
## clarity^5 0.052507 0.005299 9.910 < 2e-16 ***
## clarity^6 0.006066 0.004606 1.317 0.1879
## clarity^7 0.007096 0.004069 1.744 0.0812 .
## depth -0.001029 0.001921 -0.535 0.5923
## table 0.014223 0.002188 6.500 8.14e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3442 on 44213 degrees of freedom
## Multiple R-squared: 0.8884, Adjusted R-squared: 0.8884
## F-statistic: 1.76e+04 on 20 and 44213 DF, p-value: < 2.2e-16
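A quick aside on those cut.L, cut.Q, cut.C, clarity^4, etc. coefficients: because cut, color, and clarity are ordered factors, R encodes them with orthogonal polynomial contrasts (linear, quadratic, cubic, and so on) rather than one dummy column per level. You can see the encoding directly; this check is not part of the original analysis:
# Ordered factors get polynomial contrasts by default, which is where
# the .L / .Q / .C / ^4 terms in the lm() summary come from
contrasts(diamonds_train$cut)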
tic('poly')
poly <- lm(log_price ~ poly(carat,3) + color + cut + clarity + poly(table,3) + poly(depth,3), diamonds_train)
toc(log = TRUE, quiet = TRUE)
summary(poly)
##
## Call:
## lm(formula = log_price ~ poly(carat, 3) + color + cut + clarity +
## poly(table, 3) + poly(depth, 3), data = diamonds_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2508 -0.0859 -0.0011 0.0872 1.9011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.855146 0.001432 5486.562 < 2e-16 ***
## poly(carat, 3)1 224.223650 0.153651 1459.304 < 2e-16 ***
## poly(carat, 3)2 -65.007798 0.136049 -477.825 < 2e-16 ***
## poly(carat, 3)3 20.466149 0.135187 151.391 < 2e-16 ***
## color.L 0.441946 0.002259 195.623 < 2e-16 ***
## color.Q -0.086252 0.002053 -42.007 < 2e-16 ***
## color.C 0.009905 0.001916 5.169 2.36e-07 ***
## color^4 0.011069 0.001761 6.284 3.32e-10 ***
## color^5 0.008375 0.001664 5.033 4.86e-07 ***
## color^6 -0.001586 0.001514 -1.047 0.294889
## cut.L 0.091883 0.003636 25.273 < 2e-16 ***
## cut.Q -0.009912 0.002696 -3.676 0.000237 ***
## cut.C 0.012412 0.002088 5.944 2.80e-09 ***
## cut^4 -0.002513 0.001625 -1.546 0.122172
## clarity.L 0.887564 0.003990 222.462 < 2e-16 ***
## clarity.Q -0.244990 0.003728 -65.708 < 2e-16 ***
## clarity.C 0.141636 0.003181 44.523 < 2e-16 ***
## clarity^4 -0.062813 0.002532 -24.807 < 2e-16 ***
## clarity^5 0.029161 0.002058 14.167 < 2e-16 ***
## clarity^6 -0.003592 0.001787 -2.010 0.044466 *
## clarity^7 0.029769 0.001579 18.854 < 2e-16 ***
## poly(table, 3)1 -0.691116 0.179675 -3.846 0.000120 ***
## poly(table, 3)2 -0.603061 0.140336 -4.297 1.73e-05 ***
## poly(table, 3)3 0.591239 0.136248 4.339 1.43e-05 ***
## poly(depth, 3)1 -1.288272 0.162940 -7.906 2.71e-15 ***
## poly(depth, 3)2 -1.105807 0.168139 -6.577 4.86e-11 ***
## poly(depth, 3)3 -0.057543 0.134870 -0.427 0.669631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1335 on 44207 degrees of freedom
## Multiple R-squared: 0.9832, Adjusted R-squared: 0.9832
## F-statistic: 9.96e+04 on 26 and 44207 DF, p-value: < 2.2e-16
Wow! The polynomial regression appears to fit a lot better!
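If you want a single number to back that up on the training data, AIC is one quick way to compare the two fits; this comparison is not in the original write-up:
# Compare the two linear models on the training data (lower AIC is better)
AIC(mlm, poly)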
Support vector regression (SVR) does not depend on the distributions of the underlying dependent and independent variables, and it can fit a non-linear model with the kernel = 'radial' option. I think a non-linear model is worth trying here because the plain linear model is performing the worst so far.
tic('svr')
library(e1071)
svr <- svm(formula = log_price ~ .,
           data = diamonds_train,
           type = 'eps-regression',
           kernel = 'radial')
toc(log = TRUE, quiet = TRUE)
Decision trees use a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data. Training decision tree models can sometimes produce over-complex trees that do not generalize well to new data; this is called overfitting.
tic('tree')
library(rpart)
tree <- rpart(formula = log_price ~ .,
              data = diamonds_train,
              method = 'anova',
              model = TRUE)
toc(log = TRUE, quiet = TRUE)
tree
## n= 44234
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 44234 46934.3000 7.922656
## 2) carat< 0.695 20143 4323.3280 6.963920
## 4) carat< 0.455 13848 1212.0970 6.718800 *
## 5) carat>=0.455 6295 448.8476 7.503143 *
## 3) carat>=0.695 24091 8615.3050 8.724275
## 6) carat< 0.995 7837 612.7115 8.100690 *
## 7) carat>=0.995 16254 3485.7290 9.024943
## 14) carat< 1.385 10379 1099.8590 8.772585
## 28) clarity=I1,SI2,SI1 5582 243.9614 8.574533 *
## 29) clarity=VS2,VS1,VVS2,VVS1,IF 4797 382.1680 9.003046 *
## 15) carat>=1.385 5875 557.1727 9.470768 *
The random forest takes the decision tree model a step further: it trains many decision trees and combines their predictions, which makes better predictions than any of the single decision trees in the model could make on its own.
tic('rf')
library(randomForest)
rf <- randomForest(log_price ~ .,
                   data = diamonds_train,
                   ntree = 500,
                   importance = TRUE)
toc(log = TRUE, quiet = TRUE)
rf
##
## Call:
## randomForest(formula = log_price ~ ., data = diamonds_train, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.01159442
## % Var explained: 98.91
XGBoost uses gradient-boosted decision trees and is a really robust model that performs very well on a broad variety of applications. It is also not as sensitive to issues that affect the performance of some other models, such as multicollinearity or a lack of data normalization/standardization.
tic('xgb')
library(xgboost)
diamonds_train_xgb <- diamonds_train %>%
  mutate_if(is.factor, as.numeric)
diamonds_test_xgb <- diamonds_test %>%
  mutate_if(is.factor, as.numeric)
xgb <- xgboost(data = as.matrix(diamonds_train_xgb[-7]), label = diamonds_train_xgb$log_price, nrounds = 6166, verbose = 0)
# the rmse stopped decreasing after 6166 rounds
toc(log = TRUE, quiet = TRUE)
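Rather than hard-coding the number of rounds, it could also be chosen with cross-validation and early stopping. A rough sketch of how that might look (the parameter values here are illustrative, not tuned):
# Sketch: let cross-validation pick the number of boosting rounds via early stopping
xgb_cv <- xgb.cv(data = as.matrix(diamonds_train_xgb[-7]),
                 label = diamonds_train_xgb$log_price,
                 nrounds = 10000,             # upper bound
                 nfold = 5,
                 early_stopping_rounds = 50,  # stop once the CV RMSE stops improving
                 verbose = 0)
xgb_cv$best_iteration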
Now we will predict the log of the price of diamonds in the test data set using each model we trained to determine which model performs best on data it has not seen.
# Make predictions and compare model performance
tic('predict_all')
mlm_pred <- predict(mlm, diamonds_test)
poly_pred <- predict(poly, diamonds_test)
svr_pred <- predict(svr, diamonds_test)
tree_pred <- predict(tree, diamonds_test)
rf_pred <- predict(rf, diamonds_test)
xgb_pred <- predict(xgb, as.matrix(diamonds_test_xgb[-7]))
toc(log = TRUE, quiet = TRUE)
# Calculate residuals (i.e. how different the predictions are from the log_price of the test data set)
xgb_resid <- diamonds_test_xgb$log_price - xgb_pred
library(modelr)
resid <- diamonds_test %>%
  spread_residuals(mlm, poly, svr, tree, rf) %>%
  select(mlm, poly, svr, tree, rf) %>%
  rename_with(~ paste0(.x, '_resid')) %>%
  cbind(xgb_resid)
predictions <- diamonds_test %>%
  select(log_price) %>%
  cbind(mlm_pred) %>%
  cbind(poly_pred) %>%
  cbind(svr_pred) %>%
  cbind(tree_pred) %>%
  cbind(rf_pred) %>%
  cbind(xgb_pred) %>%
  cbind(resid) # This will be useful for plotting later
# Calculate R-squared - this describes how much of the variability is explained by the model - the closer to 1, the better
mean_log_price <- mean(diamonds_test$log_price)
tss <- sum((diamonds_test_xgb$log_price - mean_log_price)^2)
square <- function(x) {x^2}
r2 <- function(x) {1 - x/tss}
r2_df <- resid %>%
  mutate_all(square) %>%
  summarize_all(sum) %>%
  mutate_all(r2) %>%
  gather(key = 'model', value = 'r2') %>%
  mutate(model = str_replace(model, '_resid', ''))
r2_df
## model r2
## 1 mlm 0.8842696
## 2 poly 0.9803430
## 3 svr 0.9847216
## 4 tree 0.9114766
## 5 rf 0.9870050
## 6 xgb 0.9842275
The Random Forest model performed best according to the R-squared value. R-squared measures how much of the variability in the test data set is explained by the model; it is calculated as 1 minus the ratio of the residual sum of squares to the total sum of squares, where the residuals describe how different the predictions are from the true values in the test data set. So we will mostly focus on this model for the visualizations.
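To make that explicit, here is the R-squared arithmetic written out for the random forest, using the resid and tss objects defined above (this check is not in diamonds.r):
# R-squared for the random forest, written out as 1 - RSS/TSS
rf_rss <- sum(resid$rf_resid^2)
1 - rf_rss / tss               # should match the rf row of r2_df
sqrt(mean(resid$rf_resid^2))   # the RMSE itself, for comparison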
library(ggplot2)
r2_plot <- ggplot(r2_df, aes(x = model, y = r2, colour = model, fill = model)) + geom_bar(stat = 'identity')
r2_plot + ggtitle('R-squared Values for each Model') + coord_cartesian(ylim = c(0.75, 1))

sample <- predictions %>%
  slice_sample(n = 1000)
ggplot(sample, aes(x = exp(log_price), y = exp(rf_pred), size = abs(rf_resid))) +
  geom_point(alpha = 0.1) +
  labs(title = 'Predicted vs Actual Cost of Diamonds in USD', x = 'Price', y = 'Predicted Price', size = 'Residuals')

The Random Forest model is performing the best out of all the models we tried. This isn't surprising, as it is an ensemble method, which means it uses the agreement between multiple models to make better predictions than any of the individual models could make on their own. XGBoost is another example of an ensemble method, and it also performed very well. In the second plot, the point size is proportional to the absolute value of the residual, i.e. how different the prediction was from the real value.
Which variable(s) were the most important for predicting the price of the diamonds?
varImpPlot(rf)

The values on the x axis indicate how much the prediction error would increase if that variable's information were removed from the model (by permuting its values). As expected, the carat (or size) is the most important variable. Even though table and depth are supposed to impact how sparkly the diamond appears, they don't actually have that big an effect on the price.
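If you would rather read the numbers than the plot, randomForest exposes them directly (this call is not in the original document, but it is a standard part of the package):
# Numeric variable importance: %IncMSE (permutation importance) and IncNodePurity
importance(rf)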
# Training & predicting times
time_log <- tic.log(format = TRUE)
time_log
## [[1]]
## [1] "mlm: 0.057 sec elapsed"
##
## [[2]]
## [1] "poly: 0.154 sec elapsed"
##
## [[3]]
## [1] "svr: 194.602 sec elapsed"
##
## [[4]]
## [1] "tree: 0.248 sec elapsed"
##
## [[5]]
## [1] "rf: 407.62 sec elapsed"
##
## [[6]]
## [1] "xgb: 62.249 sec elapsed"
##
## [[7]]
## [1] "predict_all: 8.219 sec elapsed"
So the best-performing models take the longest to train. That's not a huge surprise, since they are ensembles that actually include many models. Keep in mind that this is not a large data set; most models I have trained for work take hours and hours (I mostly set them up to run overnight). Training can be sped up with a really powerful machine in AWS, or even a cluster of machines.
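For example, the random forest is embarrassingly parallel, so on a multi-core machine you could grow several smaller forests in parallel and combine them. This is only a sketch (the core count and tree split are made up, and combine() does not recompute the out-of-bag error for the merged forest), not something from diamonds.r:
# Sketch: grow 4 forests of 125 trees each in parallel, then combine them
library(doParallel)
library(foreach)
cl <- makeCluster(4)
registerDoParallel(cl)
rf_par <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
                  .packages = 'randomForest') %dopar% {
  randomForest(log_price ~ ., data = diamonds_train, ntree = ntree)
}
stopCluster(cl)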