In a football game, the coach must decide what play to run on offense. These play calls can be broken down into a pass, a run, or a kick, where a kick is either a field goal attempt or a punt. Many factors influence the decision: field position, down and distance, time left on the clock, and the score. The purpose of this analysis is to develop a model that predicts what type of play a coach will call in a given situation. To accomplish this, I looked at the University of Utah's football team from the 2019 season under head coach Kyle Whittingham.
This first chunk loads the necessary R packages.
library(cfbscrapR)
library(neuralnet)
library(rpart)
library(rpart.plot)
library(randomForest)
library(tidymodels)
library(tidyverse)
The play-by-play data was scraped from College Football Data using cfbscrapR. The commented code collects information on all of the University of Utah's football games from the 2019 season. It then collects play-by-play information for every game against an FBS opponent and combines everything into one data frame, which was saved to an RData file for easy access.
team = "Utah"
#
# fbsTeams <- cfbscrapR::cfb_team()
# games <- cfbscrapR::cfb_game_info(2019,team = team)
# games <- games %>%
# filter(home_team %in% fbsTeams$school & away_team %in% fbsTeams$school) # Filter for games with two FBS Teams
#
# pbp_data <- data.frame()
# for (i in games$week) {
# Sys.sleep(1)
# hold <- cfbscrapR::cfb_pbp_data(year = 2019,team = team, week = i) %>%
# as.data.frame()
# pbp_data <- rbind(pbp_data,hold)
# }
load("Data/pbp_data.RData")
The play-by-play data includes a lot of information, such as the offensive and defensive teams, each team's score, the time on the clock, and more. To prepare the data, only plays where Utah was on offense were kept. Several new variables were derived to aid the analysis: score_diff is the score differential when the play is called, where positive indicates a Utah lead; sec_remain is the number of seconds remaining in the half; and under_two is a logical indicating whether there are under two minutes remaining in the half. The play-by-play information only gives the result of the play, so a new variable, play_call, was created to classify each call as a rush, pass, punt, or kick (field goal attempt). Other situations, such as timeouts, penalties, and kickoffs, were filtered out. Finally, only information available to the coach before the snap was kept for the analysis.
offense_pbp <- pbp_data %>%
filter(offense_play == team) %>%
mutate(score_diff = offense_score - defense_score,
sec_remain = clock.minutes * 60 + clock.seconds + ifelse(period %in% c(1,3), 15 * 60, 0), # add the full remaining quarter in Q1/Q3
under_two = sec_remain <= 120,
down = as.factor(down),
play_call = case_when(
play_type == "Kickoff" ~ "Filter",
play_type == "Rush" ~ "Rush",
play_type == "Pass Incompletion" ~ "Pass",
play_type == "Pass Reception" ~ "Pass",
play_type == "Field Goal Good" ~ "Kick",
play_type == "Field Goal Missed" ~ "Kick",
play_type == "Penalty" ~ "Filter",
play_type == "Timeout" ~ "Filter",
play_type == "Punt" ~ "Punt",
play_type == "Rushing Touchdown" ~ "Rush",
play_type == "Passing Touchdown" ~ "Pass",
play_type == "Fumble Recovery (Own)" ~ "Rush",
play_type == "Fumble Recovery (Opponent)" ~ "Rush",
play_type == "Blocked Field Goal" ~ "Kick",
play_type == "Sack" ~ "Pass",
play_type == "Safety" ~ "Rush",
play_type == "Pass Interception Return" ~ "Pass"
)) %>%
filter(play_call != "Filter" & !is.na(play_call)) %>%
mutate(play_call = fct_relevel(play_call,c("Kick","Punt","Pass","Rush"))) %>%
select(offense_play,defense_play,score_diff,yards_to_goal,down,distance,sec_remain,under_two,play_call)
Several plots allow for some early exploration of how the variables relate to one another and how they might influence the play call.
offense_pbp %>%
mutate(under_two = ifelse(under_two,"<2 Minutes Remaining",">2 Minutes Remaining"),
under_two = fct_relevel(under_two,">2 Minutes Remaining","<2 Minutes Remaining")) %>%
ggplot(aes(x = yards_to_goal, fill = play_call)) +
geom_density(alpha = .6) +
scale_fill_manual(values = c("#000000","#888888","blue","#CC0000")) +
scale_x_reverse(limits = c(100,0)) +
facet_wrap(vars(under_two)) +
labs(
y = "Density",
x = "Yards to Goal",
fill = NULL,
title = "Play Type by Yards to Goal",
subtitle = "Utah 2019 Football Season",
caption = "Source: College Football Data"
) +
theme_minimal()
This first plot shows the density distribution of play calls over different field positions, separated by the amount of time remaining in the half. Unsurprisingly, Utah attempted more field goals closer to the end zone (where they were more likely to make them) and elected to punt from further out. Whittingham did not punt often enough at the end of halves to generate a density plot. Over the whole game, he was slightly more likely to call a run play close to the end zone. During the final two minutes, Utah appears more likely to run on their own side of the field or close to the end zone and more likely to throw in the opponent's half, although the score of the game may help explain that difference.
- At the end of halves, Utah was more likely to run in their own territory and throw on the opponent's side.
- There is a clear split around the 32-yard line between where Utah will punt and where it will kick.
offense_pbp %>%
filter(under_two & play_call %in% c("Pass","Rush")) %>%
mutate(losing = score_diff <= 0,
close_game = score_diff <= 7,
long_distance = ifelse(yards_to_goal>50, "Own Half", "Opponent's Half")) %>%
group_by(losing,close_game,long_distance) %>%
summarize(pass_perc = round(mean(play_call == "Pass"),5),
n_plays = n()) %>%
ungroup() %>%
arrange(long_distance) %>%
mutate(game_state = rep(c("Big Lead","One Score Lead / Tied","Losing"),2),
pass_perc = scales::percent(pass_perc,accuracy = 0.1),
long_distance = c("Opponent's Half","","","Own Half","","")) %>%
select(long_distance,game_state,pass_perc,n_plays) %>%
rename(`Game State` = game_state,
`Field Position` = long_distance,
`Passing Percentage in Final Two Minutes of the Half` = pass_perc,
`Number of Plays` = n_plays)
| Field Position | Game State | Passing Percentage in Final Two Minutes of the Half | Number of Plays |
|---|---|---|---|
| Opponent's Half | Big Lead | 27.3% | 22 |
| | One Score Lead / Tied | 50.0% | 8 |
| | Losing | 52.6% | 19 |
| Own Half | Big Lead | 16.7% | 18 |
| | One Score Lead / Tied | 28.6% | 7 |
| | Losing | 38.9% | 18 |
To explore that relationship, this table shows the passing percentage in the final two minutes of each half at different field positions and game states. A big lead is defined as 8 or more points. Utah was more likely to call a passing play in close games or when trailing and more likely to run with a big lead in order to kill the clock and end the game.
- More likely to call a passing play in close games or when trailing, especially on the opponent's side of the field.
- Much more likely to run with a two-possession lead to kill the clock.
offense_pbp %>%
ggplot(aes(x = yards_to_goal, fill = play_call)) +
geom_histogram(alpha = .8,position = "stack",bins = 20) + #second plot uses position = "fill"
facet_wrap(vars(down),labeller = "label_both",scales = "free_y") +
scale_x_reverse(limits = c(100,0)) +
scale_fill_manual(values = c("#000000","#888888","blue","#CC0000")) +
labs(
y = "Frequency",
x = "Yards to Goal",
fill = NULL,
title = "Play Type by Down and Distance to Goal",
subtitle = "Utah 2019 Football Season",
caption = "Source: College Football Data"
) +
theme_minimal()
These histograms further explore the relationship between play call and distance to goal but break the data down by down. The second histogram shows the same data as proportions to more clearly show the play breakdown in different game states. The play calling on first and second down looks roughly similar, with more rushes overall, especially near both ends of the field. Third down shows more passing calls, especially in bad field position (more than 75 yards to the goal), although the frequency histogram shows that situation is uncommon. As expected, almost all kicks and punts occur on fourth down, with field goal attempts happening within 30 yards of the end zone, which equates to roughly a 47-yard attempt or shorter. The histograms also clearly show Utah's four-down territory in the 30-to-40-yard range, where they run or pass instead of using either kicking option.
- Utah is much more likely to run on early downs, especially close to the end zone.
- 3rd down is when Utah starts to pass more often.
- There is a clear split on 4th down, where Utah is in four-down territory and very likely to go for it.
offense_pbp %>%
filter(distance<50) %>%
ggplot(aes(x = distance, fill = play_call)) +
geom_histogram(alpha = .8,position = "stack") + #second plot uses position = "fill"
facet_wrap(vars(down),labeller = "label_both",scales = "free_y") +
scale_x_reverse() +
scale_fill_manual(values = c("#000000","#888888","blue","#CC0000")) +
labs(
y = "Frequency",
x = "Yards to First Down",
fill = NULL,
title = "Play Type by Down and Distance to First Down",
subtitle = "Utah 2019 Football Season",
caption = "Source: College Football Data"
) +
theme_minimal()
These histograms are set up like the previous pair but explore the distance to a first down instead of the distance to the end zone. Almost all first downs start with 10 yards to go, and Utah rushes about 65% of the time in that situation. The proportion graph for first down is misleading because of the small sample of first downs with a distance other than 10 yards; it makes it appear as if Utah almost always runs on first down. The other downs are spread over more distances, concentrated between 1 and 10 yards. There are still a lot of 2nd-and-10 situations, the result of rushes for no gain or incomplete passes. Interestingly, this does not seem to impact the play call relative to first down, as Utah still rushes about 65% of the time on 2nd and 10. As the distance moves away from 10, though, the proportion of play calls changes: at longer distances, Utah is more likely to pass to make up ground. This effect is even more pronounced on 3rd down; for example, Utah calls a pass play around 80% of the time on 3rd and 10. As expected, 4th down is where field goal attempts and punts primarily occur, although the distance to a first down doesn't seem to determine which (distance to the end zone is far more indicative). Utah is actually fairly likely to go for it on 4th and 1, and, when they do, they rely on the running game to convert, calling a rush on about 70% of 4th-and-1 plays. (A quick numerical check of these rates follows the takeaways below.)
- There is a clear relationship on 3rd down between the distance to a first down and the likelihood of passing the football.
- Utah is very likely to go for it in 4th-and-short situations.
- In most short-yardage situations, Utah tends to trust their running game to convert the first down.
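The rush and pass rates quoted above come from reading the proportion plots; a quick summary like the following could confirm them directly (an illustrative check, not part of the original analysis):
# Rush rates for 1st/2nd/3rd and 10, and for 4th and 1 when Utah goes for it
offense_pbp %>%
  filter(play_call %in% c("Pass","Rush"),
         (down %in% c(1, 2, 3) & distance == 10) | (down == 4 & distance == 1)) %>%
  group_by(down, distance) %>%
  summarize(rush_rate = scales::percent(mean(play_call == "Rush"), accuracy = 1),
            n_plays = n())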
This analysis compares several classification models that attempt to predict the play call for a given game state, for use in a larger game simulation. The data was split into a training and a testing set to train the models and evaluate their performance. The split was stratified by play call to ensure that both sets have the proper proportion of each call, which is especially important given the small number of kicking and punting plays.
model_data <- offense_pbp %>%
select(-c(offense_play,defense_play,sec_remain))
# Set the seed for reproducibility
set.seed(3012)
# Split into training and testing data
split_data <- initial_split(model_data, prop = 0.6, strata = play_call)
training_data <- training(split_data)
testing_data <- testing(split_data)
The Down and Distance model is a simple multiclass logistic regression that only takes the down and distance to a first down into consideration for the play call.
#Prepare the recipe for the Down and Distance Model
recipe_DD_only <- training_data %>%
recipe(play_call ~ down + distance) %>%
step_dummy(down) %>%
step_center(distance) %>%
step_scale(distance) %>%
prep()
training_baked_DD <- juice(recipe_DD_only)
testing_baked_DD <- recipe_DD_only %>%
bake(testing_data)
# Run the model with our training data
logistic_glm_DD <-
multinom_reg(mode = "classification") %>%
set_engine("nnet") %>%
fit(play_call ~ ., data = training_baked_DD)
# Predict Class
predictions_class_DD <- logistic_glm_DD %>%
predict(new_data = testing_baked_DD) %>%
bind_cols(testing_baked_DD %>% select(play_call))
# Prediction Probabilities
predictions_DD <- logistic_glm_DD %>%
predict(testing_baked_DD, type = "prob") %>%
bind_cols(predictions_class_DD)
The Kitchen Sink model is also a multiclass logistic regression, but it considers all available variables: down, distance, yards_to_goal, score_diff, under_two.
#Prepare the recipe for the rest of the models
recipe_KS_only <- training_data %>%
recipe(play_call ~ .) %>%
step_dummy(down) %>%
step_mutate(under_two = as.integer(under_two)) %>%
step_center(score_diff,yards_to_goal,distance) %>%
step_scale(score_diff,yards_to_goal,distance) %>%
prep()
training_baked_KS <- juice(recipe_KS_only)
testing_baked_KS <- recipe_KS_only %>%
bake(testing_data)
# Run the model with our training data
logistic_glm_KS <-
multinom_reg(mode = "classification") %>%
set_engine("nnet") %>%
fit(play_call ~ ., data = training_baked_KS)
# Predict Class
predictions_class_KS <- logistic_glm_KS %>%
predict(new_data = testing_baked_KS) %>%
bind_cols(testing_baked_KS %>% select(play_call))
# Prediction Probabilities
predictions_KS <- logistic_glm_KS %>%
predict(testing_baked_KS, type = "prob") %>%
bind_cols(predictions_class_KS)
The remaining models also consider all available variables. The Neural Network model fits a neural network to classify the play call. A single hidden layer with 3 nodes performed best among the handful of network layouts tried, based primarily on the log loss discussed later (a sketch of that comparison follows the prediction code below).
# Run the model with our training data
nn <-
neuralnet(play_call ~ ., data = training_baked_KS, hidden = 3,act.fct = "logistic",
linear.output = FALSE)
# Predict class and prediction probabilities
predictions_NN <- nn %>%
predict(newdata = testing_baked_KS) %>%
as_tibble() %>%
mutate(.pred_class = pmax(V1,V2,V3,V4)) %>%
rename(.pred_Kick = V1,
.pred_Pass = V2,
.pred_Punt = V3,
.pred_Rush = V4) %>%
mutate(.pred_class = case_when(
.pred_class == .pred_Kick ~ "Kick",
.pred_class == .pred_Pass ~ "Pass",
.pred_class == .pred_Punt ~ "Punt",
.pred_class == .pred_Rush ~ "Rush"
),.pred_class = factor(.pred_class,levels = c("Kick","Punt","Pass","Rush"))) %>%
bind_cols(testing_baked_KS %>% select(play_call))
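For reference, the layout comparison mentioned above could be sketched as follows. The candidate layouts shown here are an assumption (the original set was not listed), and the output column order is assumed to match the renaming used above.
# Hypothetical sketch of the hidden-layer comparison by log loss
nn_log_loss <- function(probs, truth) {
  # manual multiclass log loss; columns assumed ordered Kick, Pass, Punt, Rush
  idx <- cbind(seq_along(truth), match(as.character(truth), c("Kick","Pass","Punt","Rush")))
  -mean(log(pmax(probs[idx], 1e-15)))
}
for (h in list(2, 3, 5, c(4, 2))) {
  set.seed(3012)
  fit <- neuralnet(play_call ~ ., data = training_baked_KS, hidden = h,
                   act.fct = "logistic", linear.output = FALSE)
  print(nn_log_loss(predict(fit, newdata = testing_baked_KS), testing_baked_KS$play_call))
}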
The Classification Tree model, unsurprisingly, uses a classification tree to predict the play call. Pre-pruning the tree to a depth of 6 yielded the best accuracy by avoiding overfitting the training data (a sketch of that depth sweep follows the prediction code below).
# Run the model with our training data
logistic_CT <- rpart(play_call ~ .,data = training_baked_KS, method = "class", control = rpart.control(cp = 0,maxdepth = 6))
# Predict class and prediction probabilities
predictions_CT <- logistic_CT %>%
predict(newdata = testing_baked_KS) %>%
as_tibble() %>%
mutate(.pred_class = pmax(Kick,Punt,Pass,Rush)) %>%
rename(.pred_Kick = Kick,
.pred_Pass = Pass,
.pred_Punt = Punt,
.pred_Rush = Rush) %>%
mutate(.pred_class = case_when(
.pred_class == .pred_Kick ~ "Kick",
.pred_class == .pred_Pass ~ "Pass",
.pred_class == .pred_Punt ~ "Punt",
.pred_class == .pred_Rush ~ "Rush"
),.pred_class = factor(.pred_class,levels = c("Kick","Punt","Pass","Rush"))) %>%
bind_cols(testing_baked_KS %>% select(play_call))
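# The depth of 6 above came from comparing pre-pruned trees; this sweep is a
# hypothetical sketch of that comparison (the exact depths tried are assumed)
sapply(2:10, function(d) {
  fit <- rpart(play_call ~ ., data = training_baked_KS, method = "class",
               control = rpart.control(cp = 0, maxdepth = d))
  # test-set accuracy at each maximum depth
  mean(predict(fit, newdata = testing_baked_KS, type = "class") == testing_baked_KS$play_call)
})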
# Plot the classification tree
rpart.plot(logistic_CT, type = 3, fallen.leaves = TRUE)
One advantage of this model is an interpretable decision tree. The tree first splits on whether it is fourth down. If it is, it predicts a punt or a field goal attempt based on distance to the end zone. If it is not fourth down, the model predicts a rush or pass depending on the other variables. It is important to remember that the numeric variables were centered and scaled, so the split values on the tree (other than the downs) do not directly correspond to true values.
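To read a split point in original units, the centering and scaling statistics stored in the prepped recipe can be recovered with tidy(). A minimal sketch, assuming step_center and step_scale are the third and fourth steps of recipe_KS_only and using a made-up split value:
# Recover the means and standard deviations stored by the recipe
centers <- tidy(recipe_KS_only, number = 3) # step_center means
scales_ <- tidy(recipe_KS_only, number = 4) # step_scale standard deviations
unscale <- function(x, var) {
  x * scales_$value[scales_$terms == var] + centers$value[centers$terms == var]
}
unscale(-0.93, "yards_to_goal") # e.g., a hypothetical split at -0.93 in scaled units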
Finally, a random forest approach was also considered.
# Run the model with our training data
logistic_RF <- randomForest(play_call ~ .,data = training_baked_KS)
# Predict class and prediction probabilities
predictions_RF <- logistic_RF %>%
predict(newdata = testing_baked_KS,type = "prob") %>%
as_tibble() %>%
mutate(.pred_class = pmax(Kick,Punt,Pass,Rush)) %>%
rename(.pred_Kick = Kick,
.pred_Pass = Pass,
.pred_Punt = Punt,
.pred_Rush = Rush) %>%
mutate(.pred_class = case_when(
.pred_class == .pred_Kick ~ "Kick",
.pred_class == .pred_Pass ~ "Pass",
.pred_class == .pred_Punt ~ "Punt",
.pred_class == .pred_Rush ~ "Rush"
),.pred_class = factor(.pred_class,levels = c("Kick","Punt","Pass","Rush"))) %>%
bind_cols(testing_baked_KS %>% select(play_call))
The predictions data frames contain the probabilities of each classification along with the predicted and actual class. Several different methods were used to visualize the performance of each model.
Several metrics were calculated to evaluate each model. The accuracy of a model is simply the percentage of correct predictions made, where 0 is always wrong and 1 is always right. The Area Under the ROC Curve (AUC) is the average area under the ROC curves below, where 0.5 is a useless model and 1 is a perfect model. Kappa measures the agreement of the model with reality, similar to accuracy but adjusted for chance agreement. Log loss considers the probability assigned to the prediction along with the model's confidence in it: a correct prediction made with high probability is rewarded with a lower log loss than a correct prediction made with low probability, and high-confidence wrong predictions are punished. The final model will be used in a game simulator, so accurate probabilities matter more than raw accuracy; consequently, log loss is the most important metric for the model's performance.
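Formally, for $N$ predictions the multiclass log loss is the negative mean log probability assigned to the true class:

$$\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\log\big(p_{i,\,y_i}\big),$$

where $p_{i,y_i}$ is the probability the model assigned to the play that was actually called on play $i$. This is the quantity computed by mn_log_loss() below.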
modelMetrics <- function(predictDF,modelName) {
tibble(
"log_loss" =
mn_log_loss(predictDF, play_call, .pred_Kick:.pred_Rush) %>%
select(.estimate),
"accuracy" =
accuracy(predictDF, play_call, .pred_class) %>%
select(.estimate),
"auc" =
roc_auc(predictDF, play_call, .pred_Kick:.pred_Rush) %>%
select(.estimate),
"kap" =
kap(predictDF, play_call, .pred_class) %>%
select(.estimate)
) %>%
unnest(everything()) %>%
pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
mutate(model = modelName)
}
metrics_compare <- bind_rows(
modelMetrics(predictions_DD,"Down and Distance"),
modelMetrics(predictions_KS,"Kitchen Sink"),
modelMetrics(predictions_NN,"Neural Net"),
modelMetrics(predictions_CT,"Classification Tree"),
modelMetrics(predictions_RF,"Random Forest"),
)
metrics_compare %>%
mutate(model = fct_relevel(model,c("Down and Distance","Kitchen Sink","Neural Net","Classification Tree","Random Forest"))) %>%
ggplot(aes(fill = model, y = value, x = model)) +
geom_bar(position = "dodge", stat = "identity") +
scale_fill_manual(values = c("#000000","#888888","blue","purple","#CC0000")) +
theme_minimal() +
facet_wrap(vars(metric),scales = "free_y") +
labs(
y = "Value",
x = "Metric",
fill = NULL,
title = "Comparing Our Models",
subtitle = "Higher is Better: Accuracy, AUC and Kappa\nLower is Better: Log Loss"
) +
geom_text(aes(label = round(value, 3)), vjust = 1, size = 3, position = position_dodge(width= 0.9),color = "white") +
theme(axis.text.x = element_blank())
Interestingly, the Down and Distance and Kitchen Sink models had the same accuracy. This is likely because both models use the same algorithm and the down and distance variables probably carry the most weight in the Kitchen Sink model. The Kitchen Sink model performed better on AUC and Kappa, though, but worse on log loss. While the Neural Network had the highest accuracy of all the models, it performed at least six times worse on log loss than every other model. The Random Forest model was strictly better than the Classification Tree model on every metric and had the lowest log loss, the highest AUC, and the second-highest accuracy and Kappa.
Additionally, confusion matrices and ROC curves for each model were generated with the functions below to gain more insight into model performance. The diagonal of the confusion matrix from bottom left to top right gives the count of each correct prediction. The center line of the ROC plot is the performance of a useless model; the closer a curve is to the upper-left corner, the better the performance.
# Generates Confusion Matrix
confusionTab <- function(predictDF,modelName) {
predictDF %>%
conf_mat(play_call, .pred_class) %>%
pluck(1) %>%
as_tibble() %>%
ggplot(aes(Prediction, Truth, alpha = n)) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), colour = "white", alpha = 1, size = 8) +
theme_bw() +
theme(panel.grid.major = element_blank()) +
labs(
y = "Actual Play Type",
x = "Predicted Play Type",
fill = NULL,
title = "Confusion Matrix",
subtitle = paste0(modelName," Model")
)
}
# Plots ROC Curve
plotROC <- function(predictDF,modelName) {
predictDF %>%
roc_curve(play_call, .pred_Kick:.pred_Rush) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity,color = .level)) +
geom_path() +
geom_abline(lty = 3) +
scale_color_manual(values = c("#000000","#888888","blue","#CC0000")) +
coord_equal() +
theme_bw() +
labs(
y = "True Positive Rate (Sensitivity)",
x = "False Positive Rate",
fill = NULL,
title = "ROC Curve",
subtitle = paste0(modelName," Model")
)
}
confusionTab(predictions_DD,"Down and Distance")
plotROC(predictions_DD,"Down and Distance")
confusionTab(predictions_KS,"Kitchen Sink")
plotROC(predictions_KS,"Kitchen Sink")
confusionTab(predictions_NN,"Neural Net")
plotROC(predictions_NN,"Neural Net")
confusionTab(predictions_CT,"Classification Tree")
plotROC(predictions_CT,"Classification Tree")
confusionTab(predictions_RF,"Random Forest")
plotROC(predictions_RF,"Random Forest")
While the Neural Network had the highest accuracy of all the models, it was extremely confident in its wrong predictions, leading to an extremely high log loss. Since log loss is the most important metric for the model's eventual use in a game simulator, the Neural Network was not selected. After eliminating the Neural Network, the Random Forest model performed best on every metric and is the final selection.
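As a sketch of how the selected model might slot into the eventual game simulator, a hypothetical game state can be passed through the same recipe and a play call sampled from the predicted probabilities (the situation below is invented for illustration):
# Hypothetical situation: 3rd and 4 from the opponent's 35, down 3, under two minutes
situation <- tibble(
  score_diff = -3, yards_to_goal = 35, down = factor(3, levels = 1:4),
  distance = 4, under_two = TRUE,
  play_call = factor("Rush", levels = c("Kick","Punt","Pass","Rush")) # placeholder so bake() sees every column
)
baked_situation <- bake(recipe_KS_only, situation)
probs <- predict(logistic_RF, newdata = baked_situation, type = "prob")
sample(colnames(probs), size = 1, prob = probs[1, ]) # simulate one play call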
This analysis was limited in that it only considered the play calling of the University of Utah football team during the 2019 season. There are still many future avenues to explore to better understand and model football play calling:
- Do the different models still perform the same with data from different teams? How well can the model predict other teams' play calling?
- Is there a change in play calling over multiple seasons, or would including additional seasons improve model performance?
- Can play-calling data be used to cluster coaches together with an unsupervised machine learning method, and would a model trained on the combined play-by-play data be accurate for the whole group of coaches?