Cleaning CFB play-by-play Data

The purpose of this code is to clean up college football play-by-play data to play nicely with Puntalytics RERUN code and generate the plot I tweeted in response to their article

First we need to load the necessary packages. Tidyverse because always tidyverse. cfbscrapR to pull college football play-by-play data. And then ggimage and ggrepel to make some pretty graphs.

library(tidyverse)
# # Can install cfbscrapR using the devtools package from either of the following:
# devtools::install_github(repo = "saiemgilani/cfbscrapR")
# # or the following (these are the exact same packages):
# devtools::install_github(repo = "meysubb/cfbscrapR")
library(cfbscrapR)
library(ggimage)
library(ggrepel)

First we need to pull in the play-by-play data. This can be done with cfbscrapR’s cfb_pbp_data function. It take a few minutes to pull all of the data so I pre-loaded it into an RDS file to save time.

# pbp_2019 <- cfb_pbp_data(year=2019, season_type = "regular", week=NULL, epa_wpa = TRUE)
pbp_2019 <- readRDS("pbp_2019.rds")

We start by filtering our play-by-play data down to just the punting plays. and then select variables we care about. We’ll also ignore punts that had a penalty. During my first look at those plays, there wasn’t an obvious consistent way to pull out the punt data we wanted but with some actual time and effort, there’s likely a way to get at the data.

punts <- pbp_2019 %>%
  filter(punt == 1)
punts <- punts %>%
  select(offense_play,defense_play,play_type,play_text,yards_gained,yards_to_goal,yards_to_goal_end) %>%
  filter(!(str_detect(play_text,"enalty"))) # Filter out penalty plays to save a headache

Now we can start transforming our data to match what we need for Puntalytics’ code. Using a few str_detects we can figure out if a punt was returned or not. In our data, on a punt, the yards_to_goal_end column is typically the returning team’s yards to goal at the end of the play. So to calculate the net punt yards we can subtract 100 minus that column from the yards to goal at the start of the play of the kicking team. For touchbacks, sometimes the yards to goal at the end was 20 and sometimes it was 80 so a quick ifelse can adjust that for us. The return yards were typically found in the yards_gained column. On a handful of punts, this seemed to break and so I filtered out any punts over 80 yards as being unrealistic. Digging into those punts and cleaning them is future work that could improve this analysis.

punts <- punts %>% 
  mutate(punt_out_of_bounds = ifelse(str_detect(play_text,"bounds"),1,0),
         punt_downed = ifelse(str_detect(play_text,"downed"),1,0),
         punt_fair_catch = ifelse(str_detect(play_text,"fair catch"),1,0),
         touchback = ifelse(str_detect(play_text,"ouchback"),1,0),
         net_punt = yards_to_goal - (100-yards_to_goal_end),
         net_punt = ifelse(yards_to_goal_end == 20 & touchback == 1,net_punt+60,net_punt),
         GrossYards = net_punt + yards_gained,
         return_yards = yards_gained,
         punter = str_extract(play_text,".{0,25} punt"),
         punter = str_remove(punter," punt")
         ) %>%
  filter(GrossYards >= 0 & GrossYards < 80) %>%
  mutate(returned = pmap_dbl(list(punt_out_of_bounds==0 &
                                    punt_downed==0 &
                                    punt_fair_catch==0 &
                                    touchback==0,
                                  1, 0),
                             if_else))

Copying Code

Now we can start copying code! Read Puntaltyics’s article to actually understand what we’re doing here and what RERUN is.

punts <- punts %>%
  mutate(GrossYards_r = ifelse(returned==1, 
                                      GrossYards, NA) )%>% 
  mutate(return_yards_r = ifelse(returned==1,
                                        return_yards, NA))



ggplot(data=punts, mapping=aes(GrossYards_r, return_yards_r)) +
  geom_point(alpha = 0.06) +
  geom_smooth(method="loess", span=0.65) +
  labs(title = "How far punts of various lengths are typically returned",
       subtitle = "(when they're returned at all!)",
       y="Return Yards", x="Gross Yards", caption="figure @ThePuntRunts | data @nflfastR") +
  theme_bw()

The clear diagonal line here is concerning and is another spot worth investigating in the future to improve the analysis.

punts <- punts %>%
  mutate(return_smooth = loess(formula = return_yards_r ~ GrossYards_r,
                               data = punts,
                               span=0.65,
                               na.action = na.exclude) %>% predict)
punts <- punts %>%
  mutate(RERUN = pmap_dbl(list(returned==1,
                               GrossYards_r - return_smooth, GrossYards),
                          if_else))
punts %>% summarise(Net = mean(net_punt),
                    RERUN = mean(RERUN))
## # A tibble: 1 x 2
##     Net RERUN
##   <dbl> <dbl>
## 1  38.2  38.5

Again, the Net and RERUN are slightly off which is concerning and worth looking deeper into.

mini <- punts %>%
  group_by(punter,offense_play) %>%
  filter(n() > 32) %>%
  summarise(RERUN = mean(RERUN),
            Gross = mean(GrossYards),
            Net = mean(net_punt),
            NumPunts= n()) %>% 
  rename(Name = punter) %>%
  arrange(desc(RERUN))

mini %>%
  mutate(across(is.numeric, round, 1)) %>%
  select(Name, RERUN, Net, Gross, NumPunts) %>%
  rmarkdown::paged_table()

Now for the final plot, we’ll need a bit more info from cfbscrapR. The cfb_team_info function will return a nice dataframe with all of the teams, their colors, and their logos. The logos come in a list which requires a bit more cleaning. We can then filter by teams in the P5 conferences and inner_join to get only the P5 punters.

p5teams <- cfbscrapR::cfb_team_info(year = 2019)
p5teams <- p5teams %>% 
  filter(conference %in% c("Pac-12","SEC","Big Ten","Big 12","ACC")) %>%
  mutate(logo = map(logos,magrittr::extract2,1),
         logo = as.character(logo)) %>% 
  select(school,color,logo)
mini2 <- mini %>%
  inner_join(p5teams,by = c("offense_play" = "school")) %>%
  mutate(last = str_extract(Name," .{0,25}"),
         firstI = str_extract(Name,"^."),
         Name = paste0(firstI,". ",last))
colors <- mini2 %>%
  select(offense_play,color) %>%
  arrange(offense_play) %>% 
  pull(color)
ggplot(mini2, mapping=aes(Net, RERUN, label = Name)) +
  geom_abline() +
  ggimage::geom_image(aes(image = logo)) +
  ggrepel::geom_text_repel(aes(color = offense_play),size=3, point.padding = 0.6, force=1.25) +
  annotate("text", x=42.5, y=35.5, label="Helped by coverage team", size=5, fontface = "bold") +
  annotate("text", x=35.5, y=44.5, label="Hurt by coverage team", size=5, fontface = "bold") +
  scale_color_manual(values = colors) +
  labs(title = "Were punters in 2019 helped or hurt by their coverage teams?",
       caption="Figure: @ThePuntRunts, Modified by @JaredDLee | Data: @cfbscrapR & @CFBData") +
  theme_bw() +
  theme(legend.position = "none")