Drive for Show, Approach for Dough?

Ryan Kyaw
May 10, 2020

Drive for show, putt for dough. Every golfer has heard this adage in some form, but is it true? Using data from the 2017 PGA Tour Season along with a linear regression analysis, we are going to see that putting might not be the main recipe for lower scores.

Strokes Gained was introduced in the early 21st century, and it has emerged as arguably the best statistic for identifying the best golfers in the world. Strokes Gained can be divided into four categories based on the general type of shot a player faces. SG:OTT (off the tee) describes any tee shot on a par 4 or 5, SG:APR (approach) describes any approach shot onto a green from at least 30 yards out, SG:ARG (around the green) describes any shot onto a green from inside 30 yards, and SG:Putting describes what happens on the greens. We can use some exploratory data analysis to see whether a relationship exists between each of the Strokes Gained categories and a player's average score in the 2017 PGA Tour season. I have included the R code throughout and a link to the dataset I used at the end of the post.

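library(ggplot2)

# df_2017 is the 2017 season data frame (dataset linked at the end of the post).
# Map columns by name inside aes() rather than with df_2017$...
ggplot(df_2017, aes(x = SG.OTT, y = AVG_SCORE)) +
  geom_point(color = "brown") +
  ggtitle("SG:OTT vs Average Score") +
  xlab("SG:OTT") +
  ylab("Average Score") +
  geom_smooth(method = "lm", color = "blue")  # least-squares line of best fit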

Correlation Coefficient: -0.6471942

ggplot(df_2017, aes(x = SG.APR, y = AVG_SCORE)) +
  geom_point(color = "red") +
  ggtitle("SG:APR vs Average Score") +
  xlab("SG:APR") +
  ylab("Average Score") +
  geom_smooth(method = "lm", color = "blue")

Correlation Coefficient: -0.7094093

ggplot(df_2017, aes(x = SG.ARG, y = AVG_SCORE)) +
  geom_point(color = "purple") +
  ggtitle("SG:ARG vs Average Score") +
  xlab("SG:ARG") +
  ylab("Average Score") +
  geom_smooth(method = "lm", color = "red")

Correlation Coefficient: -0.5021516

ggplot(df_2017, aes(x = SG_PUTTING_PER_ROUND, y = AVG_SCORE)) +
  geom_point(color = "blue") +
  ggtitle("SG:Putting vs Average Score") +
  xlab("SG:Putting") +
  ylab("Average Score") +
  geom_smooth(method = "lm", color = "red")

Correlation Coefficient: -0.3028644

An interesting observation emerges from these scatterplots and their lines of best fit: of the four categories, putting has the weakest correlation with average score. The correlation coefficients listed under each graph confirm this. In fact, both short-game categories (putting and around the green) trail tee shots and approach shots in how strongly they are associated with lower scores.
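The correlation coefficients listed under each plot can be computed with base R's `cor()`. Here is a minimal sketch, assuming the column names used in the plotting code above. The `toy` data frame is simulated purely so the snippet runs on its own (it is not the real 2017 data), and `sg_correlations` is a hypothetical helper name rather than something from the original analysis.

```r
# Hypothetical helper: Pearson correlation of each Strokes Gained column
# with average score. Column names follow the plotting code in this post.
sg_correlations <- function(df,
                            sg_cols = c("SG.OTT", "SG.APR", "SG.ARG",
                                        "SG_PUTTING_PER_ROUND"),
                            score_col = "AVG_SCORE") {
  sapply(sg_cols, function(col) cor(df[[col]], df[[score_col]]))
}

# Simulated stand-in data (NOT the real 2017 numbers), only so this runs.
set.seed(1)
n <- 200
toy <- data.frame(
  SG.OTT = rnorm(n, sd = 0.4),
  SG.APR = rnorm(n, sd = 0.4),
  SG.ARG = rnorm(n, sd = 0.2),
  SG_PUTTING_PER_ROUND = rnorm(n, sd = 0.3)
)
# In the toy data, score falls as strokes gained rise, plus some noise.
toy$AVG_SCORE <- 71 - rowSums(toy) + rnorm(n, sd = 0.3)

round(sg_correlations(toy), 3)
```

With the real data loaded, the call would simply be `sg_correlations(df_2017)`.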

In this rare case, we are not actually making any statistical predictions. We have the data for the whole 2017 PGA Tour season, so drawing conclusions directly from these graphs would be plausible. Still, we will fit a linear model for each Strokes Gained category to check that a linear model is appropriate for these relationships. Valid linear models would support the relationships we are seeing in our graphs.

SG:OTT vs Average Score

lm_drive <- lm(AVG_SCORE ~ SG.OTT, data = df_2017)
summary(lm_drive)

## 
## Call:
## lm(formula = AVG_SCORE ~ SG.OTT, data = df_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72500 -0.38127 -0.03402  0.38564  2.09652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 70.94710    0.04515 1571.49   <2e-16 ***
## SG.OTT      -1.26997    0.10768  -11.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6272 on 193 degrees of freedom
## Multiple R-squared:  0.4189, Adjusted R-squared:  0.4158 
## F-statistic: 139.1 on 1 and 193 DF,  p-value: < 2.2e-16

SG:APR vs Average Score

lm_APR <- lm(AVG_SCORE ~ SG.APR, data = df_2017)
summary(lm_APR)

## 
## Call:
## lm(formula = AVG_SCORE ~ SG.APR, data = df_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59364 -0.29726 -0.01785  0.31393  1.98287 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 70.98417    0.04203 1688.85   <2e-16 ***
## SG.APR      -1.46972    0.10510  -13.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5799 on 193 degrees of freedom
## Multiple R-squared:  0.5033, Adjusted R-squared:  0.5007 
## F-statistic: 195.5 on 1 and 193 DF,  p-value: < 2.2e-16

SG:ARG vs Average Score

lm_ARG <- lm(AVG_SCORE ~ SG.ARG, data = df_2017)
summary(lm_ARG)

## 
## Call:
## lm(formula = AVG_SCORE ~ SG.ARG, data = df_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.95838 -0.41526 -0.06013  0.32838  3.01056 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 70.95327    0.05149 1377.980  < 2e-16 ***
## SG.ARG      -1.86555    0.23126   -8.067 7.48e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7115 on 193 degrees of freedom
## Multiple R-squared:  0.2522, Adjusted R-squared:  0.2483 
## F-statistic: 65.08 on 1 and 193 DF,  p-value: 7.484e-14

SG:Putting vs Average Score

lm_putt <- lm(AVG_SCORE ~ SG_PUTTING_PER_ROUND, data = df_2017)
summary(lm_putt)

## 
## Call:
## lm(formula = AVG_SCORE ~ SG_PUTTING_PER_ROUND, data = df_2017)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8340 -0.4115 -0.0352  0.3846  3.6416 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          70.93431    0.05691 1246.361  < 2e-16 ***
## SG_PUTTING_PER_ROUND -0.79180    0.17935   -4.415 1.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7841 on 193 degrees of freedom
## Multiple R-squared:  0.09173,    Adjusted R-squared:  0.08702 
## F-statistic: 19.49 on 1 and 193 DF,  p-value: 1.681e-05

Because we are not analyzing the predictive strength of our linear models, we only need to consider the p-value of each model. Remember, a very low p-value gives us enough evidence to reject our null hypothesis. In this case, the null hypothesis is that there is no linear relationship between the Strokes Gained category and average score, that is, that the slope is zero. Every one of these models has a p-value very close to zero, so we can reject the corresponding null hypothesis in each case.
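Rather than reading the p-values off the printed summaries, they can be pulled out of the fitted model directly. The sketch below assumes a simple one-predictor `lm` fit like the ones above. `slope_summary` is a hypothetical helper name, and the `toy` data is simulated only so the example is self-contained.

```r
# Hypothetical helper: slope, its p-value, and R-squared from a
# one-predictor lm fit.
slope_summary <- function(fit) {
  s <- summary(fit)
  coefs <- s$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
  c(slope     = coefs[2, "Estimate"],   # row 1 is the intercept, row 2 the slope
    p_value   = coefs[2, "Pr(>|t|)"],
    r_squared = s$r.squared)
}

# Simulated stand-in data (NOT the real 2017 numbers), only so this runs.
set.seed(2)
toy <- data.frame(SG.APR = rnorm(100, sd = 0.4))
toy$AVG_SCORE <- 71 - 1.5 * toy$SG.APR + rnorm(100, sd = 0.6)

fit <- lm(AVG_SCORE ~ SG.APR, data = toy)
slope_summary(fit)
```

On the real models this would be `slope_summary(lm_APR)` and so on. Note that for a single-predictor model, R-squared is just the square of the correlation coefficient, which is why the SG:APR summary reports 0.5033, approximately (-0.7094)^2.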

We can therefore conclude that there is a linear relationship between every Strokes Gained category and average score. This further highlights the strength of the Strokes Gained metric when it comes to predicting which players score lower than others. It also supports what we saw in the graphs: there is more to improving scores than putting.

Now, if we only considered our graphs, we could conclude that approach shots are the most important part of improving scores. However, I would hesitate to conclude that one part of the game is more important than another. There is a much higher probability of penalty strokes after a tee shot or an approach shot than after a putt or a chip, where there is practically no risk of penalty at all. This suggests that there is a larger range of SG:OTT and SG:APR values among a field of golfers than SG:Putting values. The more penalty strokes, the more likely scores are to rise. When putting, however, penalty strokes are practically impossible, which suggests that it is harder to add strokes to a score on the green than with tee shots or approach shots.
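The claim about a larger range of SG:OTT and SG:APR values is something the data can check directly by comparing the spread of each column. Below is a minimal sketch: `sg_spread` is a hypothetical helper, and the simulated standard deviations in `toy` are arbitrary placeholders, not estimates from the 2017 season.

```r
# Hypothetical helper: standard deviation and range width of each column,
# one row per Strokes Gained category.
sg_spread <- function(df, sg_cols) {
  t(sapply(sg_cols, function(col) {
    x <- df[[col]]
    c(sd = sd(x), range_width = diff(range(x)))
  }))
}

# Simulated stand-in data (NOT the real 2017 numbers), only so this runs.
set.seed(3)
toy <- data.frame(
  SG.OTT = rnorm(200, sd = 0.4),
  SG_PUTTING_PER_ROUND = rnorm(200, sd = 0.3)
)

sg_spread(toy, c("SG.OTT", "SG_PUTTING_PER_ROUND"))
```

With the real data, passing all four Strokes Gained column names to `sg_spread(df_2017, ...)` would show whether the tee-shot and approach categories really do vary more across players than putting does.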

Ultimately, putting is a pivotal part of golf. Making or missing a midrange 6-12 foot putt could be the difference between winning and losing a tournament. However, I believe that this analysis highlights the importance of avoiding the big mistake. Our Strokes Gained analysis suggests that avoiding the big, penalizing miss is just as important in lowering scores as putting or chipping. Being cautiously aggressive and understanding when to be conservative is a course management concept that can be applied every round. While the quality of your golf shots and putting can vary from round to round, understanding your game and making good decisions are things you can control every round. Especially for the better players, this will be the way to lower scores.

So, we can scrap the whole "drive for show, 'x' for dough" saying. In the long run, strategy and understanding yourself will be just as important as making that 10-foot putt.

Link to the dataset that I used: https://www.kaggle.com/grantruedy/pga-tour-golf-data-2017-season
