Baseball Stats Econometrics Tutorial

The purpose of this study is to use regression analysis on baseball statistics to determine the deciding factor in a team’s success, measured as wins in the regular season. To do so, sample data was taken from the Cincinnati Reds from 1970 to 1991. Wins was the dependent variable, and fielding percentage, earned run average (ERA), batting average (BA) and walks were the tested independent variables.

To determine the effectiveness of each variable, the best model needs to be found. This means testing the data as time series, differencing variables where needed to make them stationary, and testing for other variable effects.

*Side note: The 1981 data has been excluded, as it was an incomplete year due to a strike. This leaves 21 years’ worth of observations for 5 variables.

To start building the model, the order of integration of each variable needs to be determined.

BA = team batting average

ERA = team earned run average

Fielding = team fielding percentage

Walks = total walks for the team

 

All the variables appear fairly level, with only slight variation in the data. Given this, and given the kind of data being tested, the variables are taken to be stationary. Further support comes from the unit root test of fielding:

 

Null Hypothesis: FIELDING has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on SIC, MAXLAG=4)
t-Statistic Prob.*
Augmented Dickey-Fuller test statistic -4.424716 0.0029

The test statistic is significant at the 1% level, so the null hypothesis of a unit root is rejected.
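With a lag length of 0, the augmented Dickey-Fuller regression reduces to regressing the change in the series on a constant and the lagged level, and the test statistic is the t-ratio on the lagged level. The sketch below shows that calculation using only the Python standard library. The function name and both series are made up for illustration (they are not the Reds' data), and the statistic has to be compared against Dickey-Fuller critical values rather than the usual t table.

```python
import math
import random

def dickey_fuller_tstat(y):
    """t-ratio on rho in: dy_t = alpha + rho * y_{t-1} + e_t (constant, no lagged differences)."""
    x = y[:-1]                                    # lagged level y_{t-1}
    dy = [b - a for a, b in zip(y[:-1], y[1:])]   # first difference y_t - y_{t-1}
    n = len(dy)
    x_bar = sum(x) / n
    dy_bar = sum(dy) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (di - dy_bar) for xi, di in zip(x, dy))
    rho = sxy / sxx                               # OLS slope
    alpha = dy_bar - rho * x_bar                  # OLS intercept
    ssr = sum((di - alpha - rho * xi) ** 2 for xi, di in zip(x, dy))
    se_rho = math.sqrt(ssr / (n - 2) / sxx)       # standard error of the slope
    return rho / se_rho

# Synthetic illustration only: noise around a fixed level vs. a random walk.
random.seed(0)
level_series = [0.980 + random.gauss(0, 0.002) for _ in range(21)]
random_walk = [0.0]
for _ in range(20):
    random_walk.append(random_walk[-1] + random.gauss(0, 1))

print("level series:", dickey_fuller_tstat(level_series))
print("random walk: ", dickey_fuller_tstat(random_walk))
```

A strongly negative statistic, like the ones reported for fielding and ERA, counts as evidence against a unit root.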

ERA:

 

Null Hypothesis: ERA has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on SIC, MAXLAG=4)
t-Statistic Prob.*
Augmented Dickey-Fuller test statistic -4.806399 0.0013

 

The test statistic is significant at the 1% level, so the null hypothesis of a unit root is rejected.

BA:

 

Null Hypothesis: BA has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on SIC, MAXLAG=4)
t-Statistic Prob.*
Augmented Dickey-Fuller test statistic -2.866763 0.0680

It is significant only at the 10% level, so the evidence against a unit root is weaker for BA.

While tests may suggest that walks and wins are I(1), the graphs and the type of data suggest that they can be treated as I(0).

 

With this taken into consideration, the following model is formed.

 

Variable     Coefficient   Std. Error   t-Statistic    Prob.
C              -567.1794     809.8831     -0.700323   0.4938
ERA            -0.352712     2.965322     -0.118946   0.9068
BA              451.1626     228.4446      1.974932   0.0658
FIELDING        509.0407     840.6211      0.605553   0.5533
WALKS           0.071144     0.028535      2.493273   0.0240

R-squared             0.599605   Mean dependent var     87.28571
Adjusted R-squared    0.499506   S.D. dependent var     11.82854
S.E. of regression    8.368167   Akaike info criterion  7.291004
Sum squared resid     1120.419   Schwarz criterion      7.539699
Log likelihood       -71.55554   Hannan-Quinn criter.   7.344977
F-statistic           5.990136   Durbin-Watson stat     1.143190
Prob(F-statistic)     0.003829
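The summary statistics in the table above are tied together by standard formulas, so they can be cross-checked. The sketch below recomputes the adjusted R-squared and the F-statistic from the reported R-squared. It assumes n = 21 observations (from the side note on the 1981 season) and k = 5 estimated coefficients including the constant; the function names are illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k coefficients (constant included)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

def f_statistic(r2, n, k):
    """F-statistic for the joint significance of the k - 1 slope coefficients."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

n, k = 21, 5       # assumed: 21 seasons, 4 regressors plus the constant
r2 = 0.599605      # R-squared as reported in the table

print("adjusted R-squared:", round(adjusted_r2(r2, n, k), 6))
print("F-statistic:       ", round(f_statistic(r2, n, k), 4))
```

Both results should agree with the table up to rounding of the reported R-squared. The gap between the R-squared and the adjusted R-squared shows how much the small sample penalises each added regressor.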

 

However, this first model has some high prob. values, a low R-squared and some high standard errors. To correct this, a better model should be sought by adding lags or removing insignificant variables. Adding lags gave this model:

 

Variable     Coefficient   Std. Error   t-Statistic    Prob.
C              -2033.497     616.2475     -3.299806   0.0063
WALKS           0.098112     0.019479      5.036708   0.0003
FIELDING        2126.658     650.9779      3.266867   0.0067
ERA             4.648875     2.019448      2.302052   0.0400
ERA(-1)         5.544185     1.682273      3.295651   0.0064
BA             -5.722406     174.8243     -0.032732   0.9744
BA(-1)         -213.3401     119.4420     -1.786140   0.0993

R-squared             0.847717   Mean dependent var     87.89474
Adjusted R-squared    0.771575   S.D. dependent var     10.23010
S.E. of regression    4.889355   Akaike info criterion  6.289308
Sum squared resid     286.8695   Schwarz criterion      6.637259
Log likelihood       -52.74842   Hannan-Quinn criter.   6.348195
F-statistic           11.13342   Durbin-Watson stat     2.865230
Prob(F-statistic)     0.000264
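Adding a one-year lag shortens the usable sample, and the excluded 1981 season costs an extra observation because 1982 then has no previous-year value. The sketch below illustrates this, assuming the lag is taken by calendar year; that assumption is consistent with the mean of the dependent variable changing between the two tables. The ERA values are placeholders, not the Reds' figures, and `with_lag` is a hypothetical helper.

```python
def with_lag(series_by_year):
    """Pair each year's value with the previous calendar year's value, when it exists."""
    rows = {}
    for year, value in series_by_year.items():
        previous = series_by_year.get(year - 1)
        if previous is not None:
            rows[year] = (value, previous)
    return rows

# 1970-1991 with the strike season removed, as described in the side note.
years = [y for y in range(1970, 1992) if y != 1981]

# Placeholder numbers only, used to show which years survive the lag.
era = {y: 3.5 + 0.01 * (y - 1970) for y in years}

usable = with_lag(era)
print(len(years), len(usable))   # 21 19
print(sorted(set(years) - set(usable)))   # [1970, 1982]
```

Under this assumption the lagged models are estimated on 19 observations rather than 21.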

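This lagged model is better because the R-squared is higher, the Akaike info criterion and the Schwarz criterion are lower, and the std. errors and most of the prob. values are lower. Note that the lags shorten the sample (the mean of the dependent variable changes), so the criteria are not computed on exactly the same observations as before. However, BA is now highly insignificant (prob. 0.9744). Removing BA and its lag gives a better model: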

 

Variable     Coefficient   Std. Error   t-Statistic    Prob.
C              -1734.346     530.3656     -3.270096   0.0056
WALKS           0.094522     0.015640      6.043573   0.0000
FIELDING        1773.763     541.3628      3.276477   0.0055
ERA             3.721749     1.522243      2.444911   0.0283
ERA(-1)         4.375090     1.550882      2.821034   0.0136

R-squared             0.806609   Mean dependent var     87.89474
Adjusted R-squared    0.751355   S.D. dependent var     10.23010
S.E. of regression    5.101170   Akaike info criterion  6.317751
Sum squared resid     364.3070   Schwarz criterion      6.566287
Log likelihood       -55.01863   Hannan-Quinn criter.   6.359813
F-statistic           14.59809   Durbin-Watson stat     2.226007
Prob(F-statistic)     0.000067

Though the Akaike info criterion goes up slightly and the R-squared falls, the Schwarz criterion decreases, all of the remaining variables are significant at the 5% level and the coefficient std. errors go down.
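The two criteria move in opposite directions because the Schwarz criterion penalises each extra coefficient more heavily than the Akaike criterion. The sketch below reproduces both from the reported log likelihoods, using the per-observation form of the formulas. It assumes n = 19 observations for both lagged models, a figure inferred from the tables rather than stated in the text.

```python
import math

def akaike(loglik, n, k):
    """Akaike info criterion per observation: (-2 * logL + 2k) / n."""
    return (-2 * loglik + 2 * k) / n

def schwarz(loglik, n, k):
    """Schwarz criterion per observation: (-2 * logL + k * ln(n)) / n."""
    return (-2 * loglik + k * math.log(n)) / n

n = 19  # assumed sample size after lagging (inferred, not stated in the text)
models = {
    "with BA and BA(-1)": (-52.74842, 7),   # (log likelihood, number of coefficients)
    "without BA terms":   (-55.01863, 5),
}

for name, (loglik, k) in models.items():
    print(name, round(akaike(loglik, n, k), 4), round(schwarz(loglik, n, k), 4))
```

Dropping the two BA terms lowers the log likelihood slightly. With n = 19, ln(n) is close to 2.94, so the Schwarz penalty saved per dropped coefficient is larger than the Akaike penalty of 2, which is why only the Schwarz criterion improves.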

To validate the model, a serial correlation test was conducted to see if more lags were needed.

 

Breusch-Godfrey Serial Correlation LM Test:
F-statistic      0.406332   Prob. F(2,12)         0.6749
Obs*R-squared    1.205107   Prob. Chi-Square(2)   0.5474

The test is not significant (prob. 0.6749 and 0.5474), so there is no evidence of remaining serial correlation and no additional lags are needed.
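The two lines of the test output are linked, which gives a quick consistency check. The sketch below recovers the auxiliary-regression R-squared from Obs*R-squared, rebuilds the F-statistic from it, and computes the chi-square p-value. It assumes n = 19 observations, k = 5 coefficients in the final model and p = 2 lagged residuals, as implied by the F(2,12) degrees of freedom. With 2 degrees of freedom the chi-square tail probability is exactly exp(-x/2).

```python
import math

n, k, p = 19, 5, 2     # assumed: observations, model coefficients, lagged residuals tested
obs_r2 = 1.205107      # Obs*R-squared as reported

r2_aux = obs_r2 / n    # R-squared of the auxiliary regression on the residuals
f_stat = (r2_aux / p) / ((1 - r2_aux) / (n - k - p))
p_chi2 = math.exp(-obs_r2 / 2)   # chi-square upper tail, valid for 2 degrees of freedom only

print("auxiliary R-squared:", round(r2_aux, 4))
print("F-statistic:        ", round(f_stat, 4))
print("Prob. Chi-Square(2):", round(p_chi2, 4))
```

The lagged residuals explain only about 6% of the variation in the current residuals, which is why neither version of the test comes close to significance.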

 

In conclusion, walks, ERA and the team’s previous-year ERA have small coefficients. As a side note, opposing batters entering a new year may carry psychological expectations of the pitchers they face, which is one possible justification for the lag. Fielding, however, has the largest coefficient, 1773.763, suggesting it is the most determining factor in a team’s wins, although the size of a coefficient also depends on the units of its variable (fielding percentage is a proportion, while walks is a season total). The model is credible in that outside information about these variables could back this claim.
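To read the coefficients in terms of wins, each one can be multiplied by a plausible change in its variable. The sketch below does this for the final model. The coefficients come from the last regression table, but the sizes of the changes are illustrative assumptions, not values taken from the data.

```python
# Coefficients from the final model.
coefficients = {
    "WALKS": 0.094522,
    "FIELDING": 1773.763,
    "ERA": 3.721749,
    "ERA(-1)": 4.375090,
}

# Hypothetical changes, chosen only to illustrate the scale of each variable.
changes = {
    "WALKS": 10,         # ten more walks over the season
    "FIELDING": 0.001,   # one extra point of fielding percentage
    "ERA": 0.10,         # a tenth of a run
    "ERA(-1)": 0.10,
}

effects = {name: coefficients[name] * changes[name] for name in coefficients}
for name, effect in effects.items():
    print(f"{name:9s} change of {changes[name]:<6} -> {effect:.3f} wins")
```

Under these assumed changes, one point of fielding percentage corresponds to about 1.77 wins and ten walks to about 0.95 wins, so the comparison between variables depends on what counts as a realistic change in each.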

 

Note: The study could be expanded to include other variables.