Baseball Stats Econometrics Tutorial
The purpose of this study is to use regression analysis on baseball statistics to determine the deciding factor in a team’s success, measured by regular-season wins. Sample data were taken from the Cincinnati Reds from 1970 to 1991. Wins was the dependent variable; fielding percentage, earned run average (ERA), batting average (BA) and walks were the tested independent variables.
To determine the effectiveness of each variable, the best model first needs to be found. This means treating the data as time series, differencing any non-stationary variables to make them stationary, and testing for other variable effects.
*Side note: The 1981 data have been excluded, as it was an incomplete year due to a strike. This leaves 21 years’ worth of observations for 5 variables.
To start building the model, each variable’s order of integration needs to be determined.
BA = team batting average
ERA = team earned run average
Fielding = team fielding percentage
Walks = total walks for the team
All the variables appear fairly level, with only slight variation in the data. Given this, and given the kind of data being tested, it is reasonable to treat the variables as stationary. Further support comes from the unit root test of fielding:
| Null Hypothesis: FIELDING has a unit root |
| Exogenous: Constant |
| Lag Length: 0 (Automatic based on SIC, MAXLAG=4) |
| | t-Statistic | Prob.* |
| Augmented Dickey-Fuller test statistic | -4.424716 | 0.0029 |
The test statistic is significant at the 1% level, so the unit root null is rejected.
ERA:
| Null Hypothesis: ERA has a unit root |
| Exogenous: Constant |
| Lag Length: 0 (Automatic based on SIC, MAXLAG=4) |
| | t-Statistic | Prob.* |
| Augmented Dickey-Fuller test statistic | -4.806399 | 0.0013 |
The test statistic is significant at the 1% level, so the unit root null is rejected.
BA:
| Null Hypothesis: BA has a unit root |
| Exogenous: Constant |
| Lag Length: 0 (Automatic based on SIC, MAXLAG=4) |
| | t-Statistic | Prob.* |
| Augmented Dickey-Fuller test statistic | -2.866763 | 0.0680 |
The test statistic is significant only at the 10% level, so the evidence against a unit root is weaker for BA than for fielding and ERA.
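With a lag length of 0 and a constant, each test above reduces to a plain Dickey-Fuller regression of the first difference on a constant and the lagged level, Δy_t = α + γ·y_{t-1} + e_t, and the reported statistic is the t-ratio on γ. The sketch below shows that calculation in plain Python. The series is made up for illustration (it is not the Reds data) and the function name is hypothetical. The resulting statistic has to be judged against Dickey-Fuller critical values rather than the usual t table, which this sketch does not attempt.

```python
import math

def df_t_statistic(series):
    """Return (gamma, t-ratio) for dy_t = alpha + gamma * y_{t-1} + e_t, with no lagged differences."""
    x = series[:-1]                                          # lagged level y_{t-1}
    dy = [b - a for a, b in zip(series[:-1], series[1:])]    # first difference
    n = len(x)
    x_bar = sum(x) / n
    dy_bar = sum(dy) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (di - dy_bar) for xi, di in zip(x, dy))
    gamma = sxy / sxx                                        # OLS slope on the lagged level
    alpha = dy_bar - gamma * x_bar                           # OLS intercept
    ssr = sum((di - alpha - gamma * xi) ** 2 for xi, di in zip(x, dy))
    se_gamma = math.sqrt(ssr / (n - 2) / sxx)                # standard error of the slope
    return gamma, gamma / se_gamma

# Hypothetical, made-up series used only to exercise the function.
toy_series = [5, 3, 6, 2, 5, 4, 6, 3, 5, 4]
gamma_hat, t_stat = df_t_statistic(toy_series)
print(f"gamma = {gamma_hat:.4f}, t = {t_stat:.4f}")
```

A strongly negative t-ratio, as in the fielding and ERA tables, points toward rejecting the unit root.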
While tests may suggest that walks and wins are I(1), the graphs and the nature of the data indicate that they can be treated as I(0).
With this taken into consideration, the following model is formed.
| Variable | Coefficient | Std. Error | t-Statistic | Prob. |
| C | -567.1794 | 809.8831 | -0.700323 | 0.4938 |
| ERA | -0.352712 | 2.965322 | -0.118946 | 0.9068 |
| BA | 451.1626 | 228.4446 | 1.974932 | 0.0658 |
| FIELDING | 509.0407 | 840.6211 | 0.605553 | 0.5533 |
| WALKS | 0.071144 | 0.028535 | 2.493273 | 0.0240 |
| R-squared | 0.599605 | Mean dependent var | 87.28571 | |
| Adjusted R-squared | 0.499506 | S.D. dependent var | 11.82854 | |
| S.E. of regression | 8.368167 | Akaike info criterion | 7.291004 | |
| Sum squared resid | 1120.419 | Schwarz criterion | 7.539699 | |
| Log likelihood | -71.55554 | Hannan-Quinn criter. | 7.344977 | |
| F-statistic | 5.990136 | Durbin-Watson stat | 1.143190 | |
| Prob(F-statistic) | 0.003829 | |||
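The summary statistics in this table are linked by standard formulas, which gives a quick way to check the output. The sketch below recomputes the adjusted R-squared and the F-statistic from the reported R-squared, assuming n = 21 observations and k = 4 slope regressors (ERA, BA, FIELDING, WALKS). The function names are illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k slope regressors plus a constant."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2, n, k):
    """Overall F-statistic for the null that all k slope coefficients are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

r2, n, k = 0.599605, 21, 4          # values taken from the table above
adj = adjusted_r2(r2, n, k)
f = f_statistic(r2, n, k)
print(round(adj, 6), round(f, 4))
```

Both values agree with the table, which is consistent with the regression using all 21 seasons.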
However, this model has some high p-values (Prob.), a low R-squared and some high standard errors. To correct this, a better model should be sought by adding lags or removing insignificant variables. Adding lags gave this model:
| Variable | Coefficient | Std. Error | t-Statistic | Prob. |
| C | -2033.497 | 616.2475 | -3.299806 | 0.0063 |
| WALKS | 0.098112 | 0.019479 | 5.036708 | 0.0003 |
| FIELDING | 2126.658 | 650.9779 | 3.266867 | 0.0067 |
| ERA | 4.648875 | 2.019448 | 2.302052 | 0.0400 |
| ERA(-1) | 5.544185 | 1.682273 | 3.295651 | 0.0064 |
| BA | -5.722406 | 174.8243 | -0.032732 | 0.9744 |
| BA(-1) | -213.3401 | 119.4420 | -1.786140 | 0.0993 |
| R-squared | 0.847717 | Mean dependent var | 87.89474 | |
| Adjusted R-squared | 0.771575 | S.D. dependent var | 10.23010 | |
| S.E. of regression | 4.889355 | Akaike info criterion | 6.289308 | |
| Sum squared resid | 286.8695 | Schwarz criterion | 6.637259 | |
| Log likelihood | -52.74842 | Hannan-Quinn criter. | 6.348195 | |
| F-statistic | 11.13342 | Durbin-Watson stat | 2.865230 | |
| Prob(F-statistic) | 0.000264 | |||
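The mean of the dependent variable differs from the first table because the lagged terms cost observations. A sketch of why, assuming the lag is taken by calendar year, so that a season whose previous year is missing from the sample is dropped: 1970 and 1982 then have no usable lag, leaving 19 observations. That count is an inference, not something the output states directly, but it reproduces the reported F-statistic. The helper name is hypothetical.

```python
def usable_years(years, lag=1):
    """Years whose value `lag` calendar years earlier is also in the sample."""
    present = set(years)
    return [y for y in years if y - lag in present]

years = [y for y in range(1970, 1992) if y != 1981]   # 21 seasons, strike year excluded
with_lag = usable_years(years)                         # 1970 and 1982 have no prior season in the sample
n, k = len(with_lag), 6                                # 6 slope regressors in the lagged model

r2 = 0.847717                                          # R-squared from the table above
f = (r2 / k) / ((1 - r2) / (n - k - 1))
print(len(years), n, round(f, 3))
```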
This is a better model because the R-squared is higher, the Akaike info criterion and the Schwarz criterion are lower, and the prob. values and standard errors are lower. However, BA is still highly insignificant. Removing BA and its lag gives a better model:
| Variable | Coefficient | Std. Error | t-Statistic | Prob. |
| C | -1734.346 | 530.3656 | -3.270096 | 0.0056 |
| WALKS | 0.094522 | 0.015640 | 6.043573 | 0.0000 |
| FIELDING | 1773.763 | 541.3628 | 3.276477 | 0.0055 |
| ERA | 3.721749 | 1.522243 | 2.444911 | 0.0283 |
| ERA(-1) | 4.375090 | 1.550882 | 2.821034 | 0.0136 |
| R-squared | 0.806609 | Mean dependent var | 87.89474 | |
| Adjusted R-squared | 0.751355 | S.D. dependent var | 10.23010 | |
| S.E. of regression | 5.101170 | Akaike info criterion | 6.317751 | |
| Sum squared resid | 364.3070 | Schwarz criterion | 6.566287 | |
| Log likelihood | -55.01863 | Hannan-Quinn criter. | 6.359813 | |
| F-statistic | 14.59809 | Durbin-Watson stat | 2.226007 | |
| Prob(F-statistic) | 0.000067 | |||
Though the Akaike info criterion goes up slightly, the Schwarz criterion decreases, all of the remaining variables are significant at the 5% level or better, and the coefficient standard errors go down.
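The two criteria can be recomputed from the log likelihoods, which makes the trade-off explicit. The sketch assumes the per-observation form AIC = (-2·logL + 2p)/n and SIC = (-2·logL + p·ln n)/n, with p counting the constant, and n = 19 observations (an assumption inferred from the degrees of freedom implied by the reported F-statistics). Dropping the two BA terms lowers the log likelihood, but the Schwarz criterion charges ln(19) ≈ 2.94 per parameter rather than 2, so it favors the smaller model while the Akaike criterion does not.

```python
import math

def aic(log_lik, n, p):
    """Akaike criterion per observation; p counts all estimated coefficients."""
    return (-2 * log_lik + 2 * p) / n

def sic(log_lik, n, p):
    """Schwarz criterion per observation; p counts all estimated coefficients."""
    return (-2 * log_lik + p * math.log(n)) / n

n = 19  # assumed sample size after lagging
models = {
    "with BA terms": (-52.74842, 7),   # (log likelihood, number of coefficients)
    "BA removed": (-55.01863, 5),
}
results = {name: (aic(ll, n, p), sic(ll, n, p)) for name, (ll, p) in models.items()}
for name, (a, s) in results.items():
    print(f"{name}: AIC = {a:.6f}, SIC = {s:.6f}")
```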
To validate the model, a serial correlation test was conducted to see if more lags were needed.
| Breusch-Godfrey Serial Correlation LM Test: |
| F-statistic | 0.406332 | Prob. F(2,12) | 0.6749 |
| Obs*R-squared | 1.205107 | Prob. Chi-Square(2) | 0.5474 |
The test is not significant, so there is no evidence of remaining serial correlation and no additional lags are needed.
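The reported probabilities can be reproduced by hand, because both reference distributions have closed-form tail probabilities when the chi-square (or numerator) degrees of freedom equal 2. The sketch below assumes n = 19 observations in the auxiliary regression, inferred from the F(2,12) degrees of freedom together with the five coefficients in the model and the two residual lags. The function names are illustrative.

```python
import math

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)

def f_sf_2_num_df(f, den_df):
    """Upper-tail probability of an F(2, den_df) variable."""
    return (1 + 2 * f / den_df) ** (-den_df / 2)

n, lags, den_df = 19, 2, 12          # n is an assumption; lags and den_df come from the output
obs_r2 = 1.205107                    # Obs*R-squared from the table above
aux_r2 = obs_r2 / n                  # R-squared of the auxiliary regression
f_stat = (aux_r2 / lags) / ((1 - aux_r2) / den_df)

p_chi2 = chi2_sf_2df(obs_r2)
p_f = f_sf_2_num_df(f_stat, den_df)
print(round(f_stat, 6), round(p_f, 4), round(p_chi2, 4))
```

The recomputed F-statistic and both probabilities agree with the table to about four decimal places.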
In conclusion, walks and ERA, as well as the team’s previous year’s ERA, have small coefficients. As a side note, when entering a new year, opposing batters have psychological expectations of the pitchers they face, which justifies the lag. Fielding, however, has by far the largest coefficient at 1773.763, suggesting it is the most determining factor in a team’s wins. The model is credible because outside information about these variables could back this claim.
Note: The study could be expanded to include other variables.


