Baseball Stats Econometrics Tutorial

The purpose of this study is to use regression analysis on baseball statistics to determine the deciding factor in a team’s success, measured as wins in the regular season. The sample data comes from the Cincinnati Reds from 1970 to 1991. Wins is the dependent variable; fielding percentage, earned run average (ERA), batting average and walks are the independent variables tested.

To determine the effect of each variable, the best model must first be found. This means treating the data as time series, checking whether the variables are stationary (and differencing them if they are not), and testing for other effects such as lags.

*Side note: The 1981 data has been excluded, as it was an incomplete year due to a strike. This leaves 21 years’ worth of observations on 5 variables (wins plus the four regressors).
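As a quick check on the sample size, the sketch below counts the seasons from 1970 to 1991 with 1981 removed. The variable names are illustrative only and are not taken from the original workfile.

```python
# Seasons covered by the study: 1970 through 1991 inclusive.
all_seasons = list(range(1970, 1992))

# 1981 was shortened by a strike, so it is dropped from the sample.
STRIKE_SEASON = 1981
sample_seasons = [year for year in all_seasons if year != STRIKE_SEASON]

print(len(all_seasons))     # seasons before the exclusion
print(len(sample_seasons))  # seasons used in the regressions
```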

To start building the model, each variable’s order of integration needs to be determined. The regressors are defined as follows:

BA = team batting average

ERA = team earned run average

Fielding = team fielding percentage

Walks = total walks for the team

 

All the variables seem to be fairly level, with only slight variation over time. Given this, and given the kind of data being tested, it is reasonable to treat the variables as stationary. Further support comes from the unit root test of fielding:

 

Null Hypothesis: FIELDING has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on SIC, MAXLAG=4)
                                          t-Statistic   Prob.*
Augmented Dickey-Fuller test statistic    -4.424716     0.0029

The statistic is significant at the 1% level, so the null hypothesis of a unit root is rejected.
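With a lag length of 0, the test above reduces to a simple Dickey-Fuller regression of the change in the series on a constant and the previous level; the test statistic is the t-ratio on the lagged level. The sketch below shows that calculation in plain Python. The fielding values in it are invented placeholders, not the Reds' actual figures, and the resulting statistic must be compared against Dickey-Fuller critical values rather than the usual t table.

```python
import math

def dickey_fuller_t(series):
    """t-ratio on gamma in: y_t - y_(t-1) = alpha + gamma * y_(t-1) + e_t."""
    y_lag = series[:-1]
    dy = [curr - prev for prev, curr in zip(series[:-1], series[1:])]
    n = len(dy)
    mean_x = sum(y_lag) / n
    mean_y = sum(dy) / n
    sxx = sum((x - mean_x) ** 2 for x in y_lag)
    sxy = sum((x - mean_x) * (d - mean_y) for x, d in zip(y_lag, dy))
    gamma = sxy / sxx                      # slope on the lagged level
    alpha = mean_y - gamma * mean_x        # intercept
    ssr = sum((d - alpha - gamma * x) ** 2 for x, d in zip(y_lag, dy))
    se_gamma = math.sqrt(ssr / (n - 2) / sxx)
    return gamma / se_gamma

# Hypothetical fielding percentages (21 made-up seasons, for illustration only).
fielding = [0.976, 0.984, 0.982, 0.982, 0.979, 0.984, 0.984,
            0.984, 0.978, 0.980, 0.985, 0.980, 0.981, 0.977,
            0.980, 0.978, 0.979, 0.980, 0.985, 0.983, 0.979]

print(dickey_fuller_t(fielding))
```

A strongly negative statistic means the series is pulled back toward its mean each period, which is what stationarity looks like in this regression.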

ERA:

 

Null Hypothesis: ERA has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on SIC, MAXLAG=4)
                                          t-Statistic   Prob.*
Augmented Dickey-Fuller test statistic    -4.806399     0.0013

 

Again, the null hypothesis of a unit root is rejected at the 1% level.

BA:

 

Null Hypothesis: BA has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on SIC, MAXLAG=4)
                                          t-Statistic   Prob.*
Augmented Dickey-Fuller test statistic    -2.866763     0.0680

It is significant only at the 10% level, so the evidence that BA is stationary is weaker than for fielding and ERA.
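The three verdicts above follow directly from comparing each reported p-value with the conventional 1%, 5% and 10% thresholds. A minimal sketch of that comparison, using the p-values from the tables above (the function name is illustrative):

```python
# p-values reported by the ADF tests above
adf_p_values = {"FIELDING": 0.0029, "ERA": 0.0013, "BA": 0.0680}

def strictest_level(p_value, levels=(0.01, 0.05, 0.10)):
    """Return the smallest conventional level at which the null is rejected, or None."""
    for level in levels:
        if p_value < level:
            return level
    return None

for name, p_value in adf_p_values.items():
    print(name, strictest_level(p_value))
```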

While the tests may suggest that walks and wins are I(1), the graphs and the nature of the data indicate that they can be treated as I(0).

 

With this taken into consideration, the following model in levels is formed, with wins as the dependent variable:

 

Variable              Coefficient   Std. Error   t-Statistic   Prob.
C                     -567.1794     809.8831     -0.700323     0.4938
ERA                   -0.352712     2.965322     -0.118946     0.9068
BA                    451.1626      228.4446     1.974932      0.0658
FIELDING              509.0407      840.6211     0.605553      0.5533
WALKS                 0.071144      0.028535     2.493273      0.0240

R-squared             0.599605      Mean dependent var       87.28571
Adjusted R-squared    0.499506      S.D. dependent var       11.82854
S.E. of regression    8.368167      Akaike info criterion    7.291004
Sum squared resid     1120.419      Schwarz criterion        7.539699
Log likelihood        -71.55554     Hannan-Quinn criter.     7.344977
F-statistic           5.990136      Durbin-Watson stat       1.143190
Prob(F-statistic)     0.003829
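The adjusted R-squared and F-statistic in this output can be recovered from the R-squared alone, given the 21 observations and 4 regressors. The sketch below applies the standard textbook formulas; small differences in the last digits come from the rounding of the reported R-squared.

```python
def adjusted_r_squared(r2, n_obs, n_regressors):
    """Penalise R-squared for the number of regressors (constant not counted)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

def f_statistic(r2, n_obs, n_regressors):
    """Overall F test that all slope coefficients are zero."""
    return (r2 / n_regressors) / ((1 - r2) / (n_obs - n_regressors - 1))

r2, n_obs, k = 0.599605, 21, 4
print(adjusted_r_squared(r2, n_obs, k))  # close to the reported 0.499506
print(f_statistic(r2, n_obs, k))         # close to the reported 5.990136
```

The gap between 0.60 and 0.50 shows how much of the raw fit is attributable simply to having four regressors in a 21-observation sample.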

 

However, this model has some high p-values, a low R-squared and some large standard errors. To correct this, a better model should be sought, by adding lags or removing insignificant variables. Adding lags of ERA and BA gave this model:

 

Variable              Coefficient   Std. Error   t-Statistic   Prob.
C                     -2033.497     616.2475     -3.299806     0.0063
WALKS                 0.098112      0.019479     5.036708      0.0003
FIELDING              2126.658      650.9779     3.266867      0.0067
ERA                   4.648875      2.019448     2.302052      0.0400
ERA(-1)               5.544185      1.682273     3.295651      0.0064
BA                    -5.722406     174.8243     -0.032732     0.9744
BA(-1)                -213.3401     119.4420     -1.786140     0.0993

R-squared             0.847717      Mean dependent var       87.89474
Adjusted R-squared    0.771575      S.D. dependent var       10.23010
S.E. of regression    4.889355      Akaike info criterion    6.289308
Sum squared resid     286.8695      Schwarz criterion        6.637259
Log likelihood        -52.74842     Hannan-Quinn criter.     6.348195
F-statistic           11.13342      Durbin-Watson stat       2.865230
Prob(F-statistic)     0.000264
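The ERA(-1) and BA(-1) terms are the previous season's values. Because 1981 is missing, a season only has a usable lag if the calendar year before it is also in the sample, which costs two observations (1970 and 1982). The sketch below illustrates this with a hypothetical helper and made-up ERA values; it assumes lags are matched by calendar year, which is consistent with the change in the mean of the dependent variable between the two outputs.

```python
def with_lag(values_by_year):
    """Pair each season with the prior calendar year's value; skip seasons without one."""
    rows = []
    for year in sorted(values_by_year):
        if year - 1 in values_by_year:
            rows.append((year, values_by_year[year], values_by_year[year - 1]))
    return rows

# Hypothetical ERA values keyed by season; 1981 is absent, as in the study.
era_by_year = {1979: 3.58, 1980: 3.85, 1982: 3.66, 1983: 3.98}
print(with_lag(era_by_year))  # 1982 is dropped because 1981 is missing

# Count of usable seasons in the full sample once one lag is required.
sample = [year for year in range(1970, 1992) if year != 1981]
usable = [year for year in sample if year - 1 in sample]
print(len(usable))
```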

The model with lags is better than the first one: the R-squared is higher, the Akaike info criterion and the Schwarz criterion are lower, and the p-values and standard errors are lower. However, BA is still highly insignificant. Removing BA and its lag gives a better model:

 

Variable              Coefficient   Std. Error   t-Statistic   Prob.
C                     -1734.346     530.3656     -3.270096     0.0056
WALKS                 0.094522      0.015640     6.043573      0.0000
FIELDING              1773.763      541.3628     3.276477      0.0055
ERA                   3.721749      1.522243     2.444911      0.0283
ERA(-1)               4.375090      1.550882     2.821034      0.0136

R-squared             0.806609      Mean dependent var       87.89474
Adjusted R-squared    0.751355      S.D. dependent var       10.23010
S.E. of regression    5.101170      Akaike info criterion    6.317751
Sum squared resid     364.3070      Schwarz criterion        6.566287
Log likelihood        -55.01863     Hannan-Quinn criter.     6.359813
F-statistic           14.59809      Durbin-Watson stat       2.226007
Prob(F-statistic)     0.000067

Though the Akaike info criterion rises slightly, the Schwarz criterion falls, all of the remaining variables are significant at the 5% level, and the coefficient standard errors go down.
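The two criteria move in opposite directions because the Schwarz criterion penalises each extra parameter more heavily than the Akaike criterion does. The sketch below recomputes both from the reported log likelihoods, using the per-observation form of each criterion, which reproduces the values in the outputs. The sample size of 19 is an assumption: it is not printed in the outputs but is implied by the F(2,12) degrees of freedom in the serial correlation test.

```python
import math

def akaike(log_lik, n_params, n_obs):
    """Akaike info criterion, per observation."""
    return (-2 * log_lik + 2 * n_params) / n_obs

def schwarz(log_lik, n_params, n_obs):
    """Schwarz criterion, per observation; the penalty grows with log(n_obs)."""
    return (-2 * log_lik + n_params * math.log(n_obs)) / n_obs

N_OBS = 19  # assumed: 21 seasons minus the 2 lost to the lag

# (log likelihood, number of estimated coefficients including the constant)
models = {
    "with BA and BA(-1)": (-52.74842, 7),
    "without BA terms": (-55.01863, 5),
}

for name, (log_lik, n_params) in models.items():
    print(name, akaike(log_lik, n_params, N_OBS), schwarz(log_lik, n_params, N_OBS))
```

With 19 observations the Schwarz penalty per parameter is log(19), roughly 2.94, against 2 for Akaike, so dropping two weak regressors helps the Schwarz criterion more.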

To validate the model, a serial correlation test was conducted to see whether more lags were needed.

 

Breusch-Godfrey Serial Correlation LM Test:
F-statistic      0.406332    Prob. F(2,12)          0.6749
Obs*R-squared    1.205107    Prob. Chi-Square(2)    0.5474

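Neither version of the test is significant (p-values of 0.67 and 0.55), so there is no evidence of remaining serial correlation and no additional lags are needed.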
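Both p-values in the test output can be checked by hand, because the chi-square distribution with 2 degrees of freedom and the F distribution with 2 numerator degrees of freedom have simple closed-form tail probabilities. The sketch below uses those closed forms; the function names are illustrative.

```python
import math

def chi2_tail_2df(x):
    """P(X > x) for a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)

def f_tail_2_numerator_df(x, denom_df):
    """P(F > x) for an F variable with 2 numerator and denom_df denominator df."""
    return (1 + 2 * x / denom_df) ** (-denom_df / 2)

print(chi2_tail_2df(1.205107))              # compare with the reported 0.5474
print(f_tail_2_numerator_df(0.406332, 12))  # compare with the reported 0.6749
```

Both probabilities are far above 0.10, which is why the null hypothesis of no serial correlation up to two lags is not rejected.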

 

In conclusion, walks and ERA, as well as the team’s previous year’s ERA, have small coefficients. As a side note, when entering a new year, opposing batters have psychological expectations of the pitchers they face, which justifies the lag. Fielding, however, has by far the largest coefficient, 1773.763, suggesting it is the most decisive factor in a team’s wins. The model is credible because outside knowledge of the game could back this claim.
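Each coefficient is the change in wins for a one-unit change in its variable, and the variables are measured in very different units. One way to read the final model is to multiply each coefficient by a plausible change in that variable. The step sizes in the sketch below are hypothetical choices made for illustration, not values from the study.

```python
# Coefficients from the final model above
coefficients = {
    "WALKS": 0.094522,
    "FIELDING": 1773.763,
    "ERA": 3.721749,
    "ERA(-1)": 4.375090,
}

# Hypothetical changes in each variable (assumed for illustration only)
steps = {
    "WALKS": 10,        # ten more walks over the season
    "FIELDING": 0.001,  # one point of fielding percentage
    "ERA": 0.10,        # a tenth of a run
    "ERA(-1)": 0.10,
}

# Implied change in wins, holding the other variables fixed
effects = {name: coefficients[name] * steps[name] for name in coefficients}

for name, change_in_wins in effects.items():
    print(name, round(change_in_wins, 3))
```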

 

Note: The study could be expanded to include other variables.