
Pay for Play: Are Baseball Salaries Based on Performance?

Mitchell R. Watnik
University of Missouri-Rolla

Journal of Statistics Education v.6, n.2 (1998)

Copyright (c) 1998 by Mitchell R. Watnik, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.


Key Words: Exploratory data analysis; Model selection and validation; Regression; Stepwise model selection.

Abstract

Well-defined measures of performance are readily available for baseball players, making the modeling of their salaries a popular statistical exercise. In this article, the salaries for non-pitchers for the 1992 Major League Baseball season are provided, along with numerous measures of the players' previous year's performances. Also included are indicators of each player's ability to switch teams. This dataset is useful in upper-division regression analysis courses because it exhibits many "real world" difficulties that can be remedied using techniques outlined in the course.

1. Introduction

1 Linear regression is a core course in most statistics programs, and many science and social science programs also employ linear regression techniques. Frequently, textbooks provide datasets that exhibit properties discussed in the section immediately preceding the question involving the data, but datasets whose analysis requires many of the techniques covered in even an elementary course are harder to come by.

2 The dataset discussed here is different. Its analysis can result in a model that explains much of the variation in salary, but the process requires employment of many of the techniques covered in regression analysis courses. This type of dataset was used with great success as part of the 1988 ASA Data Analysis Exposition; see Denby (1988) or Hoaglin and Velleman (1995). A "good" analysis of this dataset results in a model that has many features interesting to economics students or students who are baseball fans. I used this dataset as part of a take-home final in my upper-division regression modeling class.

2. The Dataset

3 The dataset consists of information about Major League Baseball players. The response variable is their 1992 salaries (measured in thousands of dollars and obtained from the New York Times of November 19, 1992). Possible explanatory variables include various measures of the players' 1991 performance. (See the Appendix for descriptions of each variable.) These data were obtained from the Sacramento Bee of October 15, 1991. Students who are not familiar with baseball should be told that, with the exception of strike-outs and errors, all of these variables would sensibly be positively correlated with salary.

4 The last four numeric variables are dummy variables indicating "free agency eligibility," "free agent in 1991/2," "arbitration eligibility," and "arbitration in 1991/2." The special 1991/2 dummy variables are used because the players' union argued that owners colluded to keep the salary of free agents in 1991-2 lower. A list of free agents was obtained from the New York Times of November 13, 1991, and a list of players undergoing arbitration in 1992 was published in the New York Times on February 23, 1992. The reason these variables are important is that, at the time, baseball had rules stating that a player could not go to the team of his choice unless he was "free agent eligible," and he could only be eligible if he had a certain amount of experience. From an economics point of view, it seems reasonable that if a player is not able to market himself to the highest bidder, his salary will not be as high. At the time, "arbitration" was for players who did not have enough experience to be free agents, but had some experience in the league. In this case, the player and his team would go to an appointed "arbitrator" who would choose between the player's suggested salary and the team's suggested salary. Players who were neither "free agent eligible" nor "arbitration eligible" either accepted what their team was willing to pay them or did not play.

5 There are some possibly significant interactions between the four dummy variables and the quantitative variables "runs," "runs batted in," "home runs," and "batting average." Therefore, for example, an analyst might choose to multiply runs by each of the four dummy variables to get four interaction terms. These interactions could be interesting, if determined significant, because home runs and batting average are measures of individual performance, while runs and runs batted in are measures of a player's contribution to the team. If free agents increase their salaries for better individual performances, it would give one some insight into the priorities owners use to determine their salary structures. Similarly, the significance of interaction terms involving arbitration, if any, would give insight into the arbitrators' decision-making process.
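As a minimal sketch of the interaction terms described above, the products can be formed row by row. The column names below are hypothetical labels chosen for illustration (the data file has no header row), and the example player is made up.

```python
# Hypothetical labels for the four dummy variables and four quantitative
# variables named in the text; the data file itself carries no column names.
DUMMIES = ["fa_eligible", "fa_1991_2", "arb_eligible", "arb_1991_2"]
QUANTS = ["runs", "rbi", "home_runs", "batting_avg"]


def add_interactions(row):
    """Return a copy of `row` with every quantitative-by-dummy product added."""
    out = dict(row)
    for q in QUANTS:
        for d in DUMMIES:
            out[f"{q}_x_{d}"] = row[q] * row[d]
    return out


# A made-up player, for illustration only (not a row of baseball.dat).
example = {
    "runs": 80, "rbi": 70, "home_runs": 20, "batting_avg": 0.285,
    "fa_eligible": 1, "fa_1991_2": 0, "arb_eligible": 0, "arb_1991_2": 0,
}
expanded = add_interactions(example)
```

Crossing four quantitative variables with four dummies gives 16 product terms; each is zero for players outside the corresponding contract category.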

6 The last variable in each data row is the player's name, enclosed in double quotes. If your software has difficulty handling this text data, you may choose to manually delete the character information. This information was obtained from the Society for American Baseball Research (SABR) at ftp://skypoint.com/pub/members/a/ashbury/sabr/SALARIES/1992_salaries_baseball and CNN Sports Illustrated at http://www.cnnsi.com/baseball/mlb/historical_profiles/. The careful reader might observe that some players' salaries as listed on the SABR web site differ from the ones in the dataset -- especially the outlying observations pointed out in Section 3. This is caused by SABR's using salaries on Opening Day, while the salaries obtained from the New York Times are recorded as of the trading deadline on September 1, 1992.

7 Referees noted that career variables such as number of games played or number of at-bats, which I have left out, could have an impact on the model. This is, of course, true. I would welcome hearing of any such discovery. Students can obtain career data for players at the CNNSI site referred to above.

3. The Analysis

8 Once all of the explanatory variables are in place, students may start their analyses. First, they should obtain a histogram of the response values and notice that it is highly right-skewed. (This implies that a few players are making substantially more than the rest.) They may model the response using all of the independent variables and look at a QQ plot of the residuals; the residuals are also heavy-tailed. Therefore, the response should be transformed. Taking the log of the salary is an appropriate transformation here, but it is worth noting that the response is now the log of salary; the interpretations of the beta estimates will differ from those for the players' actual salaries.
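The effect of the log transformation on right skew can be checked numerically as well as graphically. The sketch below uses a simple moment-based skewness statistic and a handful of made-up salaries; neither the numbers nor the choice of statistic comes from the article.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness (population form, no small-sample correction)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Made-up salaries in thousands of dollars, NOT taken from baseball.dat.
salaries = [109, 120, 150, 200, 300, 450, 800, 1500, 3000, 6100]
log_salaries = [math.log(s) for s in salaries]

raw_skew = skewness(salaries)      # strongly positive: long right tail
log_skew = skewness(log_salaries)  # much closer to zero after the transform
```

A histogram or QQ plot remains the more informative check; the single number is only a quick summary of the same pattern.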

9 After transforming the response variable, students can start employing their stepwise or other model-building techniques. This dataset is large enough to allow splitting it in order to use part of the data to select a model and the rest to be held aside for model validation. Nonetheless, computer programs may not be able to do an exhaustive search since there are around 30 independent variables (including interactions).
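The split into a model-selection set and a validation set, and the size of an exhaustive search, might be sketched as follows. The 30% holdout fraction, the seed, and the stand-in rows are arbitrary assumptions for illustration.

```python
import random


def split_rows(rows, holdout_fraction=0.3, seed=1992):
    """Shuffle a copy of `rows` and return (selection_set, validation_set)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


rows = list(range(100))  # stand-in for player records
selection, validation = split_rows(rows)

# With about 30 candidate terms, an all-subsets search would have to
# consider 2**30 models, which is why it may be impractical.
n_candidate_terms = 30
n_subsets = 2 ** n_candidate_terms
```

Fixing the seed keeps the split reproducible, so that a model chosen on the selection set can be re-checked later against the same validation rows.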

10 Once candidate models are obtained, students should be encouraged to obtain a QQ plot of the residuals and a plot of the residuals versus the fitted values. Both indicate that there are outliers! The same three outliers were consistently identified in my students' final models: Gary Pettis, Juan Samuel, and Lance Parrish. (See the comment in Section 2 about differences between SABR and the dataset provided here.) These outliers are influential, and, as it turns out, they are "unfairly" included in the dataset because these players were actually paid much more than the dataset implies. However, because of a few obscure rules in baseball regarding the "waiving" of players, their salaries from their current teams were substantially less than they should have been. That is, these players were actually being paid a much higher salary, but by their former teams. Students will thus have to find those observations and delete them.
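A rough numerical counterpart to the residual plots is to flag observations whose scaled residual is large. The sketch below fits a one-predictor line by least squares and uses a cutoff of 2; the data are made up, and both the single predictor and the cutoff are simplifying assumptions rather than the procedure used in the course.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def flag_outliers(x, y, threshold=2.0):
    """Indices whose residual exceeds `threshold` residual standard errors."""
    intercept, slope = fit_line(x, y)
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s = math.sqrt(sum(r * r for r in resid) / (len(x) - 2))
    return [i for i, r in enumerate(resid) if abs(r / s) > threshold]


# Made-up data: hits versus log salary, with one player (index 4) whose
# recorded salary is far below what the performance would suggest.
hits = [40, 60, 80, 100, 120, 140, 160, 180, 200]
log_salary = [5.0, 5.4, 5.8, 6.2, 4.6, 7.0, 7.4, 7.8, 8.2]
flagged = flag_outliers(hits, log_salary)
```

This crude scaling ignores leverage; studentized residuals and influence measures, as covered in a regression course, would be the fuller diagnostic.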

11 After deleting the outliers, students can start to seriously think about model selection techniques. There may be two or three "good" models singled out by any criterion, and different criteria often point to different models, as is the case with this dataset. After their previous experience with the data not conforming to ideal standards, the students should know to plot the fitted values against the residuals and to obtain QQ plots for each of their models. Now, it is up to the students to choose a final model and justify it!
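To illustrate how two criteria can disagree, the sketch below compares AIC and BIC (in their Gaussian least-squares forms, up to an additive constant) on two made-up models. The article does not say which criteria were used, so the choice of AIC and BIC, the sample size, and the error sums of squares are all assumptions.

```python
import math


def aic(sse, n, k):
    """AIC for a least-squares fit with k parameters (additive constant dropped)."""
    return n * math.log(sse / n) + 2 * k


def bic(sse, n, k):
    """BIC for a least-squares fit with k parameters (additive constant dropped)."""
    return n * math.log(sse / n) + k * math.log(n)


n = 300  # made-up sample size
models = {
    "A": {"k": 6, "sse": 100.0},   # smaller model, slightly worse fit
    "B": {"k": 10, "sse": 96.0},   # larger model, slightly better fit
}

aic_pick = min(models, key=lambda m: aic(models[m]["sse"], n, models[m]["k"]))
bic_pick = min(models, key=lambda m: bic(models[m]["sse"], n, models[m]["k"]))
```

Here AIC favors the larger model and BIC the smaller one, because BIC charges log(n) per parameter rather than 2.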

12 As part of this assignment, I asked my students to interpret at least one of the parameter estimates for the quantitative variables in terms of a player's estimated salary (as opposed to log salary). Similarly, I asked them to interpret at least one of the parameter estimates for the dummy variables. In this non-standard scenario, the interpretation of the estimates is not "as X_i increases by one unit, the estimated mean increase in Y is b units, holding all other X_j variables constant," and I wanted to keep my students from getting in the habit of using that interpretation without thinking about the situation. My linear regression course had a substantial number of graduate students, and I always try to ask questions requiring interpretation of results.
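One way to express the interpretation on the salary scale: when the response is log salary, a one-unit increase in a predictor multiplies the model's salary estimate by exp(b), holding the other variables constant. The sketch below converts a coefficient into a percent change; the coefficient values are hypothetical and are not estimates from this dataset.

```python
import math


def percent_change(beta, delta=1.0):
    """Percent change in estimated salary when a predictor rises by `delta`
    under a log-salary model, holding the other predictors fixed."""
    return 100.0 * (math.exp(beta * delta) - 1.0)


# Hypothetical coefficients, for illustration only.
beta_home_runs = 0.02    # quantitative variable: one more home run
beta_free_agent = 0.50   # dummy variable: free-agent eligible versus not

hr_effect = percent_change(beta_home_runs)    # about a 2% higher estimate
fa_effect = percent_change(beta_free_agent)   # about a 65% higher estimate
```

For small coefficients the percent change is close to 100 times b, but for a large dummy-variable coefficient the two differ noticeably, which is why the exponentiation matters.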

13 It turned out that, in all of the different models my students chose as their "final" models, the estimate for the constant term was remarkably close to the natural log of the minimum salary in 1992. This led the students who made that discovery to state that at least that statistic had an intuitively sensible value in their interpretation. Furthermore, it strengthened their belief in the model building process and in their choice for the final model.

4. Final Comments

14 This paper discusses the modeling of baseball players' salaries as a function of their performance the previous year and their ability to market their skills to other teams. The process of properly analyzing this dataset requires students in a linear regression course to employ many of the tools introduced in such a course -- including diagnostics of the assumptions associated with standard linear regression and remedial measures to be taken when the assumptions are not met -- because it has a few properties not found in "textbook cases." My original analysis of this dataset led me to discover that different information criteria chose different models, and, because the selected models were not nested, standard hypothesis testing procedures did not apply. I investigated non-nested model selection tests and wrote my Ph.D. thesis on the subject.

5. Getting the Data

15 The file baseball.dat contains the raw data. The file baseball.txt is a documentation file containing a brief description of the dataset.

Acknowledgments

I wish to thank Richard Green of the UC Davis Agricultural Economics Department for encouraging me to pursue the use and publication of this study beyond his econometrics course. In addition, I offer thanks to Tom Kirchoff, the anonymous referees employed by the Journal of Statistics Education on my paper, and its section editor and editor, Robin Lock and Jackie Dietz, respectively, for their constructive criticisms and suggestions on the final draft which improved this paper. I take full responsibility for any typos in the dataset and any errors in the text of this paper which may remain.


Appendix - Key to Variables in baseball.dat

Columns
       1 -  4  Salary (in thousands of dollars)
       6 - 10  Batting average
      12 - 16  On-base percentage (OBP)
      18 - 20  Number of runs
      22 - 24  Number of hits
      26 - 27  Number of doubles
      29 - 30  Number of triples
      32 - 33  Number of home runs
      35 - 37  Number of runs batted in (RBI)
      39 - 41  Number of walks
      43 - 45  Number of strike-outs
      47 - 48  Number of stolen bases
      50 - 51  Number of errors
           53  Indicator of "free agency eligibility"
           55  Indicator of "free agent in 1991/2"
           57  Indicator of "arbitration eligibility"
           59  Indicator of "arbitration in 1991/2"
      61 - 79  Player's name (in quotation marks)
Players' batting averages are calculated as the ratio of number of hits to the number of hits plus the number of outs. On-base percentage is the ratio of number of hits plus the number of walks to the number of hits plus the number of walks plus the number of outs. Therefore, the batting average is less than or equal to the on-base percentage. A batting average above .300 is very good; OBP above .400 is excellent. An RBI is obtained when a runner scores as a direct result of a player's at-bat.

I believe that number of hits serves as a proxy for the amount of playing individuals did in the year. There is a statistic for number of games played available, but this statistic counts any entry into the game, even defensive participation for a single out, the same as participating for the entire contest.
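The column key in this Appendix can be turned into a small fixed-width reader. The sketch below is one possible way to do so: the field names are hypothetical labels, the sample record is made up and laid out to match the key, and the real file may need extra handling (for example, blank lines).

```python
# (label, first column, last column), 1-indexed and inclusive, following
# the key above. The labels are hypothetical; the file has no header.
COLUMNS = [
    ("salary", 1, 4), ("batting_avg", 6, 10), ("obp", 12, 16),
    ("runs", 18, 20), ("hits", 22, 24), ("doubles", 26, 27),
    ("triples", 29, 30), ("home_runs", 32, 33), ("rbi", 35, 37),
    ("walks", 39, 41), ("strike_outs", 43, 45), ("stolen_bases", 47, 48),
    ("errors", 50, 51), ("fa_eligible", 53, 53), ("fa_1991_2", 55, 55),
    ("arb_eligible", 57, 57), ("arb_1991_2", 59, 59), ("name", 61, 79),
]


def parse_line(line):
    """Parse one fixed-width record into a dict keyed by the labels above."""
    record = {}
    for label, first, last in COLUMNS:
        text = line[first - 1:last].strip()
        if label == "name":
            record[label] = text.strip('"')
        elif label in ("batting_avg", "obp"):
            record[label] = float(text)
        else:
            record[label] = int(text)
    return record


# Build a made-up record with each numeric field right-justified in its
# column range, so the layout matches the key exactly.
fields = ["850", "0.285", "0.350", "70", "150", "25", "3", "15",
          "65", "50", "80", "10", "5", "1", "0", "0", "0"]
widths = [last - first + 1 for _, first, last in COLUMNS[:-1]]
sample = " ".join(f.rjust(w) for f, w in zip(fields, widths)) + ' "Made-Up Player"'
record = parse_line(sample)
```

Reading the whole file would then amount to applying `parse_line` to each non-empty line of baseball.dat.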


References

Denby, L. (1988), Dataset from Poster Session sponsored by the Section on Statistical Graphics of the American Statistical Association, on Statlib, ed. Michael Myers. (http://stat.lib.cmu.edu/datasets)

Hoaglin, D., and Velleman, P. (1995), "A Critical Look at Some Analyses of Major League Baseball Salaries," The American Statistician, 49, 277-285.


Mitchell R. Watnik
Department of Mathematics and Statistics
University of Missouri-Rolla
Rolla, MO 65409-0020

mwatnik@umr.edu

