Tuesday, January 20, 2009

Cross-year correlations

Baseball uses many stats. Some of them predict a player's future performance well; others don't. One way to measure that predictive strength is the cross-year correlation of a stat, that is, how closely a player's value in one season tracks his value in the next. The following is an analysis of which stats have strong cross-year correlations.

Methodology. For pitchers, I merged the data on all starters who pitched more than 50 innings in both 2007 and 2008, which left 115 pitchers. I did the same for position players with at least 100 AB in both 2007 and 2008, which left 337 players. It's worth noting that I used Spearman correlations instead of the classic Pearson correlations: the variables are not normally distributed, so a non-parametric approach makes more sense.
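As a rough illustration of the methodology above, here is a minimal Python sketch of a cross-year Spearman correlation. It is not the code I used for the post: the function names, the dictionary layout and the four toy pitchers are all made up for the example, and the cutoff is applied as "at least" the minimum. Spearman is computed as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; fails if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cross_year_spearman(year_a, year_b, stat, time_key, minimum):
    """Keep players who appear in both seasons with enough playing time
    in each one, then correlate `stat` across the two seasons.

    year_a / year_b: dict mapping player name -> dict of that season's stats
    (a hypothetical layout chosen for this example).
    """
    players = [
        name for name in year_a
        if name in year_b
        and year_a[name][time_key] >= minimum
        and year_b[name][time_key] >= minimum
    ]
    x = [year_a[p][stat] for p in players]
    y = [year_b[p][stat] for p in players]
    return len(players), spearman(x, y)


# Toy data, invented for illustration only (not real player seasons).
season_2007 = {
    "A": {"IP": 180, "K9": 9.1},
    "B": {"IP": 60, "K9": 6.0},
    "C": {"IP": 40, "K9": 7.5},   # dropped: too few innings in 2007
    "D": {"IP": 200, "K9": 7.0},
}
season_2008 = {
    "A": {"IP": 190, "K9": 8.8},
    "B": {"IP": 75, "K9": 6.3},
    "C": {"IP": 120, "K9": 7.9},
    "D": {"IP": 150, "K9": 7.4},
}

n, rho = cross_year_spearman(season_2007, season_2008, "K9", "IP", 50)
print(n, rho)
```

With real data it is probably easier to let a stats package do the ranking and the correlation, for example `scipy.stats.spearmanr` in Python.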

The charts below show the Spearman correlation of every variable analysed, in descending order.

Position players (batters):


Starting pitchers:

Friday, January 9, 2009

Baseball defensive metrics analysis

Introduction: A baseball player's different skills can be measured in different ways. Hitting skills can be measured via AVG, OBP, SLG or OPS, and lately some new stats such as BABIP and LD% have been emerging. Pitching skills have traditionally been measured via ERA, W-L%, K%, etc., but newer stats like FIP and xFIP do a better job than the old ones. For fielding (defense), however, stats haven't been as successful at measuring skill.

So here I want to do a correlation analysis on some fielding stats to see how well they predict future performance. The stats I used were Fielding Percentage, Range Factor in two versions (per game and per 9 innings), and UZR in both its normal version and its per-150-games version.
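For reference, the two simpler stats in that list come straight from putouts (PO), assists (A) and errors (E). The sketch below uses the standard textbook definitions; the function names are mine, and the sample numbers in the last lines are invented. UZR is left out because it is built from play-by-play data and can't be reduced to a one-line formula.

```python
def fielding_percentage(putouts, assists, errors):
    """Share of fielding chances handled without an error: (PO + A) / (PO + A + E)."""
    return (putouts + assists) / (putouts + assists + errors)


def range_factor_per_game(putouts, assists, games):
    """Plays made per game played: (PO + A) / G."""
    return (putouts + assists) / games


def range_factor_per_nine(putouts, assists, innings):
    """Plays made per nine innings in the field: 9 * (PO + A) / Inn."""
    return 9 * (putouts + assists) / innings


# Invented example line for a fielder (not a real player).
print(fielding_percentage(200, 300, 10))
print(range_factor_per_game(200, 300, 100))
print(range_factor_per_nine(200, 300, 900))
```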

I used data from a site called FanGraphs
http://www.fangraphs.com/ on a player-by-player basis, for 2007 and 2008. I merged the two data sets by player-position, so that each player who played the same position in 2007 and 2008 has both seasons' variables as columns in a single observation (player-position). I then filtered out the players who didn't play at least 90 innings at that position.
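The merge-and-filter step described above could look something like this in Python. This is only a sketch under my own assumptions: the field names ("player", "pos", "inn", "fpct") are hypothetical rather than FanGraphs' actual column names, the rows are invented, and I assume the 90-inning cutoff has to be met in each of the two seasons.

```python
def merge_player_positions(rows_a, rows_b, suffix_a, suffix_b, min_innings=90):
    """Join two seasons on (player, position) and put both seasons' stats
    side by side in one row, keeping only player-positions with at least
    `min_innings` innings in each season."""
    key_fields = ("player", "pos")
    index_b = {(r["player"], r["pos"]): r for r in rows_b}

    merged = []
    for row_a in rows_a:
        key = (row_a["player"], row_a["pos"])
        row_b = index_b.get(key)
        if row_b is None:
            continue  # did not play this position in the other season
        if row_a["inn"] < min_innings or row_b["inn"] < min_innings:
            continue  # not enough innings at this position
        out = {"player": key[0], "pos": key[1]}
        for name, value in row_a.items():
            if name not in key_fields:
                out[name + suffix_a] = value
        for name, value in row_b.items():
            if name not in key_fields:
                out[name + suffix_b] = value
        merged.append(out)
    return merged


# Invented rows for illustration only.
fielding_2007 = [
    {"player": "X", "pos": "SS", "inn": 1200, "fpct": 0.975},
    {"player": "X", "pos": "2B", "inn": 50, "fpct": 0.990},
    {"player": "Y", "pos": "CF", "inn": 900, "fpct": 0.988},
]
fielding_2008 = [
    {"player": "X", "pos": "SS", "inn": 1100, "fpct": 0.970},
    {"player": "X", "pos": "2B", "inn": 300, "fpct": 0.985},
    {"player": "Y", "pos": "LF", "inn": 800, "fpct": 0.991},
]

for row in merge_player_positions(fielding_2007, fielding_2008, "_2007", "_2008"):
    print(row)
```

Keying on the (player, position) pair rather than on the player alone means that someone who changed positions between seasons simply drops out, instead of having stats from two different positions compared with each other.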

The correlation results follow:



What this means is that, of the stats compared, Range Factor is the best at predicting future performance. A deeper analysis by position follows too.