The Limitations of Everyone's Favorite Metric: Stuff+
Spring has sprung, and with it we have spring training games being played and people talking about all the changes that pitchers have made over the course of an offseason. Pretty much every pitcher is going to show up with a new pitch, a new toy tasked with the job of getting hitters out at a higher clip.
In the professional game, every pitcher knows where their weaknesses and deficiencies lie. They know what they need to add or subtract in order to be better, and in the rare case they don't, they will go searching until they come up with something. The margin between success and failure in professional sports is brutally thin, and those who fall on the wrong side miss out on the exponential payout that comes with being an elite professional athlete.
These pitchers aren't the only ones who get excited about their improvements. Fans, analysts and anyone who is too deep into the baseball world love to see what players (especially the ones on the fringe) look like coming into camp. While lots of people think that Spring Training is a nice and easy warm-up for the season, please note that the vast majority of players go in fighting for their lives. I know in my experience it was pretty much my one chance to have a lot of eyes on me, since most of the year I was low on the priority list for directors/decision makers.
These players show up with a new toy and the chance at a new first impression, something that doesn't come around very often. One of the ways these new pitches or arsenals are evaluated is “Stuff+,” a metric that rates a pitch based on similar pitches that have been thrown and their outcomes. These ratings come from two sources: internal models that teams use to evaluate players and gain an edge on the competition, and public models built by analysts all over social media who want to show their ability to analyze and model pitching.
A pitch with league-average outcomes is rated 100, so a pitch rated 110 is projected to perform 10% better than league average (10 points above 100). On the flip side, a pitch rated 90 stuff+ is projected to perform 10% worse than league average (10 points below 100).
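To make the scale concrete, here is a minimal Python sketch of the reading described above, where every point away from 100 is treated as one percent better or worse than league average. The function name is made up for illustration, and real models may scale their ratings differently.

```python
def percent_vs_league_average(stuff_plus: float) -> float:
    """Return how far a rating sits from league average, in percent.

    Assumes the simple reading used in this article: 100 is league
    average and each point is worth one percent.
    """
    return stuff_plus - 100.0


for rating in (110, 100, 90):
    diff = percent_vs_league_average(rating)
    print(f"{rating} stuff+ -> {diff:+.0f}% vs. league average")
```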
This year in particular, there have been about 1 million people creating stuff+ models, most of which are just blobs of code that were vibe coded out of ChatGPT. These models attempt to explain and project performance at the major league level, but oftentimes can't do so well enough to be worth looking at.
The first issue: John Creel is Better Than Shohei Ohtani
If you finish reading this article and want to believe that I am better than Shohei Ohtani, please do so. I will allow you to go anywhere and tell anyone how great I am. In fact, I support you and appreciate your service.
One of the more popular (and respected) analysts on Twitter runs a page named TJStats. I don’t have anything negative to say about him and his work, but we should talk about the limitations of an imperfect metric.
Attached are two pictures from TJ’s page:
Go ahead and take these numbers and run with them. Post them anywhere and everywhere you want. However, I hope I don't have to try too hard to convince you that I am not, in fact, better than Shohei Ohtani.
So let's talk about where these numbers start to dip in value:
Games Will NEVER Be Zero Stress
Stuff+ attempts to predict how a pitch would perform in Major League Baseball. It takes the traits of a pitch and compares them to the outcomes of similar pitches that have already been thrown in Major League Baseball.
The sweeper John Creel throws in a zero-pressure bullpen will probably be a little bit better than the one he throws when he is behind 2-0 to the hitter leading the Atlantic League in home runs. The cutter Creel throws when he is trying to get ahead is not thrown as aggressively as the cutter he throws when he is up 1-2.
Pitch usage matters, and pitches thrown when you have actual skin in the game mean way more than the ones thrown in a bullpen where your only incentive is to throw the nastiest pitches possible. Stuff+ does not take into account a lot of the things that come up over the course of a season. Sample size, weather conditions, travel schedule and crowd effects all go into performance.
Pitch Spreads Matter
A sweeper thrown at -2hb will play way better when it's paired with a 2 seam fb thrown at 20hb than with a 4 seam fastball that has 4hb (hb being horizontal break, in inches). How much separation a pitcher creates within his arsenal matters. However, a metric like stuff+ typically just compares a pitch to others of its type, a cutter to other cutters. Some really advanced models can do a better job, but most do not.
The other pitches you throw will always impact how a pitch is treated by a hitter. If a hitter is sitting on a fastball that runs nearly two feet towards him, anything that goes the other direction is going to perform much better than expected.
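The separation idea can be put into rough numbers. The sketch below takes the sweeper and fastball figures from the example above, treats hb as inches of horizontal break, and measures the gap between the two pitches. The function is a toy for illustration and is not part of any stuff+ model I know of.

```python
def horizontal_separation(pitch_hb: float, fastball_hb: float) -> float:
    """Inches of horizontal break between a pitch and the fastball it is paired with."""
    return abs(fastball_hb - pitch_hb)


sweeper_hb = -2.0  # sweeper from the example above

# Same sweeper, two different fastballs to pair it with.
with_two_seam = horizontal_separation(sweeper_hb, fastball_hb=20.0)
with_four_seam = horizontal_separation(sweeper_hb, fastball_hb=4.0)

print(f"Sweeper + 2 seam: {with_two_seam:.0f} inches of separation")
print(f"Sweeper + 4 seam: {with_four_seam:.0f} inches of separation")
```

The sweeper itself never changes, but the gap a hitter has to cover does. A model that grades the sweeper in isolation would give it the same rating in both arsenals.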
Usage Matters
No matter how good a pitch is, the more a hitter sees it the less effective it will be. Your 125 stuff+ rated slider thrown 7 times in a row won’t be as good the 7th time as it was the first.
Not All Ratings Are Created Equal
This one gets a little more in depth, but essentially the scores created within stuff+ models are based on prior occurrences of a pitch shape. If a particular pitch shape is rare, its rating can be carried by the fact that the few pitchers who do throw it happen to throw it exceedingly well. The rating could also be bogged down if the pitcher relies heavily on it, or if the rest of his arsenal is not good and hitters are able to sit on this particular pitch.
In addition, sometimes outlier shapes (that aren't good, but simply aren't thrown that much) will receive a major boost to their ratings because the sample size is small and the outcomes are skewed in one direction. A pitch that has been thrown with success by 2 or 3 pitchers at the major league level is not as good as the pitch that is thrown by 100 major league pitchers.
For example, the sweeper was an absurdly effective pitch in 2021. However, its effectiveness has dropped significantly as hitters have seen more and more of it. In 2021 a 77 mph sweeper with -20hb rated at about 115, roughly the same as an 87 mph gyro slider at the time. Today this same sweeper would rate significantly lower on an updated model, likely under 95 stuff+. However, the gyro slider (one that hundreds of major league pitchers throw every year) is still an above average offering, probably just as good as it was in 2021.
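The small-sample problem in this section can be sketched with a simple shrinkage formula that pulls a rating toward league average when only a few pitchers have thrown the shape. This is an illustration of the idea, not how any particular stuff+ model works. The ratings, pitcher counts and the prior_weight of 20 are all made-up numbers.

```python
LEAGUE_AVERAGE = 100.0


def shrunk_rating(observed: float, n_pitchers: int, prior_weight: float = 20.0) -> float:
    """Blend an observed rating with league average, weighted by sample size.

    prior_weight is a hypothetical knob: how many pitchers' worth of
    league-average results get mixed in before the observed rating is trusted.
    """
    total = n_pitchers * observed + prior_weight * LEAGUE_AVERAGE
    return total / (n_pitchers + prior_weight)


# Hypothetical rare shape: looks elite, but only 3 pitchers have thrown it.
rare_shape = shrunk_rating(observed=130.0, n_pitchers=3)
# Hypothetical common shape: less flashy, but backed by 100 pitchers.
common_shape = shrunk_rating(observed=110.0, n_pitchers=100)

print(f"rare shape: {rare_shape:.1f}, common shape: {common_shape:.1f}")
```

With these made-up inputs, the rare shape ends up rated below the common one once sample size is accounted for, even though its raw rating was 20 points higher.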
Stuff+ is Still a Good Metric
Despite everything that I have said, this metric is still a net positive for the game of baseball, talent evaluation and acquisition. My main gripe is with people who hold it as gospel and don’t recognize its shortcomings. If I were evaluating a player, stuff+ would absolutely be a key component of my acquisition process. However, I would take it with a grain of salt, recognize its shortcomings and do more to look into a pitcher's effectiveness.
Proformance Consults:
Starting tomorrow, I am offering 1-hour video consults to all premium subscribers as a complimentary service.
These consults will be for you to talk with me one-on-one and have me offer feedback from my 23 years of playing the game at the amateur and professional level. We can talk player development, recruiting, coaching and anything in between.
John