A machine learning model could better measure the performance of baseball players

In the movie “Moneyball,” a young economics grad and a cash-strapped Major League Baseball coach present a new way to assess the value of baseball players. Their innovative idea of combining statistical data with player salaries allowed the Oakland A’s to recruit quality talent overlooked by other teams, completely revitalizing the team without going over budget.

New research at Penn State College of Information Sciences and Technology could have a similar impact on the sport. The team developed a machine learning model that could better measure the short- and long-term performance of baseball players and teams, compared to existing statistical analysis methods for sports. Building on recent advances in natural language processing and computer vision, their approach would completely change and potentially improve the way a game’s state and a player’s impact on the game are measured.

According to Connor Heaton, a doctoral student at the College of IST, the existing family of methods, known as sabermetrics, relies on the number of times a player or team achieves a discrete event, such as hitting a double or a home run. However, it does not take into account the surrounding context of each action.

“Think of a scenario where a player recorded a single in his last plate appearance,” Heaton said. “He could have hit a dribbler down the third-base line, advancing a runner from first to second and beating the throw to first, or hit a ball to deep left field and reached first base comfortably but didn’t have the speed to push for a double. Describing both situations as resulting in ‘a single’ is accurate but doesn’t tell the whole story.”

Heaton’s model instead learns the meaning of in-game events based on the impact they have on the game and the context in which they occur. By viewing the game as a sequence of events, it then produces numerical representations of players’ impact on the game.

“We often talk about baseball in terms of ‘this player had two singles and a double yesterday’ or ‘he went one for four,’” Heaton said. “A lot of the ways we talk about the game are just summing up events with a summary stat. Our work tries to take a more holistic picture of the game and get a more nuanced computational description of how players impact the game.”

Heaton’s new method draws on sequential modeling techniques used in natural language processing to help computers learn the role or meaning of different words. He applied this approach to teach his model the role or significance of different events in a baseball game – for example, when a batter hits a single. He then modeled the game as a sequence of events to provide new insight into existing statistics.

“The impact of this work is the framework that is offered for what I like to call ‘interrogating the game,’” Heaton said. “We see it as a sequence within this whole computational scaffolding for modeling a game.”

The output of the model can effectively describe a player’s short-term influence on the game, or their form. Represented as 64-element vectors, an approach adapted from computer vision work, these form embeddings capture a player’s in-game influence. They can be used to depict short-term impact, such as over a span of 15 plate appearances, or averaged together to analyze longer periods, such as a player’s career. Additionally, when combined with traditional sabermetrics, form embeddings can predict the winner of a game with over 59% accuracy.
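As a rough illustration of "averaged together", the sketch below takes several 64-element vectors and computes their element-wise mean. It is not the researchers' code: the function name `average_embeddings` and the placeholder values are assumptions made for this example, and real form embeddings would come from the trained model.

```python
EMBED_DIM = 64  # vector length reported in the article


def average_embeddings(embeddings):
    """Element-wise mean of equally sized vectors (hypothetical helper)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


# Placeholder vectors standing in for two short-term windows of play.
window_a = [1.0] * EMBED_DIM
window_b = [3.0] * EMBED_DIM

# Averaging the windows gives one vector describing the longer period.
long_term = average_embeddings([window_a, window_b])
```

A plain mean weights every window equally; weighting by recency or by number of plate appearances would be a different design choice that the article does not describe.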

Heaton described how the embeddings created by both his method and the traditional sabermetric method plot the same data. When viewed over time, sabermetric-based depictions of player impact can be somewhat sporadic, changing significantly from game to game. Heaton’s method helps “smooth” the way players are portrayed over time, while allowing for fluctuation in player performance.

“Both embeddings can help differentiate good players from bad players,” Heaton said. “But ours provides a lot more nuance on exactly how good players impact the game.”

To train their model, the researchers used data previously collected from systems installed in major league stadiums that track detailed information about every pitch thrown, such as player positioning on the field, base occupancy, and pitch velocity and spin. They focused on two types of data: pitch-by-pitch data, to analyze information such as pitch type and launch angle; and season-by-season data, to investigate position-specific information such as walks and hits per inning pitched for pitchers and on-base plus slugging percentage for hitters.
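For readers unfamiliar with the two season-level statistics named above, the sketch below computes them from their standard definitions: walks and hits per inning pitched (WHIP), and on-base plus slugging (OPS). The function and parameter names are assumptions for illustration, and the sample numbers are invented rather than taken from any player or from the paper.

```python
def whip(walks, hits, innings_pitched):
    """Walks plus hits allowed, divided by innings pitched."""
    return (walks + hits) / innings_pitched


def on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base divided by the plate appearances that count toward OBP."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sac_flies
    return reached / chances


def slugging_pct(total_bases, at_bats):
    """Total bases per at-bat."""
    return total_bases / at_bats


def ops(hits, walks, hit_by_pitch, at_bats, sac_flies, total_bases):
    """On-base percentage plus slugging percentage."""
    return (on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies)
            + slugging_pct(total_bases, at_bats))


# Invented season lines, for illustration only.
pitcher_whip = whip(walks=50, hits=150, innings_pitched=200.0)
hitter_ops = ops(hits=100, walks=50, hit_by_pitch=0,
                 at_bats=450, sac_flies=0, total_bases=225)
```

Both numbers collapse a full season into a single value, which is the kind of summary statistic the researchers contrast with their sequence-based approach.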

Each pitch in the collected dataset has three identifying characteristics: the game in which it occurred, the at-bat number within the game, and the pitch number within the at-bat. Using these three pieces of information, the researchers were able to completely piece together the sequence of events that make up an MLB game.
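One simple way to picture this reconstruction is to sort the pitch records on the three identifiers and group them by game. The sketch below does that; the `Pitch` record, its field names and the sample rows are all hypothetical and do not reflect the schema of the researchers' dataset.

```python
from collections import defaultdict
from typing import NamedTuple


class Pitch(NamedTuple):
    # Hypothetical field names for the three identifiers in the article.
    game_id: str
    at_bat_number: int
    pitch_number: int
    description: str


def rebuild_games(pitches):
    """Return {game_id: [pitches in the order they were thrown]}."""
    ordered = sorted(
        pitches,
        key=lambda p: (p.game_id, p.at_bat_number, p.pitch_number),
    )
    games = defaultdict(list)
    for pitch in ordered:
        games[pitch.game_id].append(pitch)
    return dict(games)


# Invented, shuffled records from two games.
records = [
    Pitch("g1", 2, 1, "ball"),
    Pitch("g1", 1, 2, "single"),
    Pitch("g2", 1, 1, "called strike"),
    Pitch("g1", 1, 1, "foul"),
]
games = rebuild_games(records)
g1_sequence = [p.description for p in games["g1"]]
```

Because the sort key is the tuple of all three identifiers, pitches from different games never interleave, and each game's list reads as one ordered sequence of events.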

The researchers then identified 325 possible changes in game state that could occur when a pitch is thrown, such as changes in the ball-strike count and base occupancy. They combined this information with existing pitch-by-pitch data that describes the pitch and the at-bat action, then drew on player records from sabermetrics to be able to describe what happened, how it happened and who was involved in each play.

The work blends Heaton’s research focus on natural language processing with his interest in the historical statistical analysis of baseball.

“There’s this whole ecosystem built around modeling language and word sequence,” Heaton said. “It seemed there was potential for it to be adopted for modeling sequences of other things; to generalize it a bit. I started thinking about sports analysis and it seemed like there was a lot to be done to improve both our understanding of the game and how the game is computationally modeled.”

The researchers hope their work will serve as a solid starting point toward a new way of describing the impact of baseball and other sports athletes on the course of the game.

“This work has the potential to significantly advance the state of the art in sabermetrics,” said Prasenjit Mitra, professor of information science and technology and co-author of the paper. “To the best of our knowledge, ours is the first to capture and represent a nuanced state of the game and use that information as context to assess individual events that are counted by traditional statistics – for example, by automatically building a model which includes key moments and clutch events.”

Heaton and Mitra’s paper, “Using Machine Learning to Describe Player Impact on Play in MLB,” was one of seven finalists in the 2022 Research Paper Competition at the MIT Sloan Sports Analytics Conference earlier this month.

You can find more information about the competition, as well as links to the article, its open source code and its data at: https://www.sloansportsconference.com/research-paper-competition
