Applications of Generative AI in Sport: Q2 Update, Part I

We are back with the latest instalment of our Latest Trends in AI in Sport series, written by our Chief Scientist Patrick Lucey. In part one, he considers the latest game-changing insights available from player tracking data, made possible by a combination of computer vision and generative AI.

The pace of innovation in the field of AI shows no signs of abating – first with OpenAI’s release of GPT-4o and then Google’s annual I/O conference this week. Two things leapt out at me from a sports perspective:

The CTO of OpenAI mentioning that a next step for GPT-4o could be to “watch” a live sports game and “explain the rules to you”, and
What Google’s AI-powered search – specifically “visual search” – can unlock.

This got me thinking: what does it mean to “watch” and “search” a game of sport like soccer? When watching a game of soccer, is it enough to highlight the teams on the pitch, look up the rules on Wikipedia and provide a summary? That might suffice for a child or someone who hasn’t seen the game before.

But most fans around the world are truly engaged in the sport and want more granular information. They ask questions such as: Did the player make the right pass? Are the defenders in the right position? Is the team getting tired? How successful is the team when it runs this specific play?

The promise of AI Agents is not just to watch a game like a novice, but to watch a game like an expert. But to understand the game like an expert, the AI system needs to be trained on the specific language of sport, which is based on the data we collect every day (both event data and tracking data).

Tracking data (i.e., the visual “x’s and o’s” of players’ movements), especially when combined with event data (i.e., the events that occurred and who was involved in them), unlocks the ability for an AI system to “watch” a sports game like an expert and analyse plays in detail, generating specific, valuable insights for coaches and fans. It also enables us to do visual searches of live sports action, opening up further analytical and predictive applications.

In the next article, we will do a deep dive on how we can do this. But first it is necessary to understand how the critical input, player tracking data at scale, is actually collected. That is the focus of this article.

Before going into that detail, let’s first look at what computer vision tracking data was, what it’s becoming, and how it is being applied to help teams and athletes reach the highest performance levels.

Player (and ball) tracking using computer vision (CV) – a quick initial history

It’s a little-known fact that the integration of computer vision (CV) systems in sports represents one of the earliest successful commercial deployments of the technology in any field. Proof, if it were needed, of how much sports fans and coaches want to know about the game!

The use of CV tracking in sports dates back to 1996, when an infra-red system was used to track the puck in real time in NHL games, otherwise known as the “glow-puck” (around the same time, virtual advertisements were placed on baseball broadcasts). The “yellow” first and ten line in American Football followed shortly after in 1997, and then the “World Record Line” in Olympic sports like swimming and sprinting was launched for the 2000 Sydney Olympic Games. The first ball tracking technology was developed in 2000 by Hawk-Eye and used in the broadcast of a cricket match in 2001.

The first player tracking system used in the English Premier League dates back to 1998. That system utilized a multi-camera set-up to capture the video of the game from all angles, and then relied on humans to manually annotate the location of the players.

A decade later, fully automated camera-based CV systems for player tracking were deployed. Shortly after, systems that generated broadcasts automatically for lower-tier sporting competitions emerged. A lot of the sports highlight clips that you might enjoy online have been automated for well over a decade as well, but these methods tend not to use player tracking data. They mostly rely on a mix of human-collected event data, audio (e.g., loud crowd noise), and CV-based detection of scene changes (e.g., zoom in on a player, then the crowd, then the coach, then a close-up of the player again, then back to the main camera view).

Wearables like GPS and RFID also emerged in the early 2000s. Many fans might assume these are the primary sources of tracking data in live soccer. In fact, CV remains the preferred method for collecting player tracking data within an elite live soccer match due to its unobtrusiveness and scalability.

How do computer vision (CV) systems work?

First, let’s define computer vision (CV) and its place in AI.

CV is the science of enabling computers to comprehend digital images and/or videos. Therefore, when we refer to a CV system, we are essentially discussing an AI system.

To employ a CV system for collecting tracking data from an elite-level sporting event, such as a soccer match, the process traditionally began with a high-definition video capture system.

This system comprises cameras strategically positioned around the venue, essentially serving as the “eyes” to capture the on-field action.

These high-definition cameras may be set up from a single viewpoint (to minimize the hardware footprint and for ease of set-up/tear-down), or distributed across various locations around the pitch.

Once we have the video capture system set up, these “eyes” transmit the visual data to a computer, which then transforms the raw visual information into a format the computer can work with. This format may take the form of 2D “dots” or 3D “skeletons”.

The steps involved in this transformation include:

Player and ball detection: This involves identifying the location of players and the ball in each image. For player detection, depending on the granularity of measurement required and the pixel density of the input image, this can be achieved by detecting bounding boxes around the player in the image or by detecting the skeleton or silhouette of each player. For ball detection, a bounding box is normally utilized.
Team and player identity: Following the detection stage, the next step is to identify the team to which each player belongs (usually based on the color of their jersey) and the player’s identity (typically determined by identifying the player’s jersey number). When a player is occluded (i.e., is not visible) for a period of time, this task is often referred to as “re-identification”.
Camera calibration: This step involves detecting the lines and corners on the pitch, which are then used to map the player and ball positions to real-world coordinates.
Tracking: Finally, the detections are associated with a single identity over the course of the match. This can be done both in the image plane (i.e., the pixels we see) and the pitch plane (i.e., the top-down view of the pitch). Normally in sports, the approach of “tracking by detection” is utilized, but often we get missed or false detections, hence the need for a tracker. Because there are many players on the field, we call this “multi-object tracking”.
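To make the final step more concrete, here is a minimal sketch of “tracking by detection” in the pitch plane. It is an illustration under stated assumptions, not the method used in any production system: the function name `associate`, the 2-metre gating distance and the greedy nearest-neighbour matching are all hypothetical choices made for brevity. Real trackers typically add motion models and appearance features.

```python
import math

def associate(tracks, detections, max_dist=2.0):
    """Greedily match current-frame detections to existing tracks.

    tracks: dict of track_id -> (x, y), last known pitch position in metres
    detections: list of (x, y) pitch positions detected in the current frame
    max_dist: hypothetical gate (metres) beyond which a match is rejected
    Returns (updated_tracks, indices_of_unmatched_detections).
    """
    # Collect every track/detection pair that falls inside the gate.
    pairs = []
    for tid, (tx, ty) in tracks.items():
        for di, (dx, dy) in enumerate(detections):
            dist = math.hypot(tx - dx, ty - dy)
            if dist <= max_dist:
                pairs.append((dist, tid, di))
    pairs.sort()  # closest pairs are matched first

    updated = dict(tracks)  # a track with no match keeps its last position
    used_tracks, used_dets = set(), set()
    for dist, tid, di in pairs:
        if tid in used_tracks or di in used_dets:
            continue
        updated[tid] = detections[di]
        used_tracks.add(tid)
        used_dets.add(di)

    # Leftover detections are either false positives or players entering view.
    unmatched = [i for i in range(len(detections)) if i not in used_dets]
    return updated, unmatched

# Hypothetical example: two known players and three detections.
tracks = {"home_7": (10.0, 20.0), "away_4": (30.0, 40.0)}
detections = [(30.5, 40.2), (10.3, 19.8), (80.0, 5.0)]
updated, unmatched = associate(tracks, detections)
print(updated, unmatched)
```

In this sketch a missed detection simply leaves the track at its previous position, and a detection far from every track is reported as unmatched, which is where a re-identification step would take over.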

Deep learning methods are normally employed for each of these steps. For example, Convolutional Neural Networks (CNNs) are normally used to detect the player and ball, and they also provide the input representation for team and player identification. Segmentation models are often used in conjunction with line/corner detectors for calibration. Training these models requires an enormous number of examples: raw images with associated bounding boxes (or skeletons), team and player IDs, and edge/corner locations. In some situations, the scoreboard also needs to be read automatically via optical character recognition (OCR). An example of all these steps is illustrated below.

Later in the article, we will connect how these deep learning methods are related to the trend of utilizing GenAI methods – but at a high-level, you could think of the process here as creating the visual language of sport (i.e., the x’s and o’s) – which lends itself to downstream language modelling.

Why and when do CV systems use either “dots” or “skeletons” to detect & track players?

It’s helpful to conceptualize a CV system as a sensing or measuring tool. The precision required for the measurement—whether in millimeters or centimeters—determines the type of tracking output needed. These can be categorized into:

Fine-Grained Measurements (millimeter accurate): This encompasses officiating tasks (e.g., semi-automatic offside detection in soccer, pitcher analysis in baseball and officiating in basketball) and broadcast graphics (e.g., segmentation for photorealistic avatar generation of athletes, and augmented broadcasts).
Coarse-Grained Measurements (centimeter accurate): These relate to fitness measures of players during a match (e.g., how far they ran, how many high-intensity sprints they made) as well as tactical measurements (e.g., which formation a team played, how well a player executed a pass, or, in basketball, whether a team used a pick-and-roll).
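As a rough illustration of the coarse-grained fitness measures mentioned above, the sketch below derives distance covered and a sprint count from one player's 2D positions. The function name `fitness_summary`, the frame rate and the 7 m/s sprint threshold are assumptions chosen for the example, not values taken from any specific tracking product. Real data would also need smoothing before speeds are computed.

```python
import math

def fitness_summary(positions, fps=25.0, sprint_speed=7.0):
    """Return (distance_in_metres, number_of_sprint_bouts) for one player.

    positions: list of (x, y) pitch coordinates in metres, one per frame
    fps: frames per second of the tracking data (assumed constant)
    sprint_speed: hypothetical high-intensity threshold in metres per second
    """
    total = 0.0
    sprints = 0
    in_sprint = False
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        step = math.hypot(x1 - x0, y1 - y0)  # metres moved in one frame
        total += step
        fast = step * fps >= sprint_speed    # instantaneous speed check
        if fast and not in_sprint:
            sprints += 1                     # a new sprint bout begins
        in_sprint = fast
    return total, sprints

# Hypothetical track at 10 fps: jog, one burst at about 8 m/s, jog again.
xs = [0.0]
for step in [0.1] * 5 + [0.8] * 5 + [0.1] * 5:
    xs.append(xs[-1] + step)
track = [(x, 0.0) for x in xs]
distance, sprints = fitness_summary(track, fps=10.0)
print(round(distance, 2), sprints)
```

Counting bouts rather than fast frames means a single continuous burst is reported once, however many frames it spans.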

For fine-grained measurements such as semi-automatic offside detection and photorealistic avatars, skeletal tracking is necessary as it provides detailed 3D information for these use cases.

On the other hand, bounding box detection is sufficient for coarse-grained measurements, enabling estimation of a player’s “center-of-mass,” resulting in 2D “dots”. An example showing the difference between center-of-mass tracking (top) and body-pose tracking (bottom) is given below, taken from a paper we wrote on the topic.
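The step from a bounding box to a 2D “dot” can be sketched as follows. This is a simplified illustration that assumes the calibration step has already produced a 3x3 homography `H` from image pixels to pitch metres. The function name `bbox_to_pitch`, the example matrix and the use of the bottom-centre of the box as a stand-in for the ground projection of the center of mass are all assumptions made for this example.

```python
def bbox_to_pitch(bbox, H):
    """Map a player's bounding box to a 2D position on the pitch.

    bbox: (x_min, y_min, x_max, y_max) in image pixels, y increasing downwards
    H: 3x3 homography as nested lists, mapping image pixels to pitch metres
       (assumed to be supplied by the camera calibration step)
    """
    x_min, y_min, x_max, y_max = bbox
    # The homography is only valid for points on the pitch plane, so use the
    # bottom-centre of the box, roughly where the player meets the ground.
    u = (x_min + x_max) / 2.0
    v = y_max
    xw = H[0][0] * u + H[0][1] * v + H[0][2]
    yw = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return xw / w, yw / w  # divide by w to undo the projective scale

# Hypothetical homography with a mild perspective term in the third row.
H = [
    [0.1, 0.0, 0.0],
    [0.0, 0.2, 0.0],
    [0.0, 0.001, 1.0],
]
x, y = bbox_to_pitch((900, 400, 940, 500), H)
print(round(x, 2), round(y, 2))
```

Each player then reduces to a single pair of pitch coordinates per frame, which is the “dot” representation described above.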

How is the raw visual information separated into useful and non-useful data?

Historically, tracking data has meant the 2D dots representing players moving all over the field or court. People often think of this type of tracking data as “big” data. However, it’s the opposite: the tracking system acts as a compression tool, extracting only essential information from the raw video pixels, such as player and ball positions and motions, while discarding extraneous details like grass, crowds, and advertisements.

This compression ratio can be as high as 1,000,000:1. Therefore, tracking data in sports can be likened to the ultimate video compression algorithm or a sports-specific codec, enabling various downstream applications.
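A back-of-the-envelope calculation shows how a ratio of this order can arise. Every number below is an assumption chosen for illustration (uncompressed video, six 4K cameras at 50 frames per second, 23 tracked objects stored as two 4-byte coordinates at 25 frames per second), so the result is indicative only. Different set-ups will give very different ratios.

```python
def raw_video_bytes_per_second(width, height, fps, cameras=1, bytes_per_pixel=3):
    """Data rate of uncompressed video across all cameras."""
    return width * height * bytes_per_pixel * fps * cameras

def tracking_bytes_per_second(objects, fps, coords=2, bytes_per_value=4):
    """Data rate of 2D 'dots': one (x, y) pair per object per frame."""
    return objects * coords * bytes_per_value * fps

# Assumed multi-camera in-venue set-up (hypothetical figures).
raw = raw_video_bytes_per_second(3840, 2160, fps=50, cameras=6)
# 22 players plus the ball, stored as 2D positions at 25 frames per second.
dots = tracking_bytes_per_second(objects=23, fps=25)
ratio = raw / dots
print(f"raw: {raw} B/s, dots: {dots} B/s, ratio: about {ratio:,.0f}:1")

# A single uncompressed 1080p camera at 25 fps gives a much smaller ratio.
single = raw_video_bytes_per_second(1920, 1080, fps=25) / dots
print(f"single HD camera ratio: about {single:,.0f}:1")
```

Under these assumptions the multi-camera case lands above one million to one, while the single HD camera case is in the tens of thousands, which is why the figure is best read as an upper bound.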

From these measurements, tracking data can be used in numerous additional ways, and it becomes far more useful when combined with event data, which shows not just where a player is but what they’re doing. Applications include interactive search, simulation, strategy analysis, and mixed reality. While future articles will delve deeper into these applications, our focus here is on the underlying computer vision technology.
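To give a feel for what combining tracking and event data involves, here is a minimal sketch that attaches the nearest tracking frame to each event by timestamp. It assumes both data sources share a synchronised clock and that frame timestamps are sorted, which is often not true in practice. The function names and the dictionary layout are hypothetical.

```python
import bisect

def nearest_frame(frame_times, t):
    """Index of the tracking frame whose timestamp (seconds) is closest to t."""
    i = bisect.bisect_left(frame_times, t)
    if i == 0:
        return 0
    if i == len(frame_times):
        return len(frame_times) - 1
    # Pick whichever neighbouring frame is nearer in time.
    return i if frame_times[i] - t < t - frame_times[i - 1] else i - 1

def attach_positions(events, frame_times, frames):
    """Return a copy of each event with the matching frame's positions added."""
    merged = []
    for ev in events:
        idx = nearest_frame(frame_times, ev["t"])
        merged.append({**ev, "frame": idx, "positions": frames[idx]})
    return merged

# Hypothetical data: four tracking frames at 25 fps and one pass event.
frame_times = [0.00, 0.04, 0.08, 0.12]
frames = [{"home_7": (10.0 + i, 20.0)} for i in range(4)]
events = [{"t": 0.05, "type": "pass", "player": "home_7"}]
merged = attach_positions(events, frame_times, frames)
print(merged[0]["frame"], merged[0]["positions"])
```

Once each event carries the positions of the players at that moment, questions such as where the defenders were when a pass was made can be answered directly.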

If computer vision tracking has been around so long, why isn’t it already used everywhere?

Some top-tier sports leagues employ in-venue computer vision tracking hardware and systems, utilizing multiple specialist fixed cameras installed around the venue, like Stats Perform’s SportVU.

These systems generally provide coarse-grained positional and movement data outputs. Even these outputs only provide part of the picture and still need to be merged with “event data”, as discussed above. Further, access is restricted to the team owning the venue, or is shared between teams in that specific league for tactical analysis. Very rarely is the data shared outside that league. Insights derived are sometimes also seen in analysis on TV.

The hardware cost, the complex process of merging tracking and event data, and the analyst resource required to extract actionable insights from camera tracking data together mean that the application of fixed CV camera systems is very limited outside the major leagues.

It also means that even whilst big teams/leagues may have been able to access tracking data within their own league, they still have material blind spots. They can’t get access to such data from other leagues and competitions. This creates huge constraints when scouting players to recruit from these leagues, when preparing to play teams from other leagues in cup competitions, or to play against new players or coaches from other leagues.

Single-competition access also limits the amount of tracking data that team analysts have available to develop and train models that predict playing styles and patterns or simulate different tactics. That means those predictions and simulations are limited in their scale and value.

For “officiating”, which requires millimeter precision, an even more significant amount of hardware is required within the venue, such as high-resolution cameras. This not only incurs substantial incremental costs but also presents operational challenges, as access to the venue and reliable heavy-duty internet connections are essential, which may not be available in all venues.

Even with extensive hardware installations in arenas, sometimes additional measures are necessary. For instance, during the 2022 FIFA World Cup, semi-automatic offside detection technology supplemented computer vision-based player tracking data by incorporating RFID chips in the ball. Similarly, in sports like cricket, drone footage complements existing systems to capture fielding positions, while the NFL and NHL require players to wear RFID chips, further expanding the hardware footprint.

The good news is that for coarse-grained measurements such as fitness tracking and tactical insights, extensive hardware infrastructure is no longer a prerequisite. Using generative AI and deep data, a scalable solution comprising both tracking and event data can be achieved from widely available remote video without additional hardware, enabling backward compatibility, enormous coverage and cost-effectiveness.

Going beyond hardware systems for coarse-grained insights, using remote video

As humans, we can understand what is occurring in a game via remote video (i.e., the video consumed outside the stadium), so it seems logical to extend a CV system to do likewise.

The potential of this is massive, especially for global sports fed by multiple elite competitions. Tracking data can be captured for the thousands of global professional men’s and women’s soccer teams, as well as the 350+ Division I schools in basketball and myriad international basketball leagues.

It even means we can go back in time and collect tracking data from historical footage, including from venues that didn’t have CV cameras installed.

Our specialist AI team at Stats Perform has pioneered the development of remote-tracking technology over the last 8+ years, just as we pioneered the in-venue collection of player and ball tracking data via SportVU.

Our remote-tracking journey actually began in basketball with our patented AutoStats system, which launched in 2019. The key challenges in capturing tracking data from basketball remote video are calibrating a moving camera and re-identifying players who move in and out of view.

AutoStats basketball outputs are now used by teams like the Orlando Magic for draft prospect analysis and tactics, and they power new storytelling angles in the media and on TV, such as at the 2023 FIBA Basketball World Cup.

Alongside AutoStats, we have been focusing on soccer with our Opta Vision product. The ambition with Opta Vision was similar: generate “complete tracking data” for every soccer game that is comparable to in-venue tracking, then combine it with event data so it’s even more valuable to analysts.

In part two of this update, Patrick will expand on how Generative AI is being applied to ‘impute’ the field location of all soccer players, outside of camera shot, during a match to provide analysts with complete, uninterrupted, tracking data for every player from the first whistle until full-time.