by David Baranov
“Federer is too old” is a statement I’ve been hearing for my entire time as a player in the tennis community. It’s been a decade since I first heard someone say that, and he is still out there. Age seems to be a variable of great variance and interest in the present state of tennis, where two teenagers can make it to the US Open final while players like Serena Williams in her 40s remain very competitive. On the right in the picture above is Canada’s own Félix Auger-Aliassime. He is one of the youngest players on the ATP circuit and already has a 1-0 record against old Roger on the left.
This blog will focus on analyzing historical tennis data around one variable: age.
To begin, I loaded Jeff Sackmann’s data across the years. As far as I know, this data is the main source of tennis data for anyone wanting to do some analysis on this sport. Nearly all possible variables that can be recorded during a match are there. Since I want to observe historical trends, I took the data recorded for each year and combined the CSV files using the command prompt. This could be made more accessible for future use if there were an “all years” folder in the GitHub repository. I now have data for every year for ATP singles matches, player info and player rankings. Doubles matches and futures/challenger events are also accessible for additional results.
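For anyone who would rather script the merge than use the command prompt, here is a minimal pandas sketch. It is an illustration rather than the exact command used for this post, and it assumes the yearly singles files follow the `atp_matches_YYYY.csv` naming used in the repository; the folder path and the function name `load_all_years` are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_all_years(data_dir, pattern="atp_matches_[0-9][0-9][0-9][0-9].csv"):
    """Stack every yearly singles file found in data_dir into one DataFrame.

    The default pattern is an assumption about the file naming; it matches
    only names with a four-digit year, so doubles or challenger files that
    carry an extra word in their name are skipped.
    """
    files = sorted(Path(data_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {data_dir}")
    frames = [pd.read_csv(f) for f in files]
    # ignore_index gives the combined table one continuous row index
    return pd.concat(frames, ignore_index=True)


# Example call (the folder name is a placeholder):
# matches = load_all_years("tennis_atp")
```

Reading the files in sorted order keeps the combined table chronological, which is convenient for the time-based plots later on.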
First, I wanted to look at how players perform across the entire ATP circuit throughout history while considering their age. I set the age of the match winners as my indicator. Since I was looking at the long-term data, the age of match losers could have been used instead, since there is an equal number of match winners and losers due to the structure of tennis tournaments. In the short term, however, these indicators would not be precise: they would only capture a portion of some athletes’ careers, and so skew the effect of age if those athletes are captured in their introductory or retirement phase.
Figure 1 shows the 20 ages with the highest match-winner counts across the ATP circuit from 1965 onwards. Most of the missing data comes from the earliest periods of the ATP circuit. We can see that 24/25 is the most popular age. There is quite a bit of range, as those under 20 and players over 30 can still rack up some wins, but your odds are best in your mid-20s, so I still have a bit of time 🙂
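The counts behind a figure like this can be sketched in a few lines of pandas. This is only an illustration: it assumes the combined table has a `winner_age` column holding ages in decimal years, and `wins_by_age` is a hypothetical helper name.

```python
import pandas as pd


def wins_by_age(matches, top_n=20):
    """Count match wins per whole year of age and keep the top_n ages."""
    # Drop matches with no recorded age, then truncate 24.7 -> 24
    ages = matches["winner_age"].dropna().astype(int)
    # value_counts sorts from the most to the least frequent age
    return ages.value_counts().head(top_n)
```

Matches with a missing age are dropped rather than guessed, which mostly affects the earliest years of the data.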
In Figure 2, the distribution of participation by age has a normal shape and is quite symmetrical. The normal shape is not very surprising. Every athlete first improves their skills and athleticism as they go from teenager to adult, and every athlete’s performance eventually declines with age. However, the width and symmetry of this graph could indicate something about tennis if compared to other sports. My first thought was that a high-injury sport like American football could potentially have a narrower graph that skews more to the left, but as Figure 3 shows below, that is not the case.
This data shows the age distribution for the NFL Census of 2016. Although it is slightly more skewed, there seems to be a similar number of anomalies when it comes to older players around the age of 40. Basically, our Roger Federer is their Tom Brady. Tennis does have a much higher number of athletes under 20, though. Current ones of interest include Canada’s own Leylah Fernandez and Félix Auger-Aliassime.
When looking at Figure 2 and the data, I noticed that Federer is the only current player who is attempting to raise the last histogram bar on the right. A couple of players helped many years ago, but now he finds himself alone at the end of it. He is already an anomaly, and going by the historical data he really should be done soon. So perhaps the statement that this blog opened with was not based on much data. There are plenty of players who have succeeded in their mid-30s according to Figure 1, so the gossip might have begun too early.
Most people do not follow the results of every ATP tournament, let alone every match. A look at the age of grand slam winners could give a more accurate picture of the age distribution, since the older players tend to save their energy for the bigger tournaments and avoid the small ATP titles. I made a plot of tournament finalist win counts by age for each grand slam:
An interesting result was the first place for Wimbledon: ten of its winners were 22 years old, while that age is only the 6th highest in overall ATP wins. These also cannot be the same player counted repeatedly, since the tournament only occurs once a year. Perhaps a look at the histograms could help:
Due to the small sample size of grand slam titles, it would be quite difficult to see a pattern in the histograms if there was one. However, the spikes for 25 at the Australian Open, 22 at Wimbledon and 31 at the US Open are quite surprising. When looking at the mean age of winners across the tournaments, they were all 25, except for the Australian Open, which had 26.7 as its mean age. Perhaps because it is the first tournament in the grand slam season, it offers the most opportunity to the older players, as they are more rested. Further along in the season the youngsters keep up while the older players wear out and occasionally drop out due to injuries, as has happened with Federer and Nadal this year.
The US Open and Australian Open have the highest mean age of winners. Both of them have hard courts while the other 2 slams do not. Perhaps a hard court surface could benefit the older players. So I took a deeper look into the court surface variable across the ATP circuit. To begin, I plotted the density of match wins for the 3 main types of court surface below:
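The mean champion age per tournament can be sketched as follows. The column names (`tourney_level`, `round`, `tourney_name`, `winner_age`) and the codes `"G"` for grand slams and `"F"` for finals are assumptions about how the match files are laid out, and `slam_champion_ages` is a hypothetical helper, not the code behind the figures.

```python
import pandas as pd


def slam_champion_ages(matches):
    """Number of finals and mean winner age for each grand slam."""
    # Keep only grand slam finals; the winner of a final is the champion
    finals = matches[(matches["tourney_level"] == "G") & (matches["round"] == "F")]
    # One row per tournament name: how many finals, and the mean champion age
    return finals.groupby("tourney_name")["winner_age"].agg(["count", "mean"])
```

Keeping the count next to the mean is a reminder of how few finals sit behind each number, which is why a single spike at one age should not be over-read. If a tournament name is spelled differently in different years, each spelling gets its own row, so the names may need tidying first.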
Well… these look basically identical. How about a look at the counts below:
These seem very similar as well. A further look into the data leads me to conclude that wins by court surface are not significantly correlated with age. So the difference in the age of grand slam title winners comes from reasons other than court surface. A large part of the difference can be attributed to the specialized skills of the top performing players.
Rafael Nadal at Roland-Garros for example:
- his wins represent 24% of all the wins in history;
- he has won 105 matches out of 108 played (a 97.2% win rate).
Tennis is a very technical sport and clay is Nadal’s area of specialization, mainly because of where he grew up and his physical traits. This leads me to believe that the specific edge the top players have contributes much more to the age differences seen in grand slam wins than any effect of court surface on the age of ATP match winners.
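One way to put numbers on the surface comparison above, alongside the density and count plots, is a per-surface summary of winner ages. The sketch below is an illustration under assumptions: it expects `surface` and `winner_age` columns, and `winner_age_by_surface` is a hypothetical helper. It describes the data but does not run a significance test.

```python
import pandas as pd


def winner_age_by_surface(matches):
    """Summary statistics of winner age for each court surface."""
    # Rows without a surface or a winner age cannot be compared
    known = matches.dropna(subset=["surface", "winner_age"])
    return known.groupby("surface")["winner_age"].agg(["count", "mean", "median", "std"])
```

If the means, medians and spreads come out nearly equal across surfaces, that agrees with the near-identical curves in the plots.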
Even when considering player-specific skills (which should improve while under 30), and neglecting the effect of court surface, the results from this year’s US Open are still very surprising. Emma Raducanu (18 at the time) and Leylah Fernandez (19) faced each other in the final. I plotted the sum of finalist ages for the US Open below to see just how unlikely this past US Open’s final match-up was.
This graph shows the ATP circuit. A match-up with finalists this young has occurred less than 2% of the time in the ATP, and I am assuming something similar holds for the WTA. This finalist match-up is the first of its kind, and chances are we will see players this young in the final of the US Open no more than 2 times in our lifetime.
While witnessing this great match-up, I wondered if younger players are starting to overtake the older ones at a faster rate in recent years. Federer, Nadal, Djokovic, and Murray were on top for so long that perhaps my generation did not witness the natural cycle of up-and-coming tennis players and is only now witnessing it as the top 4 retreat (maybe not Djokovic). So I took a look at the historical age differences in ATP grand slam matches using the plot I created below:
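The share of finals with a combined age at or below a given threshold can be sketched like this. It assumes `tourney_name`, `round`, `winner_age` and `loser_age` columns with `"F"` marking the final, and `share_of_young_finals` is a hypothetical helper. A threshold of about 38 roughly corresponds to an 18-year-old facing a 19-year-old.

```python
import pandas as pd


def share_of_young_finals(matches, tourney_name, max_age_sum):
    """Fraction of a tournament's finals whose two ages sum to max_age_sum or less."""
    finals = matches[(matches["tourney_name"] == tourney_name) & (matches["round"] == "F")]
    # Both ages are needed to form the sum
    finals = finals.dropna(subset=["winner_age", "loser_age"])
    age_sum = finals["winner_age"] + finals["loser_age"]
    # The mean of a True/False series is the share of True values
    return (age_sum <= max_age_sum).mean()


# Example call (tournament spelling and threshold are placeholders):
# share_of_young_finals(matches, "US Open", 38)
```

The result is NaN when no finals with both ages recorded are found, which is worth checking before quoting a percentage.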
Looking at the age difference graph over time above, I was surprised by how much volatility there is. It appears that every couple of years the pendulum switches between the older players winning and the younger players winning. What is responsible for this cyclical pattern? Perhaps it is technological advances allowing the older players to stay in the game, or the younger ones to beat them. Perhaps it is just the popularity of the sport or quality of life changes around the world. In the last decade it appears that the older players have been significantly dominating, making this US Open’s results seem even more like an anomaly. Perhaps we should expect a reduction soon?
The lower density graph would indicate yes. When looking at the overall data, disregarding the time series, we have a very symmetrical, normal-shaped density curve peaking at 0 (slightly negative, actually). This indicates that over the long term there should be an equilibrium where age does not play a role. This intuitively makes sense: a player who had an age advantage and won would eventually get older, lose the age advantage and lose, making the net difference 0. In reality it is not that simple, since a player would anticipate their age disadvantage and most likely retire prior to their complete downfall. But this bottom graph does tell us that, since we are centered at 0, the positive age difference of the upper graph should not continue for long based on historical outcomes.
So perhaps this year’s surprising US Open results are the beginning of a new era in tennis, one where there is balanced competition between athletes of all ages and the age variable becomes less dominant. Personally, I think it’s awesome to see the legends still holding their own, but it’s also exciting to see the youngsters get a chance to shine. I would be very happy to see that, and I’ll be tuning in.
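Both views of the age difference, the yearly series and the overall centre, can be sketched with pandas. This is an illustration under assumptions: the difference is taken as winner age minus loser age (so positive means the older player won), `tourney_date` is assumed to be stored as YYYYMMDD, `"G"` is assumed to mark grand slam matches, and `yearly_age_gap` is a hypothetical helper.

```python
import pandas as pd


def yearly_age_gap(matches):
    """Mean of (winner age - loser age) per year for grand slam matches."""
    needed = ["tourney_date", "winner_age", "loser_age"]
    slams = matches[matches["tourney_level"] == "G"].dropna(subset=needed).copy()
    # tourney_date is assumed to look like 20210830; the first 4 digits are the year
    slams["year"] = slams["tourney_date"].astype(str).str[:4].astype(int)
    # Positive values mean the older player won the match
    slams["age_diff"] = slams["winner_age"] - slams["loser_age"]
    return slams.groupby("year")["age_diff"].mean()
```

Plotting the returned series against the year gives the time view. Averaging the per-match differences over all years at once, rather than averaging these yearly means, gives the long-run centre that the density curve describes.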
- This analysis would not be possible without Jeff Sackmann, who curated the data. It can be found on his GitHub repository.
- I also consulted the different data science posts relating to tennis on the towardsdatascience blog
- Out of all the posts, I used and edited some code that I found useful from the Extensive Analysis of Women Tennis Matches post by Ambarish Ganguly; he actually referenced his blog in the Omnibus – Women’s and Men’s Tennis Matches Analysis
- NFL data plot
- Roger and Félix