According to our models the final will be Germany-Argentina. Are our data-driven models correct ? Let’s see what happens!!! #WorldCup2014
Big Data Tales (@bigdatatales), 8 July 2014
We published this tweet three days ago, before the two semi-finals of the 2014 World Cup. Our prediction was correct: against every (Brazilian) forecast, Brazil was humiliated by Germany, while Argentina beat the Netherlands on penalties after a dull match. The final will therefore be a true football classic: Germany vs Argentina. How did we pick the two winning teams? It was not a stroke of luck. It was, more properly, a “stroke of data”.
In 1970, our parents followed the “match of the century”, Italy-Germany 4-3, on a noisy black-and-white TV tuned to the only public channel the Italian government provided at the time. After many technological improvements, in 2006 we switched to full-color LCD screens and watched Zidane’s famous headbutt in high definition. Nowadays, data and social media have further enhanced our experience, turning it from passive to interactive: we follow posts on Facebook or Twitter and publish our own, commenting on and sharing the most important and shocking events of the competition. This World Cup, in particular, will go down in history as the World Cup of Data and Statistics. TV shows, websites, blogs, data journalists and social media publish a plethora of plots, insights, predictions and statistical curiosities about players and teams every day. More and more websites dedicate posts to football stats and share every possible kind of data about matches in every league.
The channels of this DataTV are manifold. We do not need to watch a match to know what’s happening in the World Cup. Thanks to data and social media, we are in the middle of a worldwide crowd: to guess what’s going on, we can simply listen to what people are talking about. Data on visits to the Wikipedia pages of the national football teams, for example, clearly show when the matches take place and how exciting people find them. See the figure below, which shows the number of visits to the Wikipedia pages of the four semi-finalists.
Clear peaks emerge when the matches take place. The highest peaks correspond to the most interesting events, like the opening match Brazil-Croatia (12 June) and the unexpected victory of the Netherlands over the defending champions Spain (5-1, 13 June). The highest and most interesting peak (8 July) captures the “Brazilian tragedy”, the incredible and inexplicable 1-7 defeat against Germany. During and after the semi-final, many people, shocked by the Brazilian debacle, visited the page of the Brazilian national football team. Moreover, the page about the term “Maracanaço”, which had almost zero visits before the match, was suddenly visited many times on 8 July and the following days (see the figure below). Maracanaço refers to the 1950 World Cup final between Brazil and Uruguay (see the related Wikipedia page), held at the legendary Maracanã stadium in Rio de Janeiro. The Seleção was the favorite and, in everyone’s opinion, the strongest team of the World Cup. Everything was set for an epic triumph of the Brazilian team. Instead, to everyone’s disbelief, Uruguay triumphed. It was the worst football memory of every Brazilian. Well, the worst until last Tuesday, when the Seleção suffered the biggest defeat in its history.
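Spotting peaks of this kind in a series of daily page visits takes only a few lines. What follows is a minimal sketch under stated assumptions, not the analysis behind the figures: the function name `find_peaks`, the median baseline and the factor of 3 are illustrative choices, and the numbers are invented rather than real Wikipedia counts.

```python
from statistics import median


def find_peaks(daily_views, factor=3.0):
    """Return the indices of days whose views exceed `factor` times the median.

    ASSUMPTION: a median-based threshold is just one simple way to flag
    unusually busy days; it is not the method used for the figures.
    """
    if not daily_views:
        return []
    baseline = median(daily_views)
    return [day for day, views in enumerate(daily_views) if views > factor * baseline]


# Invented toy series: a quiet page with two match-day spikes.
toy_views = [100, 120, 110, 900, 130, 105, 2500, 140]
print(find_peaks(toy_views))
```

The median is used instead of the mean so that a single huge spike, like the one on 8 July, does not raise the baseline and hide the smaller peaks.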
Even though the Maracanaço and the Mineiraço (as the recent defeat has been called) suggest that football is unpredictable, the results of this World Cup were actually very easy to predict. Giants such as Microsoft, Google and Goldman Sachs have put a lot of effort into football data analysis, providing accurate predictions of match outcomes. The algorithm developed by Microsoft, for example, takes advantage of “predictive models that assess the strength of each team using several factors such as past win or loss, record in qualification matches and other global competitions and margin of victory in those games”. From the round of 16 until now, the accuracy of this algorithm has been 100%: it picked every team that advanced to the next stage of the World Cup. In building its models, Microsoft does not consider a team’s strategy or tactical formations. The behavior of players on the field, the interactions between teammates and the clashes between opponents are not considered at all. Only numbers, data and statistics from the past are used to predict the future. We reach almost the same prediction accuracy as Microsoft’s model, and the same as Google’s model, but in a different way: we just look at how the ball moves on the field.
Taca la bala
“Taca la bala” (attack the ball) was the leitmotif of Helenio “the wizard” Herrera, the legendary coach of Internazionale Milano during the sixties, when he won everything that there was to win.
To build our model, we started by looking at ball and player trajectories, analyzing how each team touches and attacks the ball. By computing a measure of heterogeneity on the trajectories, we discovered that winning teams have more heterogeneous and unpredictable ball trajectories. This simple but powerful feature allowed us to correctly pick all but one of the teams that advanced from the round of 16 to the final. The only exception (as for Google’s model) was France-Germany: Benzema and company had a higher heterogeneity, but they lost 1-0 to a goal by Hummels, a German defender who took advantage of a cross from a free kick. France had many chances to equalize but missed them all: Neuer, Germany’s goalkeeper, was a hero.
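The post does not say which heterogeneity measure is computed on the trajectories. As a minimal sketch, assuming the pitch is divided into zones and each ball movement is recorded as an (origin zone, destination zone) pair, Shannon entropy is one plausible stand-in: it is zero when a team always plays the same route and grows as the routes become more varied. The function names, the zone labels and the toy data below are all hypothetical.

```python
import math
from collections import Counter


def trajectory_heterogeneity(movements):
    """Shannon entropy (in bits) of a team's zone-to-zone ball movements.

    `movements` is a list of (origin_zone, destination_zone) pairs.
    ASSUMPTION: entropy stands in for the unspecified heterogeneity measure.
    """
    counts = Counter(movements)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1 / p) over every distinct movement, with p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def more_heterogeneous(movements_by_team):
    """Return the name of the team whose movements have the highest entropy."""
    return max(
        movements_by_team,
        key=lambda team: trajectory_heterogeneity(movements_by_team[team]),
    )


# Invented toy data, not real match data.
one_route_team = [("right", "centre")] * 8
varied_team = [
    ("left", "centre"), ("centre", "right"), ("right", "centre"), ("centre", "left"),
    ("left", "right"), ("right", "left"), ("centre", "centre"), ("left", "left"),
]

print(trajectory_heterogeneity(one_route_team))
print(trajectory_heterogeneity(varied_team))
print(more_heterogeneous({"one-route": one_route_team, "varied": varied_team}))
```

Under this assumption, a prediction amounts to comparing two numbers and picking the larger one. As France-Germany shows, that favors the less predictable team without guaranteeing the result.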
In contrast with the Microsoft and Google models, ours does not take into account history, recent or older victories, defeats or statistics. We consider how the team plays on the field now: given just a few matches, we are able to measure the strength of a team and estimate its chances of winning the next games.
Road to the World Cup
Before the semi-finals, this was the output of our model:
Germany and Argentina showed higher values than their rivals, and so had the best chances of reaching the 2014 World Cup final. Now, let’s update the measure to take into account the trajectories from the semi-finals and look at the situation:
The 7-1 victory against Brazil raised the heterogeneity measure of Germany, which is now the favorite to win the final. The pictures below show a graphical representation of our model: arrows represent ball trajectories and colored rectangles represent the most important defending zones. Argentina is unbalanced toward the right, a place also called the “Messi tile”. Germany, in contrast, shows a more varied strategy: the zones where Müller and Klose play are highly connected, as are the zones where other great players play (Özil, Schweinsteiger, Kroos, Schürrle…). Germany has a more diversified strategy, is more unpredictable and has more game choices than the Messi-based strategy of Argentina.
“At the end, the Germans always win”
The predictions of our model agree with those of Microsoft and Google: Germany is going to win the 2014 World Cup. “Football is a simple game: 22 men chase a ball for 90 minutes and at the end, the Germans always win”, as English player Gary Lineker said at the 1990 World Cup, after England lost the semi-final to West Germany (the final that year was Germany-Argentina again).
Football, however, is not just numbers. As history has taught us (the Maracanaço, Maradona’s “Mano de Dios”, the defeats in the final of legendary and seemingly invincible teams like the Netherlands of “Total Football” or the Hungary of Puskás), in football sometimes the weakest wins, and sometimes something unpredictable happens.
We’ll see whether numbers prevail over randomness and, of course, we will continue our studies of sports data. We expect a thrilling game, and another huge peak in the Wikipedia plot.
Let’s turn on our DataTV and clear our minds: the “last sacred drama of our time” is beginning. May the best team win, or even the worst.
Paolo Cintia and Luca Pappalardo
Another post about our work on football data has been published by the Physics World website; you can read it here.