Recently I began working with Kaggle’s datasets and kernel environment for data science, and have found them incredibly useful for quick, self-contained projects and trying out new techniques with Python. Kernels even have version control!
In an effort to branch out from my usual subjects of environmental and animal-focused data, I gave myself a challenge: sports, and specifically soccer. Now, I have no background in soccer and know very little about it, but I set out to find which player attributes matter most when building a team, and how to explain player selection to others building a team of their own.
Here’s Kaggle’s FIFA 2017 dataset, which contains a player’s name, height/weight, positions, overall rating, and ratings for various measures, including aspects like accuracy, speed, stamina, reflexes, etc.
I determined that the best way to find out what makes a player worth drafting would be to cut away the variables that did not explain the players' overall rating. After stripping the units from the height and weight columns so they could be treated as numbers, exploratory analysis revealed strong correlations (r > 0.6) between overall rating and both reactions and composure.
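Here is a minimal sketch of that cleanup and correlation check. The column names, the "185 cm" / "80 kg" unit format, and the sample rows are all placeholders I made up for illustration; the real dataset may label things differently.

```python
import pandas as pd

# Placeholder rows: column names, unit formats, and values are assumptions,
# not taken from the actual Kaggle file.
players = pd.DataFrame({
    "Height": ["185 cm", "170 cm", "193 cm", "178 cm"],
    "Weight": ["80 kg", "72 kg", "92 kg", "75 kg"],
    "Reactions": [85, 70, 90, 60],
    "Composure": [86, 68, 88, 62],
    "Rating": [88, 72, 92, 65],
})


def strip_units(column: pd.Series) -> pd.Series:
    """Keep the leading number in strings such as '185 cm' and return floats."""
    return column.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)


players["Height"] = strip_units(players["Height"])
players["Weight"] = strip_units(players["Weight"])

# Pearson correlation of every column with the overall rating.
correlations = players.corr()["Rating"].drop("Rating").sort_values(ascending=False)

# Keep only the strongly correlated attributes (r > 0.6).
strong = correlations[correlations > 0.6]
print(strong)
```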
After reducing the number of variables to work with to 10 using Recursive Feature Elimination (RFE) from scikit-learn, I used lasso regression to pull out the variables with the greatest impact on overall rating.
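As a rough sketch of that two-step approach, the example below runs RFE and then lasso on synthetic data where only the first two of 40 features actually drive the target. The estimator inside RFE (plain linear regression), the lasso `alpha`, and the scaling step are my assumptions for illustration, not necessarily the settings used in the kernel.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player attributes: 40 features, of which only
# the first two influence the target.
rng = np.random.default_rng(0)
n_players, n_features = 500, 40
X = rng.normal(size=(n_players, n_features))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_players)

# Standardize so that coefficient sizes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

# Step 1: recursively drop the weakest features until 10 remain.
selector = RFE(LinearRegression(), n_features_to_select=10)
selector.fit(X_scaled, y)
kept = np.flatnonzero(selector.support_)

# Step 2: lasso shrinks unhelpful coefficients toward zero.
# alpha=0.1 is an arbitrary illustrative value.
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled[:, kept], y)

# Rank the surviving features by absolute coefficient size.
ranked = sorted(
    ((int(index), float(coef)) for index, coef in zip(kept, lasso.coef_)),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for index, coef in ranked:
    print(f"feature {index}: {coef:.3f}")
```

On real data, the feature indices would map back to column names such as ball control or reactions.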
The most explanatory variables? Ball control and reactions. Together, these two alone explained 70% of the variation in the overall rating. Not bad for only two dimensions out of the original 40! (The next two variables did not add much to the model.)
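"Explained variation" here means the R² of a model fit on just those variables. Below is a small sketch of that comparison on synthetic data: a two-feature model against a four-feature model. The data, coefficients, and noise level are invented, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two strong predictors, two weak ones, plus noise.
rng = np.random.default_rng(1)
n_players = 400
X = rng.normal(size=(n_players, 4))
y = (
    2.0 * X[:, 0]
    + 1.5 * X[:, 1]
    + 0.2 * X[:, 2]
    + 0.1 * X[:, 3]
    + rng.normal(scale=1.5, size=n_players)
)


def r_squared(features: np.ndarray, target: np.ndarray) -> float:
    """Fit ordinary least squares and return R^2 on the same data."""
    model = LinearRegression().fit(features, target)
    return model.score(features, target)


r2_two = r_squared(X[:, :2], y)   # the two strongest variables only
r2_four = r_squared(X[:, :4], y)  # adding the next two variables

print(f"R^2 with two variables:  {r2_two:.3f}")
print(f"R^2 with four variables: {r2_four:.3f}")
print(f"Gain from the extra two: {r2_four - r2_two:.3f}")
```

A small gain from the extra variables is what "did not add much to the model" looks like in practice.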
Basing my squad off of the best-ranked teams in the world (as reported by The Guardian) and World Cup 2018 guidelines, I decided on having 3 goalkeepers, 4 forwards, and 8 each of midfielders and defense for a total of 23 players.
The squad, based on ball control, reactions, and overall rating:
Goalkeepers: Manuel Neuer, De Gea, Hugo Lloris
Forwards: Luis Suárez, Paulo Dybala, Zlatan Ibrahimović, Sergio Agüero
Midfielders: Cristiano Ronaldo, Lionel Messi, Neymar, Iniesta, Luka Modrić, Arjen Robben, Franck Ribéry, Miralem Pjanić
Defense: Marcelo, Philipp Lahm, Dani Alves, Jordi Alba, David Alaba, Sergio Ramos, Carvajal, Javier Mascherano
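One way to pick a squad like this is to score each player and take the top few per position. The sketch below assumes a simple score (the unweighted mean of ball control, reactions, and overall rating) and uses generated placeholder players with made-up column names and position codes, so treat it as an illustration rather than the exact selection rule.

```python
import pandas as pd

# Squad quotas: 3 goalkeepers, 4 forwards, 8 midfielders, 8 defenders.
QUOTAS = {"GK": 3, "FWD": 4, "MID": 8, "DEF": 8}

# Placeholder players: names, column names, and ratings are invented.
rows = []
for position in QUOTAS:
    for i in range(10):
        rows.append({
            "Name": f"{position}_{i}",
            "Position": position,
            "Ball_Control": 50 + 3 * i,
            "Reactions": 55 + 2 * i,
            "Rating": 60 + 3 * i,
        })
players = pd.DataFrame(rows)

# Assumed scoring rule: unweighted mean of the three measures.
players["Score"] = players[["Ball_Control", "Reactions", "Rating"]].mean(axis=1)

# Take the top-scoring players for each position, up to that position's quota.
squad = pd.concat([
    players[players["Position"] == position].nlargest(quota, "Score")
    for position, quota in QUOTAS.items()
])

print(squad[["Name", "Position", "Score"]].to_string(index=False))
```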
Comparing the final squad on relevant ratings other than overall rating, we see that the best goalkeepers have poorer heading, short pass, and ball control ratings than the rest of the squad. Clearly the important measures differ between goalkeepers and field players, and these two groups may be best analyzed separately.
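A quick way to see this kind of gap is to average each rating by group. The sketch below uses a handful of invented rows and assumed column names, so the values are placeholders only.

```python
import pandas as pd

# Placeholder squad: positions and ratings are invented for illustration.
squad = pd.DataFrame({
    "Position": ["GK", "GK", "DEF", "MID", "FWD"],
    "Heading": [20, 25, 75, 60, 70],
    "Short_Pass": [40, 45, 78, 88, 80],
    "Ball_Control": [30, 35, 76, 90, 88],
})

# Split the squad into goalkeepers and everyone else.
squad["Group"] = squad["Position"].map(
    lambda position: "goalkeeper" if position == "GK" else "field player"
)

# Mean rating per group, and the gap between the two groups.
measures = ["Heading", "Short_Pass", "Ball_Control"]
group_means = squad.groupby("Group")[measures].mean()
gap = group_means.loc["field player"] - group_means.loc["goalkeeper"]

print(group_means)
print(gap)
```

Fitting separate models for each group would be the natural next step.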
Tell me, soccer fans: Would this be an excellent team? While I know more than I did about FIFA, I will still admit to being far from an expert. Show me your own teams!
Full code here.