Choosing the best Fantasy Premier League team using StatsBomb data and linear programming

This has been my first season playing FPL and experiencing its inversion of the football fan experience. I used to be a carefree Spurs fan, enjoying Son smashing in the goals, revelling in Chelsea and Arsenal losing, and not caring too much about what was going on beyond Spurs’s competitors. Now I find myself hoping any Spurs player other than Son will score, wanting the likes of Tammy Abraham and Aubameyang to rack up the goals, and having a minor heart attack whenever Leicester get a goal, praying that it’s not Vardy!

FPL has become an obsession and rather than agonising over which players to choose because I’m 1 million over budget, I wondered if I could write an algorithm to help me choose my team. As a typically lazy programmer I’d rather a computer do the work for me than have to think through all the options myself.

Around the start of this season I began a new job at StatsBomb, who have the best football data in the world. Now that I had access to some data, I could combine it with a technique called “linear programming”.

Conceptually linear programming is quite simple. It optimises a particular value given a set of constraints. This is a nice fit for FPL where we want to optimise our points and are constrained by budget, positions and number of players per team. So, a linear programming model can tell us which are the 15 best players to pick based on their points and the game constraints.
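To make the objective and constraints concrete, here is a minimal Python sketch. It is not a linear programming solver and it is not the model used for the results in this article: it brute-forces a tiny made-up pool of players, which finds the same optimum a solver would for the same objective (total expected points) and constraints (budget, positions, players per club). The player names, prices, points, and the scaled-down quotas are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up pool: (name, position, club, price in millions, expected points).
POOL = [
    ("Keeper A", "GK", "X", 5.0, 4.0),
    ("Keeper B", "GK", "Y", 4.5, 3.0),
    ("Defender A", "DEF", "X", 6.0, 7.0),
    ("Defender B", "DEF", "Y", 5.0, 6.0),
    ("Defender C", "DEF", "Z", 4.5, 4.0),
    ("Midfielder A", "MID", "X", 12.0, 12.0),
    ("Midfielder B", "MID", "Y", 8.0, 9.0),
    ("Midfielder C", "MID", "Z", 5.0, 5.0),
    ("Forward A", "FWD", "X", 10.0, 10.0),
    ("Forward B", "FWD", "Z", 6.0, 7.0),
]

# Scaled-down toy rules. The real game picks 2 GK, 5 DEF, 5 MID and 3 FWD
# with a 100 million budget and at most 3 players per club.
QUOTA = {"GK": 1, "DEF": 2, "MID": 2, "FWD": 1}
BUDGET = 40.0
MAX_PER_CLUB = 2


def best_squad(pool, quota, budget, max_per_club):
    """Return (squad, points) maximising expected points within the constraints."""
    best, best_points = None, float("-inf")
    for squad in combinations(pool, sum(quota.values())):
        if Counter(p[1] for p in squad) != Counter(quota):
            continue  # wrong number of players in some position
        if sum(p[3] for p in squad) > budget:
            continue  # over budget
        if max(Counter(p[2] for p in squad).values()) > max_per_club:
            continue  # too many players from one club
        points = sum(p[4] for p in squad)
        if points > best_points:
            best, best_points = squad, points
    return best, best_points


squad, points = best_squad(POOL, QUOTA, BUDGET, MAX_PER_CLUB)
for name, position, club, price, expected in squad:
    print(f"{position:<4}{name:<14}{price:>5.1f}{expected:>6.1f}")
print("Expected points:", points)
```

Brute force only works because the pool is tiny. With several hundred real players the number of possible 15-man squads is far too large to enumerate, which is why a linear programming solver is the right tool for the real problem.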

But what does “points” mean here? We could optimise based on the total number of points each player has already achieved in the season or on how many points we’d expect them to score in the rest of the season. This is where the StatsBomb data comes in. I have a program that calculates an expected points total for each player based on expected goals, expected assists and some other stats. The program also looks ahead at each player’s remaining fixtures and calculates another expected points total based on the difficulty of those fixtures. The linear programming algorithm is then coded to optimise based on the expected points total of each player over their upcoming fixtures.

In the next section I will explain my results using this model. You can skip to a more detailed explanation of the methodology if you prefer.

Results

According to my model this is the best team to pick based on the remaining fixtures (restricted to players with a minimum of 1260 minutes played, which is about half the season):

            Name             Expected points total
1          Salah             28.653292
2          Jesus             27.934728
3            Son             21.361012
4  Calvert-Lewin             18.983791
5           Jota             18.388117
6      H. Barnes             17.976550
7       Otamendi             17.002770
8        Doherty             15.595402
9          Pérez             15.174740
10      Cantwell             12.331330
11   Azpilicueta             12.302204
12         Saïss             10.408935
13     Lundstram              9.830002
14       Ederson              8.498704
15    Schmeichel              3.707534

The model treats all 15 players equally even though we only receive points for 11 of them. We don’t want to be wasting money on a reserve goalkeeper and third sub, so I usually force a cheap goalkeeper and defender into the model’s squad. If I input McCarthy (Southampton’s goalkeeper) and Hanley (a defender from Norwich), who cost 4.5 million and 4.1 million respectively, these are the results:

            Name            Expected points total
1          Salah            28.6532918
2           Mané            28.4829100
3          Jesus            27.9347276
4  Calvert-Lewin            18.9837910
5           Jota            18.3881167
6      H. Barnes            17.9765501
7       Otamendi            17.0027702
8        Doherty            15.5954019
9       Cantwell            12.3313296
10   Azpilicueta            12.3022040
11    A. Pereira            12.2059326
12         Saïss            10.4089353
13       Ederson             8.4987043
14      McCarthy            -0.4744323
15        Hanley            -2.2322966

That looks better. The model tends to favour lower-priced assets that profile well statistically. Jesus has the best xG in the league and is under 10 million whilst Calvert-Lewin has the fourth best xG and costs 6.5 million. It’s not surprising to see them picked. This is also true of Cantwell (4.7 million) who is in the top 40 in the league for xG and Saïss (4.6 million) who has the sixth best xG for defenders and plays in a strong Wolves defence.

I use this model as more of a guide in choosing my team so I would avoid some of the players it’s picked. Jesus is unlikely to play often when Aguero is fit and Andreas Pereira did not feature for Manchester United in the last three gameweeks that were played before the coronavirus lockdown.

Here are some further results from the model that may be of interest. Firstly, the 15 players to pick based on expected points per 90 minutes (this doesn’t take fixtures into account):

            Name              Expected points per 90
1          Jesus              3.3687544
2           Mané              3.0264806
3         Mahrez              2.5371978
4        Abraham              2.4541622
5  Calvert-Lewin              2.2637959
6      H. Barnes              2.0761379
7          Pérez              1.6569667
8       El Ghazi              1.5698837
9        Doherty              1.5381738
10      Otamendi              1.3219406
11   Azpilicueta              1.2790085
12     Lundstram              1.2154473
13         Saïss              1.0004529
14      Patrício              0.2907519
15    Schmeichel              0.1611271

Top 10 players for expected points across the remaining fixtures:

            Name              Expected points total
1          Salah              28.65329
2           Mané              28.48291
3          Jesus              27.93473
4         Agüero              24.94419
5      De Bruyne              23.28605
6       Sterling              22.86665
7            Son              21.36101
8         Mahrez              21.29700
9        Firmino              21.12340
10 Calvert-Lewin              18.98379

Top 10 goalkeepers:

         Name             Expected points total
1     Ederson             8.4987043
2     Alisson             5.7018773
3    Patrício             3.8643442
4  Schmeichel             3.7075341
5      de Gea             3.2641236
6        Kepa             3.0695601
7   Henderson             1.4139673
8      Guaita             0.9958724
9        Pope            -0.1593013
10       Leno            -0.3901269

Top 10 defenders:

               Name             Expected points total
1          Otamendi             17.002770
2           Doherty             15.595402
3  Alexander-Arnold             15.182936
4            Walker             12.329840
5       Azpilicueta             12.302204
6         Robertson             11.483739
7             Saïss             10.408935
8         Lundstram              9.830002
9          van Dijk              8.437297
10            Jonny              7.365544

Top 10 midfielders:

             Name              Expected points total
1           Salah              28.65329
2            Mané              28.48291
3       De Bruyne              23.28605
4        Sterling              22.86665
5             Son              21.36101
6          Mahrez              21.29700
7  Bernardo Silva              18.55397
8       H. Barnes              17.97655
9         Martial              17.46974
10           Alli              16.93351

Top 10 forwards:

            Name              Expected points total
1          Jesus              27.93473
2         Agüero              24.94419
3        Firmino              21.12340
4  Calvert-Lewin              18.98379
5        Abraham              18.86519
6           Ings              18.39151
7           Jota              18.38812
8        Jiménez              17.83874
9          Vardy              17.79381
10      Rashford              16.01084

I like the look of these results. The model usually picks good players who we’d expect to score a lot of points.

Prior to gameweek 16 I’d been handpicking my own team. My overall rank was hovering around the very average 2 million mark. I used my wildcard in gameweek 16 and ran my model to pick a new team. Unfortunately things got worse before they got better. I only picked up 26 points in that gameweek and plummeted to 3.1 million in overall ranking. By gameweek 23 things had improved a little but my rank was still lower compared to where I was with my handpicked team. Since gameweek 24 my rank has improved each week so I am now just outside the top 750k.

Things changed between gameweeks 16 and 24. When I first used my model I was too trusting and picked my entire team based on the model’s recommendations. This meant I had expensive players on the bench and wasn’t taking upcoming fixtures into account. I only updated the model to optimise based on fixtures after gameweek 23. Sometimes the model will also choose players who look good statistically but don’t make a lot of sense to pick. David McGoldrick of Sheffield United has very good xG numbers but he’s yet to score a goal in the PL. I didn’t end up falling into this trap but it’s always good to be wary. Blindly following the numbers is not the path to success.

Methodology

There are two components to my methodology. The first is a Clojure program that calculates expected points for each player. The second is a linear programming model written in R that works out the best 15 players to pick based on expected points.

Calculating expected points

I calculate expected points using the scoring rules of FPL with points being assigned proportionally according to StatsBomb’s expected goals (xG), expected assists (xA) and goals saved above average percentage (GSAA - for goalkeepers only) metrics. All of these statistics are averaged per 90 minutes and StatsBomb has home and away xG values for each team and player.

My program also adjusts each player’s xG and his team’s xG conceded based on the upcoming opponent’s xG for and against. For example, let’s look at Mohamed Salah, who is due to play away against Arsenal in gameweek 36. Salah’s xG away is 0.419 and Arsenal’s xG conceded at home is 1.382. These are the steps to calculate Salah’s xG for this match:

  • Calculate an average xG conceded across the Premier League for teams playing at home excluding Arsenal (as Arsenal cannot play against itself). This average is 1.174.
  • Divide Arsenal’s xG conceded at home by the average above, which gives a ratio of 1.177.
  • Multiply Salah’s xG by 1.177 to get his xG against Arsenal, which is 0.493.

Arsenal are expected to concede more at home than the average so it makes sense that Salah has a higher xG for this match.
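The three steps above can be sketched in a few lines of Python. This is an illustration rather than the Clojure program itself. The function names are made up, and the team values passed to average_excluding are hypothetical, since only the Arsenal figure and the resulting league average are quoted here.

```python
def average_excluding(values_by_team, team):
    """Mean of a per-team figure across the league, leaving one team out."""
    others = [value for name, value in values_by_team.items() if name != team]
    return sum(others) / len(others)


def opponent_adjusted(base_value, opponent_value, league_average):
    """Scale a per-90 figure by how the opponent compares with the league average."""
    return base_value * (opponent_value / league_average)


# Hypothetical home xG conceded figures, only to show the exclusion step.
toy_home_xg_conceded = {"Arsenal": 1.382, "Team B": 1.0, "Team C": 1.2}
toy_average = average_excluding(toy_home_xg_conceded, "Arsenal")  # mean of 1.0 and 1.2

# Salah away at Arsenal, using the figures quoted in the text.
salah_xg = opponent_adjusted(0.419, 1.382, 1.174)
print(round(salah_xg, 3))  # 0.493
```

The same opponent_adjusted function works for a team’s xG conceded: pass the team’s own xG conceded as the base value and compare the opponent’s xG with the league average xG.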

Midfielders can also get one point for a clean sheet. Liverpool concede 1.019 xG away and Arsenal’s xG at home is 1.357. So, my program will calculate Liverpool’s xG conceded for this match with a similar set of steps:

  • Calculate an average xG across the Premier League for teams playing at home excluding Arsenal. This average is 1.342.
  • Divide Arsenal’s xG at home by the average above, which gives a ratio of 1.011.
  • Multiply Liverpool’s xG conceded by 1.011 to get their xG conceded against Arsenal, which is 1.030.

Arsenal are expected to score slightly more at home than the average so we’d expect Liverpool to have a slightly higher xG conceded value for this match.

So, I have xG for each player and each team’s xG conceded based on their upcoming opponents. I can now calculate the expected points total for each position.

There are some commonalities between positions such as points for clean sheets, goals and assists. I’ll describe the calculations for each of these before going onto the specifics of each position.

Expected points for clean sheets

Goalkeepers and defenders receive four points for a clean sheet and lose one point for every two goals conceded. To calculate an expected points total for this I multiply each team’s xG conceded by 4 and then subtract this from 4. So, a goalkeeper or defender whose team has 0.7 xG conceded is awarded 1.2 points. If a team has more than 1 xG conceded then the player gets negative points, and in that case the value is simply 1 minus the xG conceded, without the multiplier of 4. For example, a goalkeeper whose team has 1.2 xG conceded would get -0.2 points. A value of 2.5 xG conceded would get -1.5 points.
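As a small Python sketch of this calculation: the rule for more than 1 xG conceded is an assumption inferred from the worked examples (1.2 gives -0.2 and 2.5 gives -1.5), which are consistent with 1 minus the xG conceded. The function name is made up for illustration.

```python
def clean_sheet_points(team_xg_conceded):
    """Expected clean sheet points for a goalkeeper or defender."""
    if team_xg_conceded <= 1:
        # Up to 1 xG conceded: scale the 4 clean sheet points down.
        return 4 - 4 * team_xg_conceded
    # Above 1 xG conceded (assumed from the examples): 1 minus xG conceded.
    return 1 - team_xg_conceded


for xg_conceded in (0.7, 1.2, 2.5):
    print(xg_conceded, round(clean_sheet_points(xg_conceded), 3))
```

Both branches give 0 at exactly 1 xG conceded, so the value does not jump as a team crosses that threshold.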

Expected points for assists

Every player gets three points for an assist. Expected points for assists is simply the player’s xA multiplied by 3. So, a player with 0.25 xA per 90 minutes will be assigned 0.75 points.

Expected points for goals

As with assists, expected points for goals is the player’s xG multiplied by the points they receive for a goal in their position. A midfielder gets 5 points for a goal so if they have an xG of 0.35 they are awarded 1.75 points.

Expected points for goalkeepers

Goalkeepers receive 4 points for a clean sheet, 1 point for every 3 saves and minus 1 point for every two goals conceded:

  • Firstly, I calculate their expected points for a clean sheet as described above.
  • I then add this value to the goalkeeper’s GSAA which will be a number between -1 and 1.
  • So, a goalkeeper whose team is expected to concede 0.9 xG against their opponent and has 0.15 GSAA, will have an expected points total of 0.55.
  • As an example, Vicente Guaita will play for Crystal Palace at home against Manchester United in gameweek 36. For this match I calculate Palace’s xG conceded as 1.0810. Guaita’s GSAA at home is 0.425. So, the expected points total for Guaita is 0.344.
  • Goalkeepers can get points for goals and assists but it’s extremely unlikely this would happen in a match so I don’t take this into account when assigning expected points.
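Both goalkeeper examples above can be reproduced with a short Python sketch. It assumes, as the worked numbers suggest, that expected clean sheet points are 4 minus 4 times xG conceded up to 1 xG conceded and 1 minus xG conceded beyond that. The function names are illustrative.

```python
def clean_sheet_points(team_xg_conceded):
    """Expected clean sheet points (rule above 1 xG conceded is assumed)."""
    if team_xg_conceded <= 1:
        return 4 - 4 * team_xg_conceded
    return 1 - team_xg_conceded


def goalkeeper_points(team_xg_conceded, gsaa):
    """Expected points for a goalkeeper: clean sheet points plus GSAA."""
    return clean_sheet_points(team_xg_conceded) + gsaa


print(round(goalkeeper_points(0.9, 0.15), 3))     # generic example: 0.55
print(round(goalkeeper_points(1.0810, 0.425), 3))  # Guaita v Manchester United: 0.344
```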

Expected points for defenders

Defenders receive 4 points for a clean sheet, 6 points for a goal, 3 points for an assist and minus 1 point for every two goals conceded:

  • Their expected points is simply the sum of the expected points for clean sheets, goals and assists described above
  • For example, in gameweek 36 Matt Doherty will play for Wolves against Burnley away. For this match I calculate Wolves’s xG conceded as 0.960 and Burnley’s as 1.133. Doherty’s xG for this game is 0.204 and his xA is 0.020. So, the expected points total for Doherty is 1.444.

Expected points for midfielders

Midfielders receive 1 point for a clean sheet, 5 points for a goal and 3 points for an assist:

  • Expected points for clean sheets is calculated similarly to what is described above but midfielders cannot get negative points for conceding so if their team’s xG conceded is 1 or above then they are awarded 0 expected points
  • Expected points is the sum of the expected points for clean sheets, goals and assists
  • For example, Kevin De Bruyne will play for Manchester City at home against Bournemouth in gameweek 36. His xG and xA for this game are 0.251 and 0.613 respectively. City’s xG conceded for the match is 0.663 and Bournemouth’s is 2.648. So, the expected points total for De Bruyne is 3.431.

Expected points for forwards

Forwards receive 4 points for a goal and 3 points for an assist:

  • Their expected points is simply the sum of the expected points for goals and assists described above
  • As an example, Southampton’s Danny Ings will feature at home against Brighton in gameweek 36. His xG and xA for this game are 0.469 and 0.092 respectively. Brighton’s xG conceded for the match is 1.238. So, the expected points total for Ings is 2.152.
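Pulling the three outfield positions together, the Doherty, De Bruyne and Ings examples can be checked with a minimal Python sketch. This is not the Clojure program. The names are made up, and the defender rule above 1 xG conceded (1 minus xG conceded) is an assumption taken from the clean sheet examples earlier.

```python
GOAL_POINTS = {"DEF": 6, "MID": 5, "FWD": 4}
ASSIST_POINTS = 3


def clean_sheet_points(position, team_xg_conceded):
    """Expected clean sheet points for an outfield player."""
    if position == "DEF":
        if team_xg_conceded <= 1:
            return 4 - 4 * team_xg_conceded
        return 1 - team_xg_conceded  # assumed rule above 1 xG conceded
    if position == "MID":
        # One point for a clean sheet, never negative.
        return max(0.0, 1 - team_xg_conceded)
    return 0.0  # forwards get nothing for clean sheets


def outfield_points(position, xg, xa, team_xg_conceded=0.0):
    """Expected points for one match: goals plus assists plus clean sheet."""
    return (
        GOAL_POINTS[position] * xg
        + ASSIST_POINTS * xa
        + clean_sheet_points(position, team_xg_conceded)
    )


print(round(outfield_points("DEF", 0.204, 0.020, 0.960), 3))  # Doherty: 1.444
print(round(outfield_points("MID", 0.251, 0.613, 0.663), 3))  # De Bruyne: 3.431
print(round(outfield_points("FWD", 0.469, 0.092), 3))         # Ings: 2.152
```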

Choosing the best players using linear programming

With the expected points totals calculated I can feed the linear programming model this data to work out the best team. I am heavily indebted to Martin Eastwood’s linear programming code which I’ve borrowed with a few tweaks. Thanks Martin!

There isn’t a huge amount to describe here as the magic is in the linear programming algorithm. I’ve extended Martin’s code so that I can input my current team and the algorithm will tell me changes I can make. This is useful when I have free transfers and want to make a few changes for the upcoming gameweeks. I’ve also tweaked the code so that I can make sure the algorithm always includes certain players.
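To give a feel for the transfer and “always include” ideas, here is a rough Python sketch. It is not the R code and not a linear programming model: it simply tries every like-for-like swap between a current squad and the rest of the market, and skips any player marked as locked. All names, prices and points are made up, and the function and its parameters are hypothetical.

```python
from collections import Counter

# Made-up data: (name, position, club, price in millions, expected points).
SQUAD = [
    ("Keeper A", "GK", "X", 5.0, 4.0),
    ("Defender A", "DEF", "X", 6.0, 7.0),
    ("Midfielder C", "MID", "Z", 5.0, 5.0),
    ("Forward B", "FWD", "Z", 6.0, 7.0),
]
MARKET = [
    ("Midfielder A", "MID", "X", 12.0, 12.0),
    ("Midfielder B", "MID", "Y", 8.0, 9.0),
    ("Forward A", "FWD", "X", 10.0, 10.0),
]


def best_single_transfer(squad, market, bank, max_per_club, locked=()):
    """Return (gain, player_out, player_in) for the best swap, or None."""
    best = None
    for out in squad:
        if out[0] in locked:
            continue  # locked players are always kept
        for new in market:
            if new[1] != out[1]:
                continue  # like-for-like swaps keep the position quotas intact
            if new[3] > out[3] + bank:
                continue  # cannot afford the replacement
            clubs = Counter(p[2] for p in squad if p is not out)
            clubs[new[2]] += 1
            if max(clubs.values()) > max_per_club:
                continue  # too many players from one club
            gain = new[4] - out[4]
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, out[0], new[0])
    return best


print(best_single_transfer(SQUAD, MARKET, bank=4.0, max_per_club=2))
print(best_single_transfer(SQUAD, MARKET, bank=4.0, max_per_club=2,
                           locked={"Midfielder C"}))
```

In a linear programming model the same ideas would more naturally be expressed as extra constraints, for example fixing a locked player’s decision variable to 1 or limiting how many selected players differ from the current team. That is a guess at one reasonable formulation, not a description of the actual code.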

Improvements

I think I’ve had some good results using this model but there are some improvements I’d like to make:

  • Include historical data - at the moment I’ve only used data for this PL season which is a small sample size. Including data from previous seasons should give me more reliable results
  • Statistical analysis - I would like to understand how well this model predicts future results. It’s possible something like historical points scored is a better predictor of future success than expected points
  • Improve the Clojure code - it’s become messy and has no tests so it’s hard to iterate

Conclusion

I welcome any feedback on this approach so please leave a comment on the article or contact me on Twitter.

I’d also like to say thank you to my colleagues at StatsBomb who helped me with various parts of this model.