each season, the bottom three teams get relegated to the second-highest division of English
football, in exchange for three promoted teams. A Premier League season usually takes place
from mid-August to mid-May. Each team gets to play every other team twice, once at home
and once on the road, hence there are a total of thirty-eight fixtures in a season for each
team (“Premier League Explained” 2020).
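The fixture count follows directly from the double round-robin format. As a minimal sketch, assuming a 20-club league (the club count is not stated in this passage, and the placeholder club names are hypothetical), the schedule can be enumerated as ordered home/away pairs:

```python
from itertools import permutations


def double_round_robin(teams):
    """Return every ordered (home, away) pairing: each pair of clubs meets twice."""
    return list(permutations(teams, 2))


# Hypothetical placeholder names; a 20-club league is assumed here.
teams = [f"Club {i}" for i in range(1, 21)]
fixtures = double_round_robin(teams)

# Count the matches each club is involved in, whether at home or away.
matches_per_team = {
    t: sum(1 for home, away in fixtures if t in (home, away)) for t in teams
}
```

Under this assumption each club appears in 2 × 19 = 38 matches, which agrees with the figure quoted above.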
The most important aspect of the game of football is indisputably scoring goals. Despite
the significance of other factors like ball possession or disciplined defending, we have to admit
that the main reason we pay to watch football is to see the ball being put in the back of the
net. The rule is very simple: in order to win, you must score more than your opponent. In
the Premier League, each match lasts ninety minutes (plus stoppage
time) and consists of two 45-minute halves. Each team gets one of three
results from each match: a win, a draw, or a loss. If there is a draw, the two clubs receive a
point apiece; otherwise, the winner is rewarded with three points and the
losing team receives zero points. Thus the club with the most points at the end of
the season will have its hands on the exquisite EPL trophy, and the point totals also
determine the fates of teams in the relegation zone (“Premier League Explained” 2020).
This makes every single match critical, as dropping even a single point could end up costing a
team its chance of winning the title or remaining in the top-tier football league in England.
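The scoring rule above can be sketched in a few lines. This is only an illustration: the function name `league_table`, the tuple layout, and the three-club results are hypothetical, and tie-breakers between clubs on equal points are ignored.

```python
from collections import defaultdict


def league_table(results):
    """Build a points table from (home, away, home_goals, away_goals) tuples."""
    points = defaultdict(int)
    for home, away, home_goals, away_goals in results:
        if home_goals > away_goals:      # home win: 3 points vs 0
            points[home] += 3
            points[away] += 0
        elif home_goals < away_goals:    # away win: 3 points vs 0
            points[away] += 3
            points[home] += 0
        else:                            # draw: 1 point apiece
            points[home] += 1
            points[away] += 1
    # Highest points first; tie-breakers are not modelled in this sketch.
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


# Made-up results for three hypothetical clubs.
results = [("A", "B", 2, 1), ("B", "C", 0, 0), ("C", "A", 1, 3)]
table = league_table(results)
```

Here club A wins both of its matches and tops the table with six points, while B and C take one point each from their draw.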
In this paper, we attempt to use statistical methods to model and predict goal scoring
and match results in the Premier League. We will first determine whether notable aspects of
goal scoring, namely the number of goals scored, the time between goals, and the timing
of goals within a match, fit the characteristics of a Poisson process. We will then use Poisson
regression to predict what would happen in the 2018-19 EPL season, for instance, which
clubs are more likely to win the title or get relegated, using different subsets of data from
prior seasons. The paper is organized as follows: We first introduce the data that we used for
our analyses in Section 2. Next, our methodologies are described in Section 3. We then
spend the next two sections, 4 and 5, on the two main topics of this research: using the
Poisson process to model goal scoring, and using Poisson regression to predict the 2018-19
season outcomes. Lastly, in Section 6, we give a quick summary of our results as well as
discuss possible future work related to this research.
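To make the first question concrete: if goals arrive as a homogeneous Poisson process, the number of goals in a match follows a Poisson distribution, so one simple check is to compare observed goal-count frequencies with the frequencies a fitted Poisson distribution would predict. The sketch below illustrates that idea only; it is not necessarily the procedure used in the later sections, and the goal counts are made up for illustration.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def expected_vs_observed(goal_counts):
    """Compare observed frequencies of goals per match with Poisson expectations.

    The rate is estimated by the sample mean, which is the maximum
    likelihood estimate for a Poisson distribution.
    """
    n = len(goal_counts)
    lam = sum(goal_counts) / n
    observed = Counter(goal_counts)
    rows = [
        (k, observed.get(k, 0), n * poisson_pmf(k, lam))
        for k in range(max(goal_counts) + 1)
    ]
    return lam, rows


# Made-up goal counts for illustration only; not taken from the paper's data.
toy_counts = [0, 1, 1, 2, 3, 1, 0, 2, 4, 1, 2, 3]
lam, rows = expected_vs_observed(toy_counts)
for k, obs, exp in rows:
    print(f"{k} goals: observed {obs}, expected {exp:.2f}")
```

A large gap between the observed and expected columns would cast doubt on the Poisson assumption; a formal goodness-of-fit test would be needed to quantify it.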
2 Data
The first dataset for our investigation simply consists of match final scores of all Premier
League games from its inaugural competition, the 1992-93 season, to the last fixture of
the 2018-19 season. The main attributes of this dataset are the season, the home and away
teams, and the number of goals scored by each team. We rely on Football-Data.co.uk’s data
(Football-Data 2020), which contains all Premier League match final scores from 1993 to
2019. Each season has its own data file, and we read in and then join the individual datasets
together to get our desired data table. We utilize this data to model the number of goals
scored and then to make predictions for the 2018-19 season, using three different subsets of
seasons: 1) data from all seasons prior to 2018-19, 2) data from only the 2010s, and 3) data
from all seasons, with more weight assigned to more recent competitions.
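The read-and-join step and the three subsets can be sketched as follows. Several details here are assumptions rather than facts from the text: the column names `HomeTeam`, `AwayTeam`, `FTHG`, and `FTAG`, the cutoff used for "the 2010s", and the geometric-decay weighting (the weighting scheme is not specified in this section). The in-memory files and scores are made up.

```python
import csv
import io


def load_seasons(files):
    """Read one CSV per season and stack the rows, tagging each with its season.

    `files` maps a season's ending year (e.g. 2018 for 2017-18) to a file-like
    object. The column names HomeTeam, AwayTeam, FTHG, FTAG are assumed.
    """
    matches = []
    for end_year, handle in sorted(files.items()):
        for row in csv.DictReader(handle):
            matches.append({
                "season_end": end_year,
                "home": row["HomeTeam"],
                "away": row["AwayTeam"],
                "home_goals": int(row["FTHG"]),
                "away_goals": int(row["FTAG"]),
            })
    return matches


def recency_weight(season_end, latest_end, decay=0.9):
    """Hypothetical weighting: geometric decay for each season of age."""
    return decay ** (latest_end - season_end)


# Tiny in-memory stand-ins for the per-season files (made-up scores).
files = {
    2009: io.StringIO("HomeTeam,AwayTeam,FTHG,FTAG\nA,B,1,0\n"),
    2018: io.StringIO("HomeTeam,AwayTeam,FTHG,FTAG\nA,B,2,2\nB,A,0,3\n"),
}

# Subset 1: every season before 2018-19.
all_prior = load_seasons(files)
# Subset 2: the 2010s only, assumed here to start with the 2010-11 season.
decade = [m for m in all_prior if m["season_end"] >= 2011]
# Subset 3: all seasons, each match paired with a recency weight.
weighted = [(m, recency_weight(m["season_end"], 2018)) for m in all_prior]
```

With this scheme the most recent season gets weight 1 and each earlier season is discounted by a further factor of `decay`; any monotone weighting would serve the same purpose.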