Predicting NBA Wins | Preliminary Model with Machine Learning

Thomas Tam
5 min read · Aug 27, 2020


Before I knew anything about data science, I liked to watch sports and would occasionally place sports bets. When I first started learning about sports betting, I would always check sports forums to see what the winning bet was for a given game. People would talk about their “models” and “algorithms,” and I had no idea what they were talking about… but it sounded interesting.

When I started my journey toward becoming a data scientist, I always had a sports betting model in the back of my mind, wanting to create one once I gained enough experience. For my classification project, we were allowed to choose any dataset to work on, and I thought this was the perfect opportunity to start my betting model!

Though this is only a classification model and not an unsupervised learning model, it was a good start: I learned more about basketball data and got an understanding of how to build a model from scratch.

The goal of this classification model is to use historical data from the 2004–2018 seasons to predict the outcomes of NBA games in the 2019 season.

Dataset

I happened to find a dataset online that was already collected by a Kaggle user: https://www.kaggle.com/nathanlauga/nba-games. The dataset is split into five CSV files and provides statistics from 2004 through 2019 (the first 66 games of the 2019 season).

A description of each file is shown below:

  • games.csv : all games from the 2004 season to the last update, with the date, teams, and details such as the number of points scored
  • games_details.csv : details for the games dataset — all player statistics for a given game
  • players.csv : player details (names)
  • ranking.csv : NBA rankings for a given day (split into West and East)
  • teams.csv : all NBA teams

The main CSV files used in the model were games and games_details, since most of the important features were in those files. Fortunately, data cleaning didn’t take much effort, since the data had most likely been preprocessed already.

Important Features

A classmate happened to be working on the same NBA prediction model, and he shared an article with me regarding important features.

Though I did not use the exact same features the article mentioned, I decided to implement a few of its important features and add others of my own to see how the model would perform. The features I used are shown below:

```python
y_target = ['HOME_TEAM_WINS']

X_features = ['FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home',
              'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away',
              'BLK_home', 'PF_home', 'STL_home', 'TO_home',
              'BLK_away', 'PF_away', 'STL_away', 'TO_away']
```
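As a minimal sketch of how the target and feature frames can be pulled out of the games data — a tiny synthetic frame stands in for the real games.csv here, and only a subset of the full feature list is shown:

```python
import pandas as pd

# Synthetic stand-in for games.csv; column names follow the Kaggle schema.
games = pd.DataFrame({
    'HOME_TEAM_WINS': [1, 0],
    'FG_PCT_home': [0.481, 0.412],
    'FG_PCT_away': [0.443, 0.507],
})

y_target = ['HOME_TEAM_WINS']
X_features = ['FG_PCT_home', 'FG_PCT_away']  # subset for illustration

# Select the label series and the feature matrix by column name.
y = games[y_target[0]]
X = games[X_features]
print(X.shape)  # (2, 2)
```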

Beyond the general counting statistics, I figured that shooting percentages are also important to a basketball game. For example, a team shooting poorly on a given day against an opposing team shooting lights-out will most likely lose — yet this is still just one feature among many.

Data Exploration

[Figure: PTS, REB, AST average statistics, 2004–2018 vs. 2019]
[Figure: FG, FT, FG3 percentage average statistics, 2004–2018 vs. 2019]
[Figure: multicollinearity heatmap for all features]

When Team 1 takes possession away from Team 2, Team 1 is credited with a steal and Team 2 is charged with a turnover. This explains why the multicollinearity plot shows a high correlation between these features.
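A quick way to see this relationship is to correlate one team’s steals with the opponent’s turnovers. The box-score values below are synthetic, but the near-perfect correlation mirrors what the heatmap shows:

```python
import pandas as pd

# Synthetic box scores: each home steal is also an away turnover,
# so the two columns move together almost perfectly.
df = pd.DataFrame({
    'STL_home': [7, 9, 5, 11, 8, 6],
    'TO_away':  [8, 11, 6, 13, 9, 7],
})

corr = df['STL_home'].corr(df['TO_away'])
print(round(corr, 3))  # close to 1.0
```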

[Figure: train/test split between the 2004–2018 and 2019 data]

As mentioned before, I split the data into 2004–2018 (training data) and 2019 (testing data).

  • 2004–2018 seasons (training data) — 22,131 games
  • 2019 season (test data) — 965 games
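The split itself can be sketched as a filter on the season column — the 'SEASON' column name follows the Kaggle games.csv schema, and the frame here is synthetic:

```python
import pandas as pd

# Synthetic stand-in for games.csv with a SEASON column.
games = pd.DataFrame({
    'SEASON':         [2004, 2010, 2018, 2019, 2019],
    'HOME_TEAM_WINS': [1, 0, 1, 1, 0],
})

# Train on 2004-2018 seasons, test on the 2019 season.
train = games[games['SEASON'] <= 2018]
test = games[games['SEASON'] == 2019]
print(len(train), len(test))  # 3 2
```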

Model Algorithms

After learning about different model algorithms, it was time to put them to the test. I tried the following:

  • Logistic Regression — predicts the probability of a binary event using a sigmoid function
  • Decision Tree — a tree-like model of decisions built from conditional splits on the features
  • Random Forest — an ensemble method that builds many decision trees on random subsets of rows and features, then averages their results
  • K-Nearest Neighbors — stores the training inputs and, for a new input, finds the K closest points and votes for the most frequent label
  • XGBoost — an optimized implementation of gradient-boosted trees
  • Gradient Boosting — an ensemble method that adds trees sequentially, each correcting the errors of the previous ones
  • LinearSVC — a support vector classifier with a linear kernel
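The comparison can be sketched as a loop that fits each candidate and reports test accuracy. Synthetic data stands in for the real season split, and XGBoost is omitted since it lives in a separate package:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the season-based split.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'LinearSVC': LinearSVC(max_iter=5000),
}

# Fit each model and report its accuracy on the held-out data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f'{name}: {scores[name]:.3f}')
```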

Best Model

Out of all the models, the one that performed best was LinearSVC, with an accuracy score of 89.74% on the test data.

[Figure: ROC curves for the different algorithms]

Future Iterations

This is only a preliminary model, but I think I did pretty well for my first classification model (if I did everything correctly). I would consider my accuracy score a bit on the high side. This is probably because my model does not account for upsets, which happen quite often in the NBA. I would need to check what percentage of NBA games are upsets and compare that against my results.
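One quick sanity check along these lines is to compare the model against a naive baseline that always predicts a home win. The labels below are synthetic; with the real test set, the baseline would be the actual home-win rate for the 2019 season:

```python
import numpy as np

# Naive baseline: always predict that the home team wins. If the model
# barely beats this, its accuracy is less impressive than it looks.
y_test = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # 1 = home win (synthetic)

baseline_acc = (y_test == 1).mean()
print(baseline_acc)  # 0.7
```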

There are many improvements that could be made, including parameter tuning and adding more features. Because parameter tuning can take a lot of processing time and I had a project deadline, I was not able to build the model out fully. Referring back to the article on important features, there are features missing from my dataset that I would like to add to my model in the future.

Once I learn more about unsupervised learning, and when I have time to come back to this project, I will continue to build and tune my model to increase the accuracy score and better predict NBA wins.

