May 2, 2019

Modern basketball has 5 player positions: **point guard** (PG), **shooting guard** (SG), **small forward**
(SF), **power forward** (PF), and **center** (C). Each position requires a different set of skills, as each
has different responsibilities on the court. Height and weight can also help a player excel at certain positions.
For example, centers (the tallest and heaviest of the team) usually play near the baseline or close to the
basket.

I decided to visualize the relationship between height, weight, and basketball position among NBA players. Once I had gathered all the information, I would build a model that takes custom height and weight parameters and predicts which position those body measurements fit best.

**basketball-reference.com** (BR) is a great website for everything basketball related and the perfect source
for all the information I needed. BR has a player directory which includes all NBA and ABA players from all
time, indexed by letter, with their respective position, height, and weight. There is just one problem. If you visit
a player's page on BR it will tell you their specific position or role, such as PG or SF, but in the players-by-letter
index the positions are the "old" or "original" positions, which are just 3: **guard**, **forward**, and
**center**.

As much as I would have loved to use the 5 modern positions, scraping each individual player's page was far more complex and time consuming than gathering the data from the players-by-letter pages. Fun fact: there are no players whose last name starts with the letter 'X'.

I used **BeautifulSoup 4** and some of the source code of
PandasBasketball to scrape the data off BR.
But there was another problem: some players can play up to 2 roles, making the positions G, F, C, F-G, G-F,
F-C, and C-F, for a total of 7 positions. So I had to clean this up by merging duplicates like G-F and F-G under a single name.
I finally ended up with only 5 positions, or sets of positions:

- **G** - Guard
- **F** - Forward
- **C** - Center
- **G-F** - Guard-Forward
- **F-C** - Forward-Center
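The cleanup described above could look roughly like the sketch below. The names `CANONICAL_POS` and `normalize_pos` are made up for illustration, and keeping "G-F" and "F-C" as the surviving spellings is an assumption based on the final list of positions; the notebook may do this differently.

```python
# Map each reversed two-role label onto one canonical spelling.
# Assumption: "G-F" and "F-C" are the spellings to keep.
CANONICAL_POS = {"F-G": "G-F", "C-F": "F-C"}


def normalize_pos(pos: str) -> str:
    """Return the canonical label for a raw position string."""
    pos = pos.strip()
    return CANONICAL_POS.get(pos, pos)


raw = ["G", "F", "C", "F-G", "G-F", "F-C", "C-F"]
cleaned = [normalize_pos(p) for p in raw]
print(sorted(set(cleaned)))  # ['C', 'F', 'F-C', 'G', 'G-F']
```

With a pandas data frame, the same dictionary could be passed to `df["Pos"].replace(CANONICAL_POS)`.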

After changing from imperial units to the metric system, the **4685x8** data frame looked like this:

|   | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges |
|---|---|---|---|---|---|---|---|---|
| 0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 2.082823 | 108.862169 | June 24, 1968 | Duke University |
| 1 | Zaid Abdul-Aziz | 1969 | 1978 | F-C | 2.057423 | 106.594207 | April 7, 1946 | Iowa State University |
| 2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 2.184426 | 102.058283 | April 16, 1947 | University of California, Los Angeles |
| n | ... | ... | ... | ... | ... | ... | ... | ... |
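The unit conversion mentioned above can be sketched as follows. This assumes the raw height arrives as a "feet-inches" string such as `"6-10"` and the raw weight as pounds; the function names and the sample inputs are illustrative, not taken from the notebook.

```python
INCH_TO_M = 0.0254      # exact by definition
LB_TO_KG = 0.45359237   # exact by definition


def height_to_m(height: str) -> float:
    """Convert a 'feet-inches' string such as '6-10' to metres."""
    feet, inches = (int(part) for part in height.split("-"))
    return (feet * 12 + inches) * INCH_TO_M


def weight_to_kg(pounds: float) -> float:
    """Convert pounds to kilograms."""
    return pounds * LB_TO_KG


print(round(height_to_m("6-10"), 4))   # 2.0828
print(round(weight_to_kg(240), 6))     # 108.862169
```

A weight of 240 lb reproduces the 108.862169 kg in the first row of the table. The heights in the table differ from the exact conversion in the fifth decimal place, so the notebook probably used a slightly rounded factor.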

If you want to download the whole dataset as a **.csv file**, click here.

Plotting all players' height (y) and weight (x) grouped by their position resulted in this image:

This is somewhat useful, but the plot felt a little bit too cramped and some of the colors like the
blue with the purple and the red with the orange were not as distinguishable as I would have liked.
So, I decided to plot some pair plots splitting positions into groups. In seaborn, the **pair plot
function** will create a grid of Axes such that each variable in the data will be shared in the
y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated
differently, drawing a plot to show the univariate distribution of the data for the variable in that
column.

The pair plot with all 5 positions is practically the same: too many points overlapping each other making it difficult to appreciate each group.

But using one group containing only the G, F, and C positions and another group with G-F and F-C helped (at least for me) to better visualize how the positions are spread depending on height and weight.

Any basketball fan will tell you that the visualization does not provide any significant new information.
We already knew that **guards are shorter and lighter**, making them perfect for dribbling, that
**centers are the tallest and heaviest** to help protect the rim, and that **forwards are versatile**
being in between.

The G-F and F-C group held no surprises either: each two-role position spans the area covered by the two individual positions it combines.

The k-nearest neighbors algorithm (KNN) is a non-parametric method used for classification and regression. Given a new sample (like the green dot), the algorithm will take the sample's k nearest neighbors (red triangles and blue square) by measuring the distance. After that, the algorithm will assign the green dot to either the triangle or the square category depending on which was more dominant among the neighbors. In this example, since 2 of the 3 neighbors of the green dot are red triangles, the new sample would be assigned to this category.
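The voting step described above fits in a few lines of plain Python. This is a minimal sketch for illustration only, with made-up points and a hypothetical `knn_predict` helper; it is not the classifier used later in this post.

```python
from collections import Counter
from math import dist


def knn_predict(samples, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest samples."""
    # Sort the training points by Euclidean distance to the query.
    ranked = sorted(zip(samples, labels), key=lambda pair: dist(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    # The most common label among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


samples = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (4.0, 4.0), (5.0, 5.0), (2.5, 2.5)]
labels = ["triangle", "triangle", "triangle", "square", "square", "square"]

# Two of the three nearest neighbors of (2, 2) are triangles.
print(knn_predict(samples, labels, query=(2.0, 2.0), k=3))  # triangle
```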

To apply this algorithm to my data set, I first had to create a new column called "Cat", numbered from 0 to 4 to represent each position. The updated data frame then looked like this:

|   | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | Cat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 2.082823 | 108.862169 | June 24, 1968 | Duke University | 2 |
| 1 | Zaid Abdul-Aziz | 1969 | 1978 | F-C | 2.057423 | 106.594207 | April 7, 1946 | Iowa State University | 2 |
| 2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 2.184426 | 102.058283 | April 16, 1947 | University of California, Los Angeles | 0 |
| n | ... | ... | ... | ... | ... | ... | ... | ... | ... |
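One way to produce such a column is pandas' categorical codes, sketched below on a made-up `Pos` column. The alphabetical numbering this yields agrees with the three sample rows above (F-C is 2, C is 0), but that is only an observation; the notebook may assign the numbers in another way.

```python
import pandas as pd

df = pd.DataFrame({"Pos": ["F-C", "F-C", "C", "G", "F", "G-F"]})

# Categories are sorted alphabetically, so the codes come out as
# C=0, F=1, F-C=2, G=3, G-F=4.
df["Cat"] = df["Pos"].astype("category").cat.codes

print(df)
```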

I then split the data into training and test sets, scaled them, and built the actual classifier. The following block of code is an extract of this project's notebook. If you want to view or download the notebook click here.

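```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Fit the scaler on the training set only, then reuse it on the test set
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# p=2 with the Minkowski metric is the Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
classifier.fit(X_train, y_train)
```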

To test how accurate my model was, I used a **confusion matrix**. A confusion matrix is a specific table layout
that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in a
predicted class while each column represents the instances in an actual class (or vice versa).
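To make the layout concrete, the sketch below counts a confusion matrix by hand, with rows as predicted classes and columns as actual classes. The labels and the helper name are made up for illustration.

```python
from collections import Counter

POSITIONS = ["G", "F", "C"]


def confusion_matrix_rows_predicted(actual, predicted, classes):
    """Return a matrix whose rows are predicted classes and columns actual classes."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(pred, act)] for act in classes] for pred in classes]


actual = ["G", "G", "F", "F", "C", "C"]
predicted = ["G", "F", "F", "F", "C", "F"]

matrix = confusion_matrix_rows_predicted(actual, predicted, POSITIONS)
for label, row in zip(POSITIONS, matrix):
    print(label, row)
# G [1, 0, 0]
# F [1, 2, 1]
# C [0, 0, 1]
```

Note that `sklearn.metrics.confusion_matrix` uses the opposite orientation: its rows are the actual classes and its columns the predicted ones.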

| Predicted pos (rows) / Actual pos (columns) | G | F | C | G-F | F-C |
|---|---|---|---|---|---|
| G | 347 | 46 | 1 | 15 | 0 |
| F | 39 | 212 | 18 | 7 | 21 |
| C | 1 | 30 | 86 | 0 | 18 |
| G-F | 89 | 56 | 0 | 17 | 0 |
| F-C | 15 | 73 | 44 | 2 | 33 |

The matrix shows that the model has no major problem telling the single positions (G, F, C) apart, but it struggles more with the two-role positions (G-F, F-C). It is far from perfect, but it will work just fine for some educated guessing or prediction.
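To put numbers on that reading, the sketch below takes the values from the confusion matrix above and computes the overall accuracy plus the share of each row that falls on the diagonal. Only the arithmetic is new; the counts are copied from the table.

```python
labels = ["G", "F", "C", "G-F", "F-C"]
matrix = [
    [347, 46, 1, 15, 0],
    [39, 212, 18, 7, 21],
    [1, 30, 86, 0, 18],
    [89, 56, 0, 17, 0],
    [15, 73, 44, 2, 33],
]

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)
print(f"accuracy: {correct}/{total} = {correct / total:.3f}")

# Share of each row that lies on the diagonal.
row_share = {
    label: row[i] / sum(row) for i, (label, row) in enumerate(zip(labels, matrix))
}
for label, share in row_share.items():
    print(f"{label}: {share:.3f}")
```

The overall accuracy comes out at 695 of 1,170, or about 59%. The diagonal share is about 85% for the G row but only about 10% for G-F and 20% for F-C, which matches the impression that the two-role positions are the hard ones.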

The visualized KNN algorithm looks like this:

We can clearly see that the areas for the G, F, and C positions are pretty well defined, while the G-F and F-C position areas are just patched around the middle.
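A common way to draw such decision areas is to classify every point of a dense grid and color the grid by the predicted class. The sketch below does this on made-up, already scaled data with three classes; it is an assumption about how a plot like this can be produced, not necessarily how the notebook does it.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up, already scaled (weight, height) points for three classes.
rng = np.random.default_rng(0)
centers = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Predict the class of every point on a grid covering the data.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200),
)
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, zz, alpha=0.3)       # colored decision areas
ax.scatter(X[:, 0], X[:, 1], c=y, s=10)  # training points on top
ax.set_xlabel("Weight (scaled)")
ax.set_ylabel("Height (scaled)")
```

Calling `fig.savefig("knn_regions.png")` afterwards writes the picture to disk.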

Finally, I wrote a small function that takes any height (m) and weight (kg) and predicts which position the person is best suited for. I used my own height and weight and got the following results:

```
predict_pos(1.90, 85)

Prob of being guard: 0.8
Prob of being forward: 0.0
Prob of being center: 0.0
Prob of being guard-forward: 0.2
Prob of being forward-center: 0.0
```

According to my model, I'm better suited to be a guard.
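A function like this could be built on `predict_proba`, as in the sketch below. Everything here is an assumption made for illustration: the training data is invented, the feature order is taken to be height then weight, the `POSITION_NAMES` order assumes alphabetical category codes, and this version takes the scaler and classifier as extra arguments, unlike the two-argument call shown above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumed order of the "Cat" codes; the notebook may use a different one.
POSITION_NAMES = ["center", "forward", "forward-center", "guard", "guard-forward"]


def predict_pos(height_m, weight_kg, scaler, classifier):
    """Print and return the share of the k nearest neighbors in each position."""
    # Scale the new sample with the scaler that was fitted on the training data.
    sample = scaler.transform([[height_m, weight_kg]])
    probabilities = classifier.predict_proba(sample)[0]
    for cat, prob in zip(classifier.classes_, probabilities):
        print(f"Prob of being {POSITION_NAMES[cat]}: {prob}")
    return {int(c): float(p) for c, p in zip(classifier.classes_, probabilities)}


# Made-up training data: [height in m, weight in kg] and a category code.
X = np.array([[1.85, 80], [1.88, 84], [1.91, 86], [1.93, 88], [1.96, 90],
              [2.03, 100], [2.06, 104], [2.08, 106], [2.13, 113], [2.16, 118]])
y = np.array([3, 3, 3, 3, 4, 1, 1, 2, 0, 0])

scaler = StandardScaler().fit(X)
classifier = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X), y)

result = predict_pos(1.90, 85, scaler, classifier)
```

With 5 neighbors and uniform weights, each probability is simply the fraction of neighbors in that class, so it can only be a multiple of 0.2. That is consistent with the 0.8 and 0.2 in the output above: 4 of the 5 nearest players were guards and 1 was a guard-forward.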

I learned a lot doing this analysis, particularly about the KNN classification method and its visualization. Although I know it is not perfect and there are a couple of things I could tweak to maybe make it a little bit better, I'm really happy with the result.

If you want to download the whole notebook and the data, you can go to this project's repo on GitHub.