Relationship between NBA players positions and their height and weight

With a k-nearest neighbors algorithm

May 2, 2019

Modern basketball has 5 player positions: point guard (PG), shooting guard (SG), small forward (SF), power forward (PF), and center (C). Each position requires a different set of skills, as each has different responsibilities on the court. Height and weight can also help a player excel at certain positions. For example, centers (the tallest and heaviest on the team) usually play near the baseline or close to the basket.

I decided to visualize the relationship between height, weight, and position among NBA players. Once I had gathered all the information, I wanted to build a model that could take custom height and weight parameters and predict which position those body measurements fit best.

Data

Basketball Reference (BR) is a great website for everything basketball related and the perfect source for all the information I needed. BR has a player directory that indexes all NBA and ABA players of all time by letter, with their respective position, height, and weight. There is just one problem: a player's individual page on BR lists their specific position or role, such as PG or SF, but the players-by-letter index uses the "old" or "original" positions, of which there are just 3: guard, forward, and center.

As much as I would have loved to use the 5 modern positions, scraping each individual player's page was far more complex and time-consuming than gathering the data from the players-by-letter pages. Fun fact: there are no players whose last name starts with the letter 'X'.
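For reference, the scraping pattern looks roughly like this. The HTML below is a trimmed-down stand-in for a players-by-letter page (BR's real markup, table id, and column layout differ), so treat the tags here as illustrative only:

```python
from bs4 import BeautifulSoup

# A toy stand-in for a BR "players by letter" page; the real markup
# differs, so the table id and cell order here are illustrative.
html = """
<table id="players">
  <tr><td>Alaa Abdelnaby</td><td>F-C</td><td>6-10</td><td>240</td></tr>
  <tr><td>Zaid Abdul-Aziz</td><td>F-C</td><td>6-9</td><td>235</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table", id="players").find_all("tr"):
    # Collect the cell text: [player, position, height, weight]
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows[0])
```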

I used BeautifulSoup 4 and some of the source code of PandasBasketball to scrape the data off BR. But there was another problem: some players can play up to 2 roles, which makes the possible positions G, F, C, F-G, G-F, F-C, and C-F, for a total of 7. I had to clean this up by merging the mirrored duplicates, F-G with G-F and C-F with F-C, which left me with only 5 positions, or sets of positions: G, F, C, G-F, and F-C.
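Merging the mirrored duplicates is a single `replace` on the positions column; this is a sketch on a toy frame, assuming the column is named `Pos` as in the data frame shown below:

```python
import pandas as pd

# Toy frame standing in for the scraped data.
df = pd.DataFrame({"Pos": ["G", "F-G", "C-F", "F", "C", "G-F", "F-C"]})

# Fold the mirrored two-role labels: F-G becomes G-F, C-F becomes F-C.
df["Pos"] = df["Pos"].replace({"F-G": "G-F", "C-F": "F-C"})

print(sorted(df["Pos"].unique()))  # ['C', 'F', 'F-C', 'G', 'G-F']
```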

After converting from imperial units to the metric system, the 4685x8 data frame looked like this:

# | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 2.082823 | 108.862169 | June 24, 1968 | Duke University
1 | Zaid Abdul-Aziz | 1969 | 1978 | F-C | 2.057423 | 106.594207 | April 7, 1946 | Iowa State University
2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 2.184426 | 102.058283 | April 16, 1947 | University of California, Los Angeles
n | ... | ... | ... | ... | ... | ... | ... | ...

If you want to download the whole dataset as a .csv file, click here.
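The unit conversion itself is straightforward: BR lists heights as feet-inches strings and weights in pounds. A minimal sketch using the standard conversion factors (the helper names are my own):

```python
def height_to_m(ht):
    # BR lists height as "feet-inches", e.g. "6-10".
    feet, inches = map(int, ht.split("-"))
    return (feet * 12 + inches) * 0.0254  # 1 inch = 0.0254 m

def weight_to_kg(wt):
    # BR lists weight in pounds; 1 lb = 0.45359237 kg.
    return float(wt) * 0.45359237

print(round(height_to_m("6-10"), 6))  # 2.0828
print(round(weight_to_kg("240"), 6))  # 108.862169
```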


Plotting all players' height (y) and weight (x) grouped by their position resulted in this image:


This is somewhat useful, but the plot felt a little too cramped, and some of the colors (the blue with the purple, the red with the orange) were not as distinguishable as I would have liked. So I decided to draw some pair plots, splitting the positions into groups. In seaborn, the pair plot function creates a grid of Axes such that each variable in the data is shared across the y-axes of a single row and the x-axes of a single column. The diagonal Axes are treated differently: they draw a plot showing the univariate distribution of the variable in that column.

The pair plot with all 5 positions is practically the same: too many points overlap each other, making it difficult to distinguish each group.


But using one group containing only the G, F, and C positions and another with G-F and F-C helped (me, at least) to better visualize how the positions spread out depending on height and weight.


Any basketball fan will tell you that the visualization does not provide any significant new information. We already knew that guards are shorter and lighter, making them perfect for dribbling, that centers are the tallest and heaviest to help protect the rim, and that forwards are versatile, sitting in between.

The G-F and F-C groups were unsurprising as well, each spanning its two individual positions across their respective areas.

KNN algorithm

The k-nearest neighbors algorithm (KNN) is a non-parametric method used for classification and regression. Given a new sample (like the green dot), the algorithm takes the sample's k nearest neighbors (the red triangles and the blue squares) by measuring the distance to each. It then assigns the green dot to either the triangle or the square category, depending on which is dominant among those neighbors. In this example, since 2 of the green dot's 3 nearest neighbors are red triangles, the new sample would be assigned to the triangle category.
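The voting scheme just described fits in a few lines of plain Python; this is a from-scratch sketch of the classification step, not the scikit-learn implementation used later:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; returns the majority label
    among the k points nearest to query."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Mirroring the example above: two red triangles and one blue square
# sit closest to the green dot, the other squares are far away.
train = [((0, 1), "triangle"), ((1, 0), "triangle"), ((1, 1), "square"),
         ((5, 5), "square"), ((6, 5), "square")]
print(knn_predict(train, (0.5, 0.5), k=3))  # triangle
```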


To apply this algorithm to my data set, I first had to create a new column called "Cat", numbered from 0 to 4 to represent each position. The updated data frame looked like this:

# | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | Cat
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 2.082823 | 108.862169 | June 24, 1968 | Duke University | 2
1 | Zaid Abdul-Aziz | 1969 | 1978 | F-C | 2.057423 | 106.594207 | April 7, 1946 | Iowa State University | 2
2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 2.184426 | 102.058283 | April 16, 1947 | University of California, Los Angeles | 0
n | ... | ... | ... | ... | ... | ... | ... | ... | ...
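Creating the "Cat" column is a single `map` over the positions. The sample rows above only fix C → 0 and F-C → 2; the codes I use here for G, F, and G-F are guesses at the remaining slots:

```python
import pandas as pd

# Category codes per position. C -> 0 and F-C -> 2 match the sample rows;
# the other three assignments are my assumption.
cat_map = {"C": 0, "F": 1, "F-C": 2, "G": 3, "G-F": 4}

df = pd.DataFrame({"Player": ["Alaa Abdelnaby", "Kareem Abdul-Jabbar"],
                   "Pos": ["F-C", "C"]})
df["Cat"] = df["Pos"].map(cat_map)
print(df["Cat"].tolist())  # [2, 0]
```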

I then split the data into training and test sets, scaled them, and built the actual classifier. The following block of code is an extract from this project's notebook. If you want to view or download the notebook, click here.

            from sklearn.model_selection import train_test_split
            from sklearn.preprocessing import StandardScaler
            from sklearn.neighbors import KNeighborsClassifier

            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
            sc_X = StandardScaler()
            X_train = sc_X.fit_transform(X_train)
            # Reuse the training fit here; refitting the scaler on the
            # test set would leak its distribution into the model.
            X_test = sc_X.transform(X_test)
            classifier = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
  , y_train)

To test how accurate my model was, I used a confusion matrix. A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).

                         Actual pos
Predicted pos     G     F     C   G-F   F-C
G               347    46     1    15     0
F                39   212    18     7    21
C                 1    30    86     0    18
G-F              89    56     0    17     0
F-C              15    73    44     2    33
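A matrix like this comes straight from scikit-learn's `confusion_matrix`; here is a sketch with toy labels (note that scikit-learn puts the actual class in rows and the predicted class in columns, the transpose of the table above):

```python
from sklearn.metrics import confusion_matrix

# Toy stand-ins for y_test and the classifier's predictions;
# 0..4 are the position category codes.
y_true = [0, 0, 1, 1, 2, 2, 3, 4]
y_pred = [0, 1, 1, 1, 2, 0, 3, 2]

# Rows = actual class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4])
print(cm)
```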

The matrix shows that the model has no major problem telling the single positions (G, F, C) apart, but it struggles more with the two-role positions (G-F, F-C). It is far from perfect, but it works well enough for an educated guess or prediction.

The visualized KNN algorithm looks like this:


We can clearly see that the areas for the G, F, and C positions are pretty well defined, while the G-F and F-C position areas are just patched around the middle.
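A region plot like this can be produced by predicting over a dense height-weight grid and colouring each cell by the predicted class; a sketch with toy data (the real plot uses the classifier fitted above):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Toy (height m, weight kg) samples with category codes 0..4,
# standing in for the real training set.
X = np.array([[1.88, 82], [1.91, 88], [2.01, 98], [2.06, 104], [2.13, 112],
              [1.85, 80], [1.95, 90], [2.03, 100], [2.08, 106], [2.16, 115]])
y = np.array([3, 4, 1, 2, 0, 3, 4, 1, 2, 0])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Predict on every point of a fine grid and colour each cell.
hh, ww = np.meshgrid(np.linspace(1.8, 2.2, 200), np.linspace(70, 125, 200))
zz = clf.predict(np.c_[hh.ravel(), ww.ravel()]).reshape(hh.shape)

plt.contourf(hh, ww, zz, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel("Height (m)")
plt.ylabel("Weight (kg)")
plt.savefig("knn_regions.png")
```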

Finally, I wrote a small function that takes any height (m) and weight (kg) and predicts which position that person is best suited for. I used my own height and weight and got the following results:

            predict_pos(1.90, 85)

Prob of being guard: 0.8
Prob of being forward: 0.0
Prob of being center: 0.0
Prob of being guard-forward: 0.2
Prob of being forward-center: 0.0
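A `predict_pos` like this can be a thin wrapper around `predict_proba`, which for KNN reports each class's share of the k nearest neighbours. This sketch re-fits a toy model because the notebook's fitted objects aren't shown here, and the category-to-name mapping is my assumption apart from C → 0 and F-C → 2:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy (height m, weight kg) training data with category codes 0..4;
# the real model is fitted on the full BR frame.
X = np.array([[2.13, 112], [2.16, 115], [2.01, 98], [2.03, 100],
              [2.08, 106], [2.06, 104], [1.88, 82], [1.85, 80],
              [1.91, 88], [1.95, 90]])
y = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
# Only C -> 0 and F-C -> 2 are fixed by the article; the rest are guesses.
names = {0: "center", 1: "forward", 2: "forward-center",
         3: "guard", 4: "guard-forward"}

sc = StandardScaler().fit(X)
clf = KNeighborsClassifier(n_neighbors=5).fit(sc.transform(X), y)

def predict_pos(height_m, weight_kg):
    """Return each position's share of the k nearest neighbours."""
    probs = clf.predict_proba(sc.transform([[height_m, weight_kg]]))[0]
    return {names[cat]: p for cat, p in zip(clf.classes_, probs)}

for pos, p in predict_pos(1.90, 85).items():
    print(f"Prob of being {pos}: {p}")
```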

According to my model, I'm better suited to be a guard.

Final notes

I learned a lot doing this analysis, particularly about the KNN classification method and its visualization. Although I know the model is not perfect and there are a couple of things I could tweak to make it a little better, I'm really happy with the result.

If you want to download the whole notebook and the data, you can go to this project's repo on GitHub.