Where do NBA players come from?

A visualization with geopandas and matplotlib

March 14, 2019

This was one of my first projects involving spatial data and geopandas. I made a small reddit post about it; if you want more information and some Q&A, I suggest you give the post a look.

This is not intended to be a tutorial but a kind-of-vague explanation of how I did it. You can check out my GitHub or this project's repository if you want to download both the data and source code. To view the notebook online click here.

I decided to split this project into three areas: reading, preparing, and visualizing the data. The following blocks of code are extracts from my complete notebook which can be found here.

Reading the data

First things first: getting the data. Natural Earth Data is a great source of geospatial information in the form of .shp files; in this case I downloaded the states and provinces dataset at 1:10m scale. Once downloaded, the files need to be read and, since this dataset covers every country in the world, I had to select only the states that belong to the U.S.A.

        import geopandas as gpd
        import pandas as pd

        states = gpd.read_file("ne_10m_admin_1_states_provinces.shp")
        usa_states = states[states["admin"] == "United States of America"]
        players_usa_states = pd.read_csv("players_usa_states.csv")

The last line of the previous block reads an extra .csv file I put together, with the number of players each American state has ever produced and that state's population.
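
To make the shape of that extra file concrete, here is a minimal sketch of how players_usa_states.csv might be laid out and read. The column names and the three rows are taken from the merged frame shown later in this post; the exact layout of the real file is an assumption, and the file is inlined with io.StringIO only so the snippet runs on its own.

```python
import io

import pandas as pd

# Assumed layout of players_usa_states.csv: the columns are inferred from
# the merged frame, not copied from the real file.
csv_text = """state,player_count,pop,urban%
Alabama,83,4887871,59.0
Alaska,1,737438,66.0
Arizona,15,7171646,89.8
"""

# Reading from a string buffer stands in for pd.read_csv("players_usa_states.csv").
players_usa_states = pd.read_csv(io.StringIO(csv_text))
print(players_usa_states.dtypes)
```

The only thing the later merge really needs is a "state" column whose values match the state names in the geodataframe.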

Preparing the data

Although we now have all the information in regular dataframes and geodataframes, we still need to make some adjustments to the data. I started with the basics: keeping only the useful columns, changing the Coordinate Reference System (CRS), and sorting by state name.

        usa_states = usa_states[["state", "geometry"]]
        usa_states = usa_states.to_crs(epsg=3395)  # World Mercator, units in meters
        usa_states = usa_states.sort_values("state")

The next step was to merge both frames, but before doing that I had to make sure their shapes matched. After merging we can peek at how the final frame looks with .head(3).

        usa_states = pd.merge(usa_states, players_usa_states, on="state")

             state  geometry  player_count      pop  urban%
        0  Alabama   POLYGON            83  4887871    59.0
        1   Alaska   POLYGON             1   737438    66.0
        2  Arizona   POLYGON            15  7171646    89.8
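
The pre-merge check mentioned above compared the shapes of the two frames. A stricter alternative, sketched here and not taken from the notebook, is to compare the keys themselves with an outer merge and indicator=True. The two toy frames and the made-up "Atlantis" row are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for the two frames; "Atlantis" is a made-up row that
# exists only to demonstrate a mismatch.
left = pd.DataFrame({"state": ["Alabama", "Alaska", "Arizona"]})
right = pd.DataFrame(
    {"state": ["Alabama", "Alaska", "Atlantis"], "player_count": [83, 1, 0]}
)

# An outer merge keeps every key from both sides; the "_merge" column
# says whether each key was found in the left frame, the right one, or both.
check = left.merge(right, on="state", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["state", "_merge"]])
```

Two frames can have the same number of rows and still disagree on a state name, so a key-level check like this catches mismatches that a shape comparison would miss.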

Visualizing the data

Now comes the fun part: visualizing! The column argument tells .plot which column's values to use when coloring the choropleth map. The ax.set call simply sets the range of coordinates to be rendered. In this case we are focusing on the United States and, to keep things simpler and cleaner, I decided not to include Alaska or Hawaii.

        import matplotlib.pyplot as plt

        fig, ax = plt.subplots(figsize=(10,5))
        usa_states.plot(ax=ax, column="player_count", cmap="YlGn", edgecolor="k", legend=True)
        ax.set_title("Number of NBA players by state of origin")
        # axis limits are in projected meters (EPSG:3395), not degrees
        ax.set(xlim=(-1.4*10**7, -0.74*10**7), ylim=(0.2750*10**7, 0.65*10**7))
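
The limits passed to ax.set look odd because, after the switch to EPSG:3395, coordinates are in meters and not degrees. As a rough sanity check, the sketch below inverts the Mercator easting formula x = a * lon (lon in radians, a being the WGS 84 semi-major axis) to recover the longitudes behind xlim. The helper name is hypothetical, and only the x axis is handled because the y direction needs the more involved ellipsoidal formula.

```python
import math

# Semi-major axis of the WGS 84 ellipsoid, in meters.
WGS84_SEMI_MAJOR_AXIS_M = 6378137.0


def mercator_x_to_longitude(x_m):
    """Invert the Mercator easting x = a * lon (lon in radians) to degrees."""
    return math.degrees(x_m / WGS84_SEMI_MAJOR_AXIS_M)


# The xlim values used in the plot above.
x_min, x_max = -1.4 * 10**7, -0.74 * 10**7
lon_west = mercator_x_to_longitude(x_min)
lon_east = mercator_x_to_longitude(x_max)
print(round(lon_west, 2), round(lon_east, 2))
```

The window comes out at roughly 126°W to 66.5°W, which is about the east-west extent of the contiguous United States.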

It looks good but eh... it doesn't tell you much. You can see that the states with large numbers of NBA players are also the ones with big populations (like California and New York). Therefore this is basically just a population map and not very interesting (relevant xkcd).

To "fix" this I decided to plot maps with the number of NBA players by state of origin per 10,000 population, and per 10,000 population living in urban areas. To achieve this, though, I had to add some new columns to the dataframe.

        usa_states["per_10k"] = (usa_states["player_count"] / usa_states["pop"]) * 10000
        usa_states["urban_pop"] = usa_states["pop"] * (usa_states["urban%"] / 100)
        usa_states["per_10k_urban"] = (usa_states["player_count"] / usa_states["urban_pop"]) * 10000

[Figures: NBA players by state per 10,000 population; NBA players by state per 10,000 urban population]

Now this is more useful!
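
As a quick worked example of those three columns, here is the same arithmetic applied to the three rows shown earlier by .head(3). The wrapper function add_rates is a hypothetical name introduced only for this sketch, and a plain DataFrame stands in for the geodataframe.

```python
import pandas as pd


def add_rates(df):
    """Return a copy of df with the per-10k columns described above."""
    out = df.copy()
    out["per_10k"] = (out["player_count"] / out["pop"]) * 10000
    # "urban%" is a percentage, so divide by 100 to get a fraction.
    out["urban_pop"] = out["pop"] * (out["urban%"] / 100)
    out["per_10k_urban"] = (out["player_count"] / out["urban_pop"]) * 10000
    return out


# The three rows shown by .head(3) earlier in the post.
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "player_count": [83, 1, 15],
        "pop": [4887871, 737438, 7171646],
        "urban%": [59.0, 66.0, 89.8],
    }
)

rates = add_rates(sample)
print(rates[["state", "per_10k", "per_10k_urban"]].round(3))
```

Because the urban population is never larger than the total population, the urban rate of a state is always at least as high as its overall rate.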

Conclusion and final notes

We can see that California is the biggest producer of professional players with 395, followed by New York with 356 but, as stated before, this is not a big surprise since their populations are quite big. The state with the most players per 10,000 population is the District of Columbia, with just a bit over 1 player. The second and third are Mississippi and Louisiana, with 0.29 and 0.25 players per 10,000 respectively. The big difference between the per-10k values of the first and second states is the reason I decided to explicitly set the max value of the color bar; otherwise the map would have looked mostly white. Alaska and Hawaii have produced 1 and 2 professional basketball players respectively.
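
To see why capping the color bar matters, the sketch below uses matplotlib's Normalize to show where each value lands on the 0-to-1 color scale with and without a cap. The 0.29 and 0.25 values come from the paragraph above; 1.0 is a stand-in for the District of Columbia's "bit over 1", and the cap of 0.3 is a hypothetical choice, since the value actually used in the notebook is not stated here.

```python
from matplotlib.colors import Normalize

# Mississippi and Louisiana values are from the post; the D.C. value
# is an illustrative stand-in for "just a bit over 1".
per_10k = {"District of Columbia": 1.0, "Mississippi": 0.29, "Louisiana": 0.25}

# Default behaviour: the scale stretches up to the largest value.
default_norm = Normalize(vmin=0, vmax=max(per_10k.values()))
# Capped scale: 0.3 is a hypothetical cap; anything above it is clipped to the top.
capped_norm = Normalize(vmin=0, vmax=0.3, clip=True)

for state, value in per_10k.items():
    print(
        state,
        round(float(default_norm(value)), 2),
        round(float(capped_norm(value)), 2),
    )
```

With the default scale the second-highest state sits below a third of the color range, so nearly every state is drawn in the palest shades. With the cap it lands near the top and the remaining states spread across the whole range. In geopandas the same effect should be available by passing vmax to .plot alongside column and cmap.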

Also, if you are wondering why Michigan looks funny, it is because it was plotted with its legal boundary, which includes a lot of water.

This was a small, fun project which helped me better understand spatial data and its manipulation and visualization using Python. I'm happy with how it turned out and I may explore some variations, like how many points (or blocks, rebounds, etc.) each state has yielded.

The number of NBA players by state comes from basketball-reference; the population numbers come from Wikipedia. You can find the complete notebook here.

Bonus: Europe!

The European continent.