Yucatán Traffic Accidents

Analysis and regression exercise

April 1, 2019

I recently started an online course on machine learning with Python, so in order to practice the first unit, which was regression, I decided to do a small exercise using a traffic accident dataset. I also took the chance to play around with seaborn as the visualization tool, instead of my usual matplotlib.

You can find both the source code and the data for this exercise here. You can also check out my GitHub for my other projects.

The data

The dataset used in this exercise (provided by INEGI) includes information on all traffic accidents in Mexico, broken down by state and into three indicators: the total number of traffic accidents, people injured in traffic accidents, and people killed in traffic accidents. The data spans from 1997 up to 2017.

Basic Information

After reading the data and selecting only the rows corresponding to Yucatán, the first thing I did was get some basic information, like the total number of accidents, people injured, and people killed. Here, yuc_accidents, yuc_injured, and yuc_deadly are subsets of a larger dataframe, keeping only the accidents that occurred in the 31st state.
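The subsetting can be sketched like this. To keep the sketch runnable I use a toy dataframe instead of the real file, and every column name except Valor (plus the indicator labels) is an assumption, so adjust them to whatever the INEGI CSV actually uses:

```python
import pandas as pd

# Toy stand-in for the INEGI table: one row per state/indicator,
# with the count in a "Valor" column. Every name except "Valor"
# is an assumption for this sketch.
df = pd.DataFrame({
    "Entidad": ["Yucatán", "Yucatán", "Yucatán", "Jalisco"],
    "Indicador": ["Accidentes", "Heridos", "Muertos", "Accidentes"],
    "Valor": [119692, 81090, 1514, 50000],
})

# Keep only Yucatán, then split it by indicator.
yucatan = df[df["Entidad"] == "Yucatán"]
yuc_accidents = yucatan[yucatan["Indicador"] == "Accidentes"]
yuc_injured = yucatan[yucatan["Indicador"] == "Heridos"]
yuc_deadly = yucatan[yucatan["Indicador"] == "Muertos"]

print(yuc_accidents.Valor.sum())  # 119692
print(yuc_injured.Valor.sum())    # 81090
print(yuc_deadly.Valor.sum())     # 1514
```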


The previous block of code yielded a total of 119692 traffic accidents, in which 81090 people were injured and 1514 people were killed.

The next thing I did was calculate the probability of getting injured or killed after suffering a traffic accident.

            print((yuc_injured.Valor.sum() / yuc_accidents.Valor.sum()) * 100)
            print((yuc_deadly.Valor.sum() / yuc_accidents.Valor.sum()) * 100)

There's a 67.7% and a 1.26% chance of getting injured or killed, respectively, if you are in a traffic accident while in Yucatán. I know this depends on many other factors, like how fast you were going or whether you were intoxicated, but as a quick probability I think it's okay. I guess the best way to put it is that almost 70% of traffic accidents end with people injured and about 1% with people killed? But now that I think of it, I am dividing a number of people by a number of accidents (one accident can injure several people), so these aren't really per-accident probabilities. Eh, whatever, if you know exactly what those 67.7% and 1.26% represent, hit me up at info@amedpal.com and tell me how (most likely) I'm wrong, please.


This time I decided to give seaborn a chance, since I always go with matplotlib to visualize my data. I gotta say, I liked it a lot. I made this stacked bar plot way more easily than I would have using matplotlib, though I did use some of matplotlib's elements to render the legend and to get rid of the upper and right spines.
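For reference, the layering trick behind a plot like this is to call barplot twice on the same axes, taller series first. The yearly numbers below are made up just to keep the sketch self-contained; the real plot uses the INEGI totals:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up yearly figures, just to show the layering trick.
years = [2015, 2016, 2017]
accidents = [5400, 5558, 5600]
injured = [3700, 3800, 3900]

fig, ax = plt.subplots()
# Draw the taller series first, then the shorter one on top of it.
sns.barplot(x=years, y=accidents, color="steelblue", label="Accidents", ax=ax)
sns.barplot(x=years, y=injured, color="darkorange", label="Injured", ax=ax)

# The matplotlib touch-ups: the legend and the spines.
ax.legend()
sns.despine(ax=ax)
fig.savefig("accidents_by_year.png")
```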



Now onto what made me start this whole exercise in the first place: regressions. Regression analysis is basically (as I understand it) a way to find relationships between variables, which can then be used to make predictions. So that's what I did (or at least tried to).

I decided to use three regressors from the scikit-learn module: the linear regressor, the polynomial (degree=4) regressor, and the random forest regressor (100 estimators). I will not get into the math behind them (mainly because I don't fully understand it), but feel free to read up on it if you wish; it is quite clever.
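Setting the three of them up looks roughly like this. The yearly series here is synthetic (the real fit uses the INEGI data), and wiring the polynomial one through a pipeline is just my way of doing it, not necessarily the only one:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the 1997-2017 accident totals.
X = np.arange(1997, 2018).reshape(-1, 1)
rng = np.random.default_rng(0)
y = 4000 + 150 * (X.ravel() - 1997) + rng.normal(0, 300, len(X))

# The three regressors compared in this post.
linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=4), LinearRegression()).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
```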


The black dots are the original data and the colored lines are the indicated regressors. As we can see, the linear regressor doesn't fit the dataset very well, while the polynomial and random forest ones appear to be a better option. But just to make sure, I ran some metrics to calculate their score, i.e. how well they fit the data.

I used the variance score and r2 score metrics for all three types of regressions and got the following results:

Variance score linear regressor: 0.446
Variance score polynomial regressor: 0.332
Variance score random forest: 0.962

r2 score linear regressor: 0.250
r2 score polynomial regressor: 0.006
r2 score random forest regressor: 0.895

A score of 1.0 is the best possible result, which means the most effective was the random forest regressor, followed by the linear regressor, not the polynomial one. This actually came as a surprise to me since, from my perspective, it looked like the polynomial regressor was a better fit.


Now that I know which regressor fits best, it's time for some predictions! I decided to predict the total number of accidents for an unseen year. The only problem was that, since the random forest regressor produces step-like (piecewise constant) predictions, any year from 2016 onwards yields the same result: 5558.84 total traffic accidents.

With the next best thing, however, the linear regressor, I got some believable results for the years 2018 and 2019: 7000.10 and 7130.52 total accidents, respectively. Now to wait for INEGI to update the dataset.
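The extrapolation itself is just a predict call on the fitted linear regressor. Sketched here with synthetic data, so the numbers it produces are illustrative, not the 7000.10 and 7130.52 above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the 1997-2017 yearly accident totals.
X = np.arange(1997, 2018).reshape(-1, 1)
rng = np.random.default_rng(0)
y = 4000 + 150 * (X.ravel() - 1997) + rng.normal(0, 300, len(X))

linear = LinearRegression().fit(X, y)

# Extrapolate to the two years missing from the dataset.
future = np.array([[2018], [2019]])
print(linear.predict(future))
```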


Although this was meant to be just a small personal exercise, I feel it was kind of rushed and could have turned out better. I'm happy with the plots, though. I guess the only thing left to do is keep on working and practicing.