April 1, 2019

I recently started an online course on machine learning with Python, so in order to practice the first unit, which was regression,
I decided to make a small exercise using a traffic accident dataset. I also took the chance to play around
with **seaborn** as the visualization tool, instead of my usual matplotlib.

You can find both the source code and the data for this exercise here. You can also check out my GitHub for my other projects.

The dataset used in this exercise (provided by INEGI) includes the information for all traffic accidents in Mexico, divided into
states and three indicators: **total number of traffic accidents**, **people injured in traffic accidents**, and
**people killed in traffic accidents**. The data spans from 1997 up to 2017.

After reading the data and selecting only the rows corresponding to Yucatán, the first thing I did was get some basic information,
like the total number of accidents, people injured, and people killed. Here *yuc_accidents*, *yuc_injured*,
and *yuc_deadly* are subsets of a larger dataframe containing the accidents that occurred in the 31st state.
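In case it helps, the subsetting can be sketched like this. Every column name except *Valor* is a guess for illustration, and the toy frame below just stands in for the real INEGI file:

```python
import pandas as pd

# Toy frame mimicking the assumed layout of the INEGI dataset;
# all column names except "Valor" are guesses for illustration.
df = pd.DataFrame({
    "Entidad":   ["Yucatán", "Yucatán", "Yucatán", "Jalisco"],
    "Indicador": ["Accidentes", "Heridos", "Muertos", "Accidentes"],
    "Año":       [2017, 2017, 2017, 2017],
    "Valor":     [5559, 3700, 70, 12000],
})

# Keep only the rows for Yucatán, then split by indicator.
yucatan = df[df["Entidad"] == "Yucatán"]
yuc_accidents = yucatan[yucatan["Indicador"] == "Accidentes"]
yuc_injured = yucatan[yucatan["Indicador"] == "Heridos"]
yuc_deadly = yucatan[yucatan["Indicador"] == "Muertos"]
```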

```python
yuc_accidents.Valor.sum()
yuc_injured.Valor.sum()
yuc_deadly.Valor.sum()
```

The previous block of code yielded a total of **119692 traffic accidents**, in which **81090 people got injured** and
**1514 others were killed**.

The next thing I did was calculate the probability of getting yourself injured or killed after having suffered a traffic accident.

```python
(yuc_injured.Valor.sum() / yuc_accidents.Valor.sum()) * 100
(yuc_deadly.Valor.sum() / yuc_accidents.Valor.sum()) * 100
```

There's a **67.7%** and a **1.26%** chance of getting injured or killed, respectively, if you are in a traffic accident while in
Yucatán. I know this depends on many other factors, like how fast you were going or whether you were intoxicated, but as a quick probability
I think it's okay. I guess the best way to put it is that almost 70% of traffic accidents end with people injured and about 1% with people
killed? But now that I think about it, I am dividing people by number of accidents (and one accident can injure several people), so maybe it doesn't quite make sense. Eh, whatever,
if you know exactly what those 67.7% and 1.26% represent, hit me up at **info@amedpal.com** and tell me how (most likely) I'm wrong, please.

This time I decided to give seaborn a chance, since I always go with matplotlib to visualize my data. I gotta say, I liked it a lot. I made this stacked bar plot way more easily than I would have with matplotlib, though I did use some of matplotlib's elements to render things like the legend and to get rid of the upper and right spines.
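For reference, a stacked bar can be faked in seaborn by drawing two bar plots on the same axes, tallest series first. The numbers here are made up; `sns.despine` is the helper that drops the upper and right spines:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up yearly totals standing in for the real Yucatán series.
years = [2014, 2015, 2016, 2017]
accidents = [5100, 5300, 5500, 5559]
injured = [3400, 3500, 3600, 3700]

fig, ax = plt.subplots()

# Stacking trick: draw the taller series first, then the shorter one
# over it, so the visible remainder reads as the upper segment.
sns.barplot(x=years, y=accidents, color="steelblue", label="accidents", ax=ax)
sns.barplot(x=years, y=injured, color="salmon", label="injured", ax=ax)

# The matplotlib bits: the legend, plus removing the upper/right spines.
ax.legend()
sns.despine(ax=ax)
```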

Now onto what made me start this whole exercise in the first place: regressions. Regression analysis is basically (as far as I understand it) a way to find relationships between variables, which can then be used to make predictions. So, that's what I did (or at least tried to).

I decided to use three regressors from the **scikit-learn** module: the **linear** regressor, the **polynomial** (degree=4)
regressor, and the **random forest** regressor (100 estimators). I will not get into the math behind them (mainly because
I don't fully understand it), but you are free to read up on it if you wish; it is quite clever.
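As a rough sketch of what those three fits look like in scikit-learn (on made-up data, since the real series lives in the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up yearly accident counts standing in for the 1997-2017 series.
rng = np.random.RandomState(0)
years = np.arange(1997, 2018).reshape(-1, 1)
counts = 4000 + 80 * (years.ravel() - 1997) + rng.normal(0, 200, len(years))

# Plain linear regressor.
linear = LinearRegression().fit(years, counts)

# Polynomial (degree=4) regressor: expand the year into polynomial
# features, then fit a linear model on top of them.
poly = PolynomialFeatures(degree=4)
poly_reg = LinearRegression().fit(poly.fit_transform(years), counts)

# Random forest regressor with 100 estimators.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(years, counts)
```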

The black dots are the original data and the colored lines are the indicated regressors. As we can see, the linear regressor doesn't
fit the dataset very well, while the polynomial and random forest ones *appear* to be better options. But just to make sure,
I ran some metrics to calculate their scores, i.e., how well they fit the data.

I used the **variance score** and **r2 score** metrics for all three types of regressions and got the following results:

- Variance score, linear regressor: **0.446**
- Variance score, polynomial regressor: **0.332**
- Variance score, random forest regressor: **0.962**
- r2 score, linear regressor: **0.250**
- r2 score, polynomial regressor: **0.006**
- r2 score, random forest regressor: **0.895**
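Both metrics come straight from `sklearn.metrics`; here is a minimal sketch of the two calls, with made-up actual/predicted values. Both scores top out at 1.0; explained variance ignores a constant bias in the predictions, while r2 penalizes it, which is why the two can disagree:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Made-up actual vs. predicted accident counts, just to show the calls.
y_true = np.array([5100, 5300, 5500, 5559])
y_pred = np.array([5000, 5350, 5450, 5600])

variance = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```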

Since 1.0 is the best possible result, the **most effective was the random forest regressor**, followed by the linear
regressor, not the polynomial one. This actually came as a surprise to me since, from my perspective, it looked like the polynomial
regressor was the better fit.

Now that I know which regressor fits best, it is time for some predictions! I decided to predict the total number of accidents
for an unknown year. The only problem is that since the random forest regressor predicts in *"steps"* (it can only average values
it has already seen, so it can't extrapolate past the training range), any year from 2016 onwards yields the same result: **5558.84 total traffic accidents**.

With the next best thing, however, the linear regressor, I got some believable results for the years 2018 and 2019:
**7000.10** and **7130.52** total accidents, respectively. Now to wait for INEGI to update the dataset.

Although this was meant to be just a small personal exercise, I feel it was kind of rushed and could have turned out better. I'm happy with the plots, though. I guess the only thing left to do is keep on working and practicing.