**Applied Statistics 2019 – Final Project**

**Content:**

This page presents the group project for the Applied Statistics course taught by Professor Secchi, which I attended in 2019 at the Politecnico di Milano.

We decided to report here all the analyses we carried out and the results we obtained, in order to be as clear as possible.

For this reason, we focused on the methodologies we used, and we also tried to produce some (hopefully) useful conclusions for the future.

**Group:**

Benedetta Maria Argenio

Alberto Cavarzeran

Andrea Pasotti

Federica Principe

Maria Rombolotti

**Table of contents:**

- Introduction
- Variability analysis – ANOVA
- MDS (Multidimensional Scaling)
- Charge point clustering – DBSCAN
- Prediction for new data – Kriging
- Conclusions
- Bonus part

**Introduction**

Electric mobility has become a very hot topic.

Ireland has invested heavily in it: in 2008, aiming to reduce the use of fossil fuels, the government set the target of having at least 10% of the national car fleet (equivalent to 200,000 units) made up of electric vehicles by 2020.

Unfortunately, these expectations were not fulfilled and the original target was dropped in 2017, as stated by the Irish Minister for the Environment in these few lines.

“There are about 5500 EVs in the country, the 0.26% of total licenced cars and it has been predicted that there will be just 8000 EVs on the country’s roads by 2020.”

Denis Naughten, Minister for Communications, Climate Action and Environment [19 July 2017]

To further support the use of electric cars, the Irish government has built an extensive infrastructure of charging stations, comprising a total of 343 charge points (which, at the moment, are free to use).

As you can see from this picture, there are two types of charge point: the standard one, which needs 6 to 8 hours to charge a typical e-car, and the fast one, which can charge an electric vehicle up to 80% in just 25 minutes.

Our goal was to understand the strengths and weaknesses of this infrastructure of charging stations, and to identify possible reasons for its overall low usage.

To do so, we were provided with a dataset containing the characteristics of the charge points: an ID number, the type (Standard or Fast), the position detail (urban, rural, industrial, motorway, shopping center, commercial), as well as other properties.

For each charging station we then had a second dataset recording the status of each charging unit (1: currently in use / 0: not in use), sampled every 5 minutes for every day of 2018.

We started from these datasets by aggregating part of them (for example, we computed the average annual usage of each charge point) and then performed several statistical analyses.
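As an illustration of this aggregation step, here is a minimal sketch in base R. The data frame `status` and its columns `cp_id` and `in_use` are hypothetical stand-ins for the real status dataset, whose actual layout is not shown on this page; the toy data covers a single day rather than the whole year.

```r
# Hypothetical toy data: one row per (charge point, 5-minute slot),
# in_use = 1 if the unit is occupied in that slot, 0 otherwise.
set.seed(1)
status <- data.frame(
  cp_id  = rep(c("CP1", "CP2", "CP3"), each = 288),  # 288 slots = one day
  in_use = c(rbinom(288, 1, 0.30), rbinom(288, 1, 0.05), rep(0, 288))
)

# Average usage = share of 5-minute slots in which the unit was occupied.
usage <- aggregate(in_use ~ cp_id, data = status, FUN = mean)
names(usage)[2] <- "avg_use"
usage$avg_use_pct <- 100 * usage$avg_use
print(usage)
```

The same call applied to a full year of slots would give the average annual usage per charge point.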

The aim was to understand which charge points were the busiest, and why.

We studied the spatio-temporal patterns of charge point usage in Ireland and its global variability, along with its possible dependence on different external factors.

In this gallery, you can see how the number of charge points in each county is qualitatively correlated with the population of that county.

While making this analysis, we noticed that some stations were used far less than others, and that the usage peaks were concentrated around the capital city of Dublin.

After these first considerations, we formulated the driving questions for our whole analysis: what are the optimal positions for new stations? Does it make sense to add new charge points to increase the overall use of the whole infrastructure?

**Variability analysis – ANOVA**

To detect any significant difference in usage between the various locations, we performed three separate ANOVA tests, one for each main grouping (Type, Position Detail, Area).

To build the most reliable model, we estimated the annual usage from the months of 2018 in which there was no complete shutdown of the charge points. We did not spot any particular usage pattern across the months, so we focused on the most representative ones: March, April, June, September, October and December.

Here are some explanatory boxplots for the variable of interest: the average use.

After this, we tested the overall Gaussianity of our data:

Then, even though it was not strictly necessary at this point, we decided to transform the data via the Box-Cox transformation suggested by the function powerTransform().
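To make the idea behind this step concrete, here is a minimal base-R sketch of what a Box-Cox estimate does. The original analysis relied on powerTransform() (from the car package); the hand-written functions `box_cox()` and `profile_loglik()` below are illustrative stand-ins, and the data are simulated, not the real usage values.

```r
# Box-Cox transform: log(x) for lambda = 0, (x^lambda - 1) / lambda otherwise.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for a single positive sample.
profile_loglik <- function(lambda, x) {
  z <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(x))
}

# Simulated, right-skewed "average use" values (hypothetical data).
set.seed(42)
avg_use <- rlnorm(200, meanlog = -2, sdlog = 0.8)

# Pick the lambda that maximises the profile log-likelihood.
lambda_hat <- optimize(profile_loglik, interval = c(-2, 2),
                       x = avg_use, maximum = TRUE)$maximum
transformed <- box_cox(avg_use, lambda_hat)

# Compare Gaussianity before and after the transformation.
print(lambda_hat)
print(shapiro.test(avg_use)$p.value)
print(shapiro.test(transformed)$p.value)
```

The transform is only defined for strictly positive values, so charge points with zero usage would need to be shifted or excluded first.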

The first grouping we analyzed was *Fast* vs *Standard* charge points, i.e. the charging speed of the charge point.

Here is the boxplot for the data.

We first tested and verified the hypotheses of Gaussianity within groups and homogeneity of variance between groups, and then we fitted the ANOVA model.
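A sketch of this workflow in base R is shown below. The specific tests (Shapiro-Wilk within groups, Bartlett between groups) are an assumption, since this page does not say which ones were used, and the data frame `cp` is simulated.

```r
# Simulated data (hypothetical): transformed average use for two types.
set.seed(7)
cp <- data.frame(
  type    = factor(rep(c("Fast", "Standard"), times = c(40, 120))),
  avg_use = c(rnorm(40, mean = 0.10, sd = 0.03),
              rnorm(120, mean = 0.10, sd = 0.03))
)

# Gaussianity within each group: one Shapiro-Wilk p-value per group.
shapiro_p <- tapply(cp$avg_use, cp$type,
                    function(v) shapiro.test(v)$p.value)

# Homogeneity of variances between groups (Bartlett test).
bartlett_p <- bartlett.test(avg_use ~ type, data = cp)$p.value

# One-way ANOVA: is the mean usage the same across groups?
fit <- aov(avg_use ~ type, data = cp)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

print(shapiro_p)
print(bartlett_p)
print(anova_p)
```

Large p-values in the first two checks mean the model assumptions are not rejected; the ANOVA p-value then addresses the difference in means.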

From this analysis we concluded that there was no evidence of a difference in the average usage between *Fast* and *Standard* charge points.

As previously stated, though, each *Fast* charge point performs three times better than a *Standard* one (on average), so we concluded that *Fast* charge points contribute the most in terms of number of vehicles recharged.

The second grouping we analyzed was the position detail:

We tested the model hypotheses over the groups:

    # p-values for each class:
    c1$p <- 0.14586
    c2$p <- 0.09752
    c3$p <- 0.77797
    c4$p <- 0.36122
    c5$p <- 0.26376

and then we fitted the model:

There is evidence of a difference in the average usage, so we continued the analysis by constructing the univariate confidence intervals and evaluating the respective p-values of our tests:

As we can see, the only class with a significantly different average usage is the Motorway position, which has lower usage.

The last grouping concerned the geographical area of the charge points:

Again, we verified the hypotheses and fitted the ANOVA model, which gave us this result:

There is enough evidence to believe that there is a difference in the average usage between the groups; evaluating the univariate confidence intervals, we find a significant difference in usage between the City group and both of the others, as we can see from the p-values for the mean differences between the classes:

    # {H0: mu_i = mu_j | H1: H0^c}
    # p
    [1] 0.00422 0.00072 0.37451
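Pairwise p-values of this kind can be obtained, for instance, with pairwise t-tests after the ANOVA. The sketch below uses base R's pairwise.t.test() with a Bonferroni correction; both the correction and the simulated data are assumptions, since the page does not state how the intervals were built.

```r
# Simulated data (hypothetical): average use for three area groups.
set.seed(3)
area <- factor(rep(c("City", "Country", "Town"), each = 50))
avg_use <- c(rnorm(50, 0.14, 0.03),   # City: higher mean by construction
             rnorm(50, 0.10, 0.03),   # Country
             rnorm(50, 0.10, 0.03))   # Town

# Pairwise t-tests with pooled standard deviation, Bonferroni-adjusted.
pw <- pairwise.t.test(avg_use, area, p.adjust.method = "bonferroni")
print(round(pw$p.value, 5))
```

The result is a lower-triangular matrix with one adjusted p-value per pair of groups.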

We concluded that the average usage of the charge points located in big cities is significantly higher than in the other areas, but there is no significant difference between the usage of those located in the Country and Town groups.

**MDS (Multidimensional Scaling)**

Now we’ll get back to the study of our dataset in the framework of Geostatistics.

The first thing we asked ourselves when we started working on the data was how to be as close to reality as possible in the representation of our spatial data.

As you might already know, in the most generic geostatistics problems it is sufficient to have the coordinates of a certain number of points (belonging to a geographic domain D) in order to evaluate the distance between any two of them with a Euclidean approach.

Our concern was that, in our specific situation, the Euclidean distance was not very faithful to reality. This is because our units/observations lie on a road network, and the Euclidean representation of the problem simply does not take this into account.

In order to solve this problem, we decided to create a matrix of “real distances”, based on the data provided by the API of Google Maps.

{Editor's note: at first we tried to circumvent the problem of creating an account on the Google Maps API Platform by performing the calculation “by hand” on the online website… And this is when we realized that we had 343 charging stations in our dataset and that 343^2 was a pretty high number, even for the most committed students}

After “just” 6 hours of running the API code {Editor's note: and $476 spent from the beginner bonus provided by Google}, we were able to look at our stunning 343×343 matrix.

Here we noticed that the “Google distance” was not a distance in the mathematical sense at all.

Apart from positivity, none of the other properties was fulfilled:

- The distance from A to B was almost never equal to that from B to A
- The triangle inequality didn’t hold for a lot of triples

We solved the problem of asymmetry by taking a weighted average of the two distances; unfortunately, we basically ignored the failure of the triangle inequality.
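The two checks above can be sketched in base R as follows. The 4×4 matrix `G` is made up for illustration, the equal weight `w = 0.5` is an assumption (the actual weights are not given here), and `count_triangle_violations()` is a hypothetical helper.

```r
# Hypothetical one-way driving distances in metres; entry [i, j] is i -> j.
G <- matrix(c(   0, 1200, 5000,  900,
              1500,    0, 3000, 2500,
              4800, 3300,    0, 7000,
              1000, 2400, 6500,    0), nrow = 4, byrow = TRUE)

# Symmetrise with a weighted average of the two directions.
w <- 0.5                      # assumed weight
D <- w * G + (1 - w) * t(G)
print(isSymmetric(D))

# Count ordered triples (i, j, k) with d(i, k) > d(i, j) + d(j, k).
count_triangle_violations <- function(D, tol = 1e-9) {
  bad <- 0
  for (j in seq_len(nrow(D))) {
    via_j <- outer(D[, j], D[j, ], "+")   # via_j[i, k] = d(i, j) + d(j, k)
    bad <- bad + sum(D > via_j + tol)
  }
  bad
}
print(count_triangle_violations(D))
```

A true Euclidean distance matrix would give a count of zero; any positive count means the matrix is not a metric.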

Here we report the distribution of the weighted difference between the Euclidean distance matrix computed from the initial dataset and the Google Maps distance matrix we had just built.

Then, even with our “distance” matrix in hand, we found that not all of the standard algorithms in R work with an arbitrary distance matrix.

This is where MDS (Multidimensional Scaling) came into play.

Thanks to MDS, we were able to compute a new set of fictitious coordinates (x, y) such that the Euclidean distances between them best approximate the original Google Maps ones.
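A minimal sketch of this step using classical MDS (base R's cmdscale()) is given below. Whether the original analysis used classical MDS or another variant is an assumption, and the coordinates are simulated.

```r
# Hypothetical planar coordinates (in km) and their distance matrix.
set.seed(11)
true_xy <- cbind(x = runif(30, 0, 300), y = runif(30, 0, 450))
D <- as.matrix(dist(true_xy))

# Classical MDS: recover 2-D coordinates from the distances alone.
mds_xy <- cmdscale(as.dist(D), k = 2)

# How well do the recovered coordinates reproduce the input distances?
rel_err <- max(abs(as.matrix(dist(mds_xy)) - D)) / max(D)
print(rel_err)
```

Because only distances are used, the recovered configuration is determined up to rotation, reflection and translation. With road distances, which are not Euclidean, the reconstruction can only be approximate.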

In the above plot we can recognize the familiar shape of Ireland, just rotated a bit counterclockwise.

We can confirm the quality of this result by noting that the denser part of the picture, on the center-right, is interpretable as the capital city of Dublin (where most of the charge points are located), whereas the other dense area at the bottom is the city of Cork.

**Clustering – DBSCAN**

Starting from the new distance matrix, we decided to perform a clustering study on the charge points to understand the pattern of their usage.

We chose DBSCAN because it is a very “flexible” clustering algorithm: it does not require the number of clusters to be specified a priori, and it can find arbitrarily shaped clusters.

First, we applied a weighted DBSCAN to all the stations, giving latitude and longitude as coordinates and the percentage of annual usage as weight. The latitude and longitude we used are the ones recovered from Multidimensional Scaling.

As input we gave the parameters minPoints and epsilon. We tried different values of minPoints and settled on 10. Looking for a knee in the k-nearest-neighbor distance plot (kNNdistplot), we chose epsilon equal to 60000.
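To show what a weighted DBSCAN does with these two parameters, here is a self-contained base-R sketch. In practice one would more likely use an existing implementation, such as dbscan() from the dbscan package, which accepts a `weights` argument; the function `weighted_dbscan()` below is hand-written for illustration, and the coordinates and weights are simulated. A point counts as a core point when the total weight within distance `eps` reaches `min_weight`.

```r
weighted_dbscan <- function(xy, eps, min_weight, w = rep(1, nrow(xy))) {
  D <- as.matrix(dist(xy))
  # eps-neighbourhood of each point (includes the point itself)
  nbrs <- lapply(seq_len(nrow(xy)), function(i) which(D[i, ] <= eps))
  is_core <- vapply(nbrs, function(idx) sum(w[idx]) >= min_weight, logical(1))
  cluster <- integer(nrow(xy))            # 0 = noise / not yet assigned
  current <- 0L
  for (i in which(is_core)) {
    if (cluster[i] != 0L) next
    current <- current + 1L
    cluster[i] <- current
    queue <- nbrs[[i]]
    while (length(queue) > 0) {           # grow the cluster from core points
      j <- queue[1]; queue <- queue[-1]
      if (cluster[j] != 0L) next
      cluster[j] <- current
      if (is_core[j]) queue <- c(queue, nbrs[[j]])
    }
  }
  cluster
}

# Simulated coordinates: two dense blobs plus a few isolated points.
set.seed(5)
xy <- rbind(
  cbind(rnorm(40, 0, 5000),      rnorm(40, 0, 5000)),
  cbind(rnorm(40, 200000, 5000), rnorm(40, 0, 5000)),
  cbind(runif(5, 50000, 150000), runif(5, 300000, 400000))
)
w <- runif(nrow(xy), 0.5, 1.5)            # hypothetical usage weights

cl <- weighted_dbscan(xy, eps = 60000, min_weight = 10, w = w)
print(table(cl))

# Sorted distance to the k-th nearest neighbour: the knee of this curve
# is a common heuristic for choosing eps.
k <- 10
knn_dist <- sort(apply(as.matrix(dist(xy)), 1, function(r) sort(r)[k + 1]))
```

Plotting `knn_dist` as a line gives the curve in which the knee is sought.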

We obtained only two clusters, which are interpretable as the Dublin area versus the rest of the counties.

This first result was very interesting, as it highlighted the large difference between the capital city and the rest of the country.

In order to strengthen this result, we performed a weighted DBSCAN for every hour of May 2018.

Weights were given by the percentage of usage in each hour. Most of the plots showed only two clusters, again interpretable as the Dublin area vs the rest of Ireland, while at particular times of the day we observed only one cluster and, at a few others, three.

The third cluster (visible at some of the hours we reported) is interpretable as the urban area of Cork.

Indeed, Dublin and Cork are the first and second largest cities in the Republic of Ireland, and the DBSCAN algorithm indicated significantly higher usage in these two areas compared to the rest of the country.

**Prediction of new data – KRIGING**

The cluster analysis made with DBSCAN reminded us of our biggest question: “what are the best locations for new charge points?”

In order to answer it, we performed a geospatial analysis. In particular, we started from the knowledge acquired in the previous parts and used a Kriging technique to gain a quantitative understanding of our problem.

First of all, we imported the fictitious coordinates recovered via MDS from the Google Maps Distance Matrix.

We focused our attention on the probability (as a percentage) that each charge point is in use at a given time of the year, removing all the charge points with zero usage and creating a bubble plot of the remaining ones.

Here too we can see the familiar shape of Ireland, slightly rotated counterclockwise, and we can identify the two zones where the charge points are used the most: Dublin and Cork.

We then considered transforming the variable Annual Use to make its distribution symmetric.

The following histogram shows that the logarithm makes the data almost symmetric, so we decided to use a logarithmic transformation.

After that, we computed the empirical variogram and fitted a model to it; the best fit was a spherical model with partial sill 1, range 80000 and nugget 2.
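To give a feel for what these three numbers mean, the following base-R sketch evaluates a spherical variogram with the parameters quoted above. The function `spherical_variogram()` is a hand-written illustration of the standard spherical model, not the code used in the analysis.

```r
# Spherical variogram: 0 at h = 0, then nugget + psill * (1.5 r - 0.5 r^3)
# with r = h / range, flattening at nugget + psill for h >= range.
spherical_variogram <- function(h, psill = 1, range = 80000, nugget = 2) {
  r <- pmin(h / range, 1)
  ifelse(h == 0, 0, nugget + psill * (1.5 * r - 0.5 * r^3))
}

h <- c(0, 40000, 80000, 200000)
print(spherical_variogram(h))
```

With these values the total sill is 3, of which the nugget accounts for 2: most of the variability of the log-usage is not explained by spatial distance, and the spatial correlation vanishes beyond 80000 distance units.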

To obtain an overall picture of the prediction for new locations across the whole of Ireland, we created a gstat object and a grid of points, starting from the minimum coordinate and ending with the maximum coordinate.

After that comes the best part: the prediction itself.
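The prediction step can be sketched as follows. The original analysis used the gstat package; the version below is a hand-written ordinary kriging in base R, meant only to show what is computed at each grid point. Ordinary kriging is an assumption (the page only says “a Kriging technique”), `sph()` and `ordinary_kriging()` are hypothetical helpers, and the coordinates and log-usage values are simulated.

```r
# Spherical variogram with the parameters reported in the text.
sph <- function(h, psill = 1, range = 80000, nugget = 2) {
  r <- pmin(h / range, 1)
  ifelse(h == 0, 0, nugget + psill * (1.5 * r - 0.5 * r^3))
}

# Ordinary kriging: solve [Gamma 1; 1' 0] [lambda; mu] = [gamma0; 1]
# for each new location and return the weighted sum of the observations.
ordinary_kriging <- function(xy, z, new_xy, vgm_fun = sph) {
  n <- nrow(xy)
  G <- vgm_fun(as.matrix(dist(xy)))
  A <- rbind(cbind(G, 1), c(rep(1, n), 0))
  apply(new_xy, 1, function(p) {
    g0 <- vgm_fun(sqrt((xy[, 1] - p[1])^2 + (xy[, 2] - p[2])^2))
    sol <- solve(A, c(g0, 1))
    sum(sol[seq_len(n)] * z)        # weights sum to 1 by construction
  })
}

# Simulated stations (hypothetical MDS coordinates) and log annual use.
set.seed(9)
xy <- cbind(runif(25, 0, 300000), runif(25, 0, 400000))
log_use <- rnorm(25, mean = -2, sd = 1)

# Regular grid from the minimum to the maximum coordinate.
grid <- as.matrix(expand.grid(
  x = seq(min(xy[, 1]), max(xy[, 1]), length.out = 20),
  y = seq(min(xy[, 2]), max(xy[, 2]), length.out = 20)
))
grid_pred <- ordinary_kriging(xy, log_use, grid)
print(summary(grid_pred))
```

The predictions are on the log scale; the kriging variance, which quantifies the uncertainty at each grid point, is omitted here for brevity.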

In the following heat map, created with ggmap, we can see a qualitative pattern of the probability of finding a charge point occupied if it were placed at a new hypothetical location:

This image is not very precise and should be interpreted with care, so a few more considerations are necessary.

Unfortunately, we could not take into account the position detail of every point of the grid in this analysis, nor could we consider the existence or absence of a road network in the rural areas.

Another possible issue is that the introduction of a new charge point in the network might affect the usage probability of its closest neighbours.

However, the image is still quite easy to interpret: the probability of finding a charge point occupied is higher in the areas closest to the big cities (Dublin and Cork above all).

In general, the probability decreases when moving away from the urban centers.

**Conclusions**

Data released by the Irish government in recent months show that the number of electric vehicles in Ireland is growing quickly. Thus, there will likely be an increasing need for charging stations.

Our analysis suggests that new stations should be of the Fast type and positioned in urban areas.

More precisely, we would add most of them in the areas close to Dublin and Cork, which are the two largest cities in Ireland.

Adding them close to Shopping Centre areas would be useful as well, while adding more of them on Motorways is not needed at the moment.

Why do we state this? One of the problems of electric vehicles is their shorter range compared to traditional ones. To counter this problem, the Irish government has already installed one charging station at most every 50 km along the main motorways. From what we observed, this seems to be enough to support usage on high-speed roads.


**Bonus Part**

If you wish to take a closer look at our work, here you can find the two presentations for the workshops and the final poster.