How to calculate correlation on your data

Correlation. Not many web analyst do it, but we should be doing it. It’s the only way to be sure if there is a dependency between two sets of data.

A classic example is “weather temperature” vs “Sale of umbrellas” or “Sale of ice creams”. But what about the web analytics world? Shouldn’t we be calculating correlations on our datasets to see if there is a reasonable explanation behind the data we extract and the stories we tell? The answer is Yes, and if you disagree, I would like to hear the reason why.

I have my self, done loads of “analysis” looking at web analytics data and to be fair, we can do much better than just interpreting the data we see. We need to add math to the equation. The web analytics tools will not necessarily give us all of the answers, especially not when we are trying to predict if a metric is impacted by another metric or not.

What is a correlation?

In statistics, dependence is any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Source: wikipedia

When you correlate data, you will have a data range that can be between these ranges:

-1 = perfect negative correlation
0  = no correlation
1 = perfect positive correlation

You can have values that are in between the values stated above, such as:  

-0.9 = high negative correlation
-0.5 = low negative correlation
0.9 = high positive correlation
0.5 = low positive correlation

If your correlation value is 1, it means that there is a strong correlation (relationship) between the two data sets and there is link to when one metric goes up, the other is also going to go up.
0 means you don’t have any correlations between those two datasets and that they most likely don’t have anything common (there is the chance for a coincidence, depending on your data source).
-1 tells you that if one of them goes up, the other goes down, a negative correlation.

How do you calculate the correlation?

Before we correlate anything, we need to specify which two sets of data we are going to use. In our simple example, we will pretend we are the happy owners of “Amazon” and would like to see if there is a correlation between “Weather” and the amount of “Logins” for any movie/drama/tv-show. (weather data can be incorporated with API’s and feed the current weather condition with each visitor entering your website. Wunderground is my favorite).

Sample data (Not real data):

Temperature
Logins
14.2042,987
16.4041,938
12.0044,874
14.2042,398
17.3039,874
21.0036,928
19.2033,928
24.2031,039
23.5031,239
19.1035,672
21.6034,028
18.2041,438
18.4138,029

Now that we have our two data sources. To justify if there is a need to apply the correlation model, we can easily make a scatter plot to see if there is any “visible” correlation. Example of how the data would look like in a scatter plot:

Chart showing a almost perfect negative correlation

The reason it is a good idea to plot your data from your table into a scatter plot is to have a visual look at how the data entries are gathered. I can already tell now, by the way that the chart is displaying the plots that if I apply a trendline to the data, I will see a negative correlation. But I still don’t know if the correlation is defined as being “High negative perfect” or “Low negative perfect”. Please note, that if your data is displayed in a “curve-looking” scatter plot, then you will most likely have a correlation of 0, since correlations isn’t a good method to use on data points that fluctuate that much. So a good thumb rule is to ensure that data is somewhat displayed in a area that is “linear”. In order to achieve the final result, we need to apply some calculations. The easiest way is just using this tutorial for excel.

But since you and I like to understand how things work, we will take the time to show you how the calculations are done.

Follow these steps to learn how to correlate your web analytics data with other data sources:

  1. Calculate the mean of “Temperature”, and the mean of “Logins” (you can use the “average” function in excel)
  2. Create to new columns and name them “Temperature Mean” and “Logins Mean”
  3. Subtract the mean of “Temperature mean” from every “Temperature” value, do the same for “Logins Mean”
  4. Create 3 new columns and name them “Temperature^2”, “Logins^2” and “Temperature Mean * Logins Mean”
  5. Calculate: “Temperature Mean * Logins Mean”, “Temperature^2” and “Logins^2” for every value
  6. Sum up “Temperature Mean * Logins Mean”, sum up “Temperature^2” and sum up “Logins^2”
  7. Divide the sum of “Temperature Mean * Logins Mean” by the square root of [(sum of “Temperature^2”) × (sum of “Logins^2”)]

 

Temperature °C
Logins
Temperature Mean
Logins Mean
Temperature Mean^2
Logins Mean^2
Temperature Mean * Logins Mean
14.2042,9874.21-4,9581824,585,896-20,867
16.4041,9382.01-3,909415,283,539-7,851
12.0044,8746.41-6,8454146,859,729-43,868
14.2042,3984.21-4,3691819,091,802-18,388
17.3039,8741.11-1,84513,405,563-2,045
21.0036,928-2.591,10171,211,284-2,852
19.2033,928-0.794,101116,814,784-3,246
24.2031,039-5.796,9903448,854,275-40,481
23.5031,239-5.096,7902646,098,442-34,570
19.1035,672-0.692,35705,553,485-1,630
21.6034,028-3.194,0011016,004,667-12,769
18.2041,4380.21-3,409011,624,122-710
18.4138,029< Mean of columnSum of column>159255,387,587-189,278
Correlation:-0.94

Conclusion of the correlation value:

We have a correlation value of -0.94 which is in between a “Perfect negative” and “High Negative” correlation. This gives us a solid result of being able to state that when the temperature drops, we see the amount of logins increase. Or if you wanted to make the other statement, that the amount of Logins fall when the weather is good.