Statistical analyses: are you sure you know what you are doing?
I know, this post is unusual for me, since I tend to write a lot about diet and health. However, having worked for more than 10 years in the field of statistics and nutritional epidemiology, I have a lot of experience with statistical analyses as well. And, as you can imagine, I have made tons of mistakes as well!
Therefore, I have decided to share my experience with you, to grab pen and paper (well, more precisely: keyboard and screen), and to write a few lines about the common mistakes that can invalidate your statistical efforts. Are you ready to go? Let’s start!
Mistake n. 1: skip descriptive statistical analyses
I can’t stress this enough: you cannot analyze data you don’t know and hope to get meaningful results! The number one step before doing any kind of hypothesis testing is to describe the variables you intend to analyze. Let’s suppose you want to study the association between high pasta consumption and the risk of gaining body weight. If you test these two variables in a statistical model without first describing pasta intakes in your population (as well as changes in body weight over a specific period of time), you might miss important facts and draw wrong conclusions. For example, pasta intakes in the population you are studying might be too small to justify any possible association with body weight gain. Or your population might have a stable weight during the period you are studying, in which case no association between body weight change and any risk factor can possibly be detected. In both cases, if you don’t describe the distribution (e.g. median and 25th–75th percentiles) of your variables of interest, you seriously risk erroneously concluding that pasta intake does not contribute to weight gain, when the only conclusion you can draw from the data you have is that…. you cannot draw any conclusions at all!
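In practice, this descriptive step takes only a few lines of code. Here is a minimal sketch in Python with pandas (assuming pandas is available; the intake numbers are entirely made up for illustration), computing exactly the kind of summary mentioned above:

```python
import pandas as pd

# Hypothetical, made-up dataset: daily pasta intake (g/day) for 10 subjects
df = pd.DataFrame({
    "pasta_g_day": [40, 55, 60, 62, 70, 75, 80, 90, 110, 150],
})

# Median and 25th-75th percentiles: describe before you test!
summary = df["pasta_g_day"].quantile([0.25, 0.50, 0.75])
print(summary)
```

If the median intake turns out to be tiny, or the spread is negligible, you already know your model is unlikely to tell you anything useful.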
A second important reason why you must describe your data before you analyze it is that your dataset could contain missing values, outliers, or (quite often) both. Missing values are values that, for one reason or another, were not measured during the survey you performed, or that were deleted by mistake (or because they were implausible). In the example above, some of your study subjects might have refused to communicate their body weight, which will therefore be absent from your dataset.
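Spotting missing values is also a one-liner. A small sketch with pandas (again with invented data: two subjects who declined to report their weight):

```python
import numpy as np
import pandas as pd

# Hypothetical survey: two subjects refused to report their body weight
df = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "body_weight_kg": [70.5, np.nan, 82.0, np.nan, 65.3],
})

# Count the missing values per column before any modelling
missing = df.isna().sum()
print(missing)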
On the other hand, outliers are values that are unreasonably big or small. Again, going back to the example of pasta consumption and weight gain, you might observe that one of your subjects reported eating 1 kg of pasta per day. Let’s face it, this is really a lot of pasta. Even for me! In this case, subjects reporting implausibly high pasta intakes can be excluded from the analyses. How do you define a reasonable threshold for implausible values? As a rule of thumb, you can set this threshold at the mean plus or minus two (or three) standard deviations.
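Applying that rule of thumb is straightforward. A sketch with numpy, using the mean ± 2 standard deviations as the plausibility window (the data is made up, with one deliberately absurd 1000 g/day value):

```python
import numpy as np

# Hypothetical intakes in g/day; 1000 g/day looks implausible
intakes = np.array([40, 55, 60, 62, 70, 75, 80, 90, 110, 1000], dtype=float)

mean, sd = intakes.mean(), intakes.std(ddof=1)
lower, upper = mean - 2 * sd, mean + 2 * sd

# Keep only the values inside the plausibility window
plausible = intakes[(intakes >= lower) & (intakes <= upper)]
print(plausible)
```

One caveat: a single extreme value inflates the standard deviation itself, so for heavily contaminated data you may prefer a threshold based on robust statistics (e.g. the interquartile range).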
Mistake n. 2: run statistical analyses without testing the model’s assumptions
As the old proverb goes: more haste, less speed. This is a typical mistake people make when they are in a hurry: they pick a statistical model that is suitable for the hypothesis they have chosen, without testing whether it fits the data as well.
Again, know your data if you want to be able to trust your results. Many people think that the power of a statistical analysis depends entirely on the type of model chosen, i.e. the stronger the model, the more reliable the results. I’m sorry to disappoint you, but this is far from true.
The reality is that each statistical model has its specific assumptions, and if your data violate them, the results are not guaranteed. Sometimes you need your data to follow a normal (also known as “Gaussian”) distribution, or you need your variables to be linearly related to each other. Additionally, the groups you want to compare may need to have the same variance (a property usually called “homoscedasticity”). It goes beyond the scope of this post to describe how to test these assumptions in practice. For the time being, what you need to know is that, if your data violate the assumptions of the statistical model you chose, even if the latter seems the most appropriate one based on your hypothesis, you need to find an equivalent non-parametric test. Non-parametric tests are like the US Marines of statistics: they work well even when all the others have failed! Among other things, they can handle non-normally distributed data.
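To make this concrete, here is one possible sketch in Python with scipy (assuming scipy is installed; the two groups are simulated for illustration): check normality with the Shapiro–Wilk test, then fall back to the non-parametric Mann–Whitney U test when the assumption fails.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two hypothetical groups of pasta intakes; group_b is deliberately right-skewed
group_a = rng.normal(loc=70, scale=10, size=50)
group_b = rng.lognormal(mean=4.2, sigma=0.6, size=50)

# Shapiro-Wilk tests the normality assumption (p < 0.05 -> reject normality)
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

if normal_a and normal_b:
    result = stats.ttest_ind(group_a, group_b)     # parametric t-test
else:
    result = stats.mannwhitneyu(group_a, group_b)  # non-parametric fallback
print(result.pvalue)
```

This is only a sketch of the decision logic; in a real analysis you would also inspect histograms and Q–Q plots rather than rely on a single test.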
Sometimes, the problem is that your dataset is simply too small and it is not materially possible to make your data fit the assumptions of the most popular models.
However, in some other cases, you can “bend” your data to fit the model’s assumptions simply by “transforming” it. What do I mean by this? Imagine that your data shows a distribution that does not follow the traditional Gaussian distribution required, for instance, to obtain reliable results from a linear regression model. Most of the pasta intakes you measured might cluster at moderate values, with a long tail of very high intakes, generating a “skewed” distribution. In this case, you can try transforming your variable by applying a logarithm, which could (if you are lucky enough) generate a proxy variable that fits the normality assumption. Not bad, right? Be careful with the interpretation of your results though, since you will have to transform the estimates back to the original scale (in the case of logarithms, this means you must calculate the exponential of the estimates you obtain).
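The transform–then–back-transform round trip can be sketched in a few lines of numpy (the intakes are invented; the “estimate” here is just the mean of the log values, standing in for whatever your model produces on the log scale):

```python
import numpy as np

# Hypothetical right-skewed intakes (g/day): most values moderate, one huge
intakes = np.array([45, 50, 52, 55, 60, 62, 65, 70, 90, 250], dtype=float)

log_intakes = np.log(intakes)

# ... fit your model on log_intakes and obtain an estimate on the log scale ...
log_estimate = log_intakes.mean()

# Back-transform with the exponential; note this yields the geometric mean,
# not the arithmetic mean, so interpret it accordingly
estimate = np.exp(log_estimate)
print(estimate)
```

Note the interpretation caveat in the comment: exponentiating a mean of logs gives the geometric mean, which is always smaller than the arithmetic mean for skewed data like this.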
Mistake n. 3: randomly run statistical analyses without a clear hypothesis
Let me get this straight: comparing as many variables as humanly possible in the hope of finding a statistically significant association is the equivalent of shopping with your eyes covered and putting anything you grab into the trolley, hoping to go home with all the ingredients for a Caesar salad. You never know what you’ll end up buying and, if you manage to go home with at least some chicken, it will be thanks to chance alone!
Suppose you have collected data from a population and that this data is sufficient to describe the characteristics of each individual at a very detailed level. You have data about their education, diet, lifestyle, their genetic profile, you know the kind of music they like, and so on. You might be tempted to subdivide your population into different sub-groups (healthy vs. unhealthy dieters, pop music vs. opera listeners, and so on) and to run a series of statistical models that compare each of these groups with all the others. At the end of your analyses, you might conclude that listening to pop music instead of opera is associated with having a healthy diet. Wow, this is a very interesting finding which is definitely worth a press release!
Well, hold on for a second before you imagine yourself on the cover of Time magazine. Unfortunately, things are not as easy as you think. If you run a series of statistical analyses comparing any variable with any other, sooner or later you will find at least one significant result. However, by working this way, you increase the probability that your significant outcome is nothing else than a chance finding. Indeed, it is absolutely possible to find a statistically significant positive correlation between the number of obese people worldwide and the number of records sold by Lady Gaga each year! But would you beg Mrs. Gaga to stop selling her records for the sake of curbing the obesity epidemic? Well, I don’t think so. The two variables might simply correlate due to chance.
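You can see this inflation of chance findings with a tiny simulation (a sketch using numpy and scipy; the variables are pure random noise, so every “significant” correlation is, by construction, a false positive):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Correlate 100 pairs of completely unrelated random variables
n_tests, n_subjects = 100, 30
false_positives = 0
for _ in range(n_tests):
    x = rng.normal(size=n_subjects)
    y = rng.normal(size=n_subjects)
    r, p = stats.pearsonr(x, y)
    if p < 0.05:
        false_positives += 1  # "significant" by chance alone

print(false_positives)  # on average about 5 out of 100 at the 0.05 level
```

This is exactly why a fishing expedition across dozens of variables will almost always “discover” something, and why corrections for multiple testing (e.g. Bonferroni) exist.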
How can you avoid this problem? Simply stop running random comparisons and start testing a meaningful hypothesis instead. You need at least one sound reason to believe that two factors might be associated with each other. This can be either because someone else before you has found the same association and you want to confirm it in your data, or because it is biologically, chemically, physically (or maybe even…. politically!) reasonable to assume that the two variables could be related. Sitting in front of a TV night and day turns you into a couch potato; therefore, testing whether the number of hours spent in front of a screen is related to the risk of becoming overweight is a reasonable hypothesis.
Obviously, there are cases in which it is neither easy nor possible to formulate a reasonable hypothesis and where hypothesis-free (or data-driven) analyses are run. However, these are specific cases which are regulated by specific rules. We will discuss this issue another time.
* * *
That’s all for today, folks! I hope you have learned something new and, if you liked this post, don’t forget to share it with the rest of the world via email or social media. And, if you need a smart guy who can help you with your data analyses, don’t hesitate to send me an email!