How to perform a multivariate analysis correctly?
Why perform a multivariate analysis?
There are two main reasons for performing multivariate analysis in biomedical research:
- Look for risk factors or predictive factors of a pathology, a clinical outcome, membership of a group, etc.
- Take several parameters into account simultaneously, for example when you compare two groups that differ demographically in ways that affect the result (confounding factors)
In general, when several parameters (variables) can influence the result, a multivariate analysis makes it possible to adjust the results to take these parameters into account simultaneously.
Let's take a simple example: you want to compare cardiovascular risk in men and women in the general population. It is known that men have a higher cardiovascular risk. However, you may find only a small difference between the two sexes. This may be due to confounding bias: women live longer than men on average, and older age is also a cardiovascular risk factor. Your study population could therefore be older in the female group, which would artificially increase the cardiovascular risk observed in that group. By including both sex and age in your model, you correct for this confounding bias.
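To make the mechanism concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they are not real epidemiological data), and the age bands and names are assumptions. It compares the crude risk in each sex with the risk within each age band.

```python
# Hypothetical counts, invented for illustration only: (people, events)
counts = {
    "men":   {"under_65": (800, 80), "65_plus": (200, 60)},
    "women": {"under_65": (400, 20), "65_plus": (600, 120)},
}

def crude_risk(strata):
    """Overall risk, ignoring age."""
    people = sum(n for n, _ in strata.values())
    events = sum(e for _, e in strata.values())
    return events / people

def stratum_risks(strata):
    """Risk within each age band."""
    return {band: e / n for band, (n, e) in strata.items()}

crude = {sex: crude_risk(strata) for sex, strata in counts.items()}
by_age = {sex: stratum_risks(strata) for sex, strata in counts.items()}

print(crude)   # crude risks are identical in both sexes
print(by_age)  # within each age band, men have the higher risk
```

In this made-up population the crude risks are equal only because the female group is older. Once age is taken into account, the difference between the sexes reappears, which is what adjusting for age in a multivariate model achieves.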
How to perform a multivariate analysis?
- Choose the variable to study Y
- Define the type of model to use: linear regression, logistic regression or Cox model
- Define the predictive variables X
- Check that there is no link between the predictive variables (multicollinearity)
- Check the regression assumptions
We will now detail these steps.
1. Choice of the study variable Y
This is usually the easiest step. It corresponds to your research hypothesis.
If you are looking for predictors of post-operative complications, your study variable Y is "post-operative complication".
2. Define the model type
This depends directly on the first step.
EasyMedStat automatically chooses the type of model for you according to the type of the variable to be studied Y:
- Linear regression if Y is a continuous numeric variable
- Logistic regression if Y is a binary variable (yes / no)
- Cox model if Y is a censored time-to-event variable (for example, time until an event occurs)
Note that if you transform a continuous numeric variable into a binary variable, you will need to use logistic regression. For example, if you are looking to predict when a pain score is greater than 5/10, you are actually analyzing a binary variable (> 5/10 = yes, ≤ 5/10 = no).
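The mapping above, and the effect of dichotomizing a score, can be sketched as follows. This is an illustrative helper, not EasyMedStat's actual logic: the function names, the outcome-type labels and the example scores are assumptions.

```python
def choose_model(outcome_type):
    """Map the type of the outcome Y to the model family named in the text."""
    models = {
        "continuous": "linear regression",
        "binary": "logistic regression",
        "time_to_event": "Cox model",
    }
    if outcome_type not in models:
        raise ValueError(f"unsupported outcome type: {outcome_type!r}")
    return models[outcome_type]

def dichotomize(scores, threshold=5):
    """Turn a numeric score into a yes/no outcome: True when score > threshold."""
    return [score > threshold for score in scores]

pain_scores = [2, 5, 6, 9]  # hypothetical pain scores out of 10
high_pain = dichotomize(pain_scores)

print(high_pain)               # [False, False, True, True]
print(choose_model("binary"))  # logistic regression
```

Note that a score of exactly 5 falls in the "no" group, consistent with the > 5/10 threshold described above.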
3. Define the predictive variables X
Choose the right variables!
This is the most crucial step in your multivariate analysis!
This is where the quality of your analysis is decided. The good news is that you don't need advanced statistical knowledge to choose these variables. You do, however, need a good knowledge of the pathology you are studying.
There are 2 types of variables that should generally be chosen as predictive variables:
- The variables corresponding to your hypothesis
- Variables known to influence the variable to be studied Y
Let us take a simplified example of a study on a new antiplatelet treatment aimed at preventing myocardial infarction in patients who have never had a heart attack (primary prevention). You are testing this new treatment against a placebo. Your hypothesis is that there will be fewer myocardial infarctions in the treatment group than in the placebo group.
- Your study variable Y is therefore the occurrence of a myocardial infarction.
- The variable X corresponding to your hypothesis is the treatment received by the patient (antiplatelet agent or placebo).
- The X variables known to influence Y are the cardiovascular risk factors: age, male sex, diabetes, etc.
You must therefore include in your model not only the treatment received but also age, sex, the presence of diabetes, etc.
The right number of variables
As is often the case, the correct answer is "neither too many nor too few".
The number of variables in the model must be adapted to the number of patients you have available for your analysis. A generally accepted rule is to have at least 10 patients for each variable in the model. However, different opinions on the matter exist.
This 10-patient rule applies somewhat differently depending on whether you are performing a linear or a logistic regression. For linear regression, it applies directly: if you analyze 70 patients, you can put up to 7 predictor variables in the model. For logistic regression, the rule applies to the smaller of the two outcome groups. So if you have a binary variable (yes / no) known for 70 patients, with 30 patients who have the value "Yes" and 40 patients who have the value "No", we consider the smaller number, that is to say 30 patients. You can then include only 3 variables in the model.
Once again, this 10-patient rule is not universally accepted, but it is frequently used.
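The arithmetic of this rule of thumb can be written as a short Python sketch. The function name and parameters are hypothetical, and this is not a description of how EasyMedStat performs its own check.

```python
def max_predictors(n_patients, n_events=None, per_variable=10):
    """Upper bound on predictors under the '10 patients per variable' rule.

    Linear regression: pass only n_patients.
    Logistic regression: also pass n_events (patients with "Yes");
    the smaller of the two outcome groups is the limiting sample size.
    """
    if n_events is None:
        limiting = n_patients
    else:
        limiting = min(n_events, n_patients - n_events)
    return limiting // per_variable

print(max_predictors(70))               # linear regression: 7
print(max_predictors(70, n_events=30))  # logistic regression: 3
```

The `per_variable` parameter is left adjustable because, as noted above, the threshold of 10 is a convention rather than a strict requirement.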
When you perform your multivariate analysis on EasyMedStat, the number of predictive variables is automatically checked.
It is also important not to include too few variables in the model. Otherwise, your analysis could be incomplete or even wrong. If you do not include the variable "diabetes" in a study to predict cardiovascular risk, your results may be biased.
As you can see, you have to include enough variables to build a model as close as possible to reality, but also analyze enough patients to support them. This is why multivariate analyses are generally performed on relatively large samples, usually at least 100 patients (although this number is very arbitrary and can vary greatly depending on your data).
4. Check for multicollinearity
Behind this intimidating term hides a relatively simple concept: you need to make sure that your explanatory variables X are not too strongly related to each other statistically.
For example, you should not include the weight variable in a model along with the BMI variable because there is a direct relationship between these two variables (BMI = weight / height squared).
EasyMedStat automatically checks the multicollinearity of your variables when you include them to avoid this problem.
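To see why weight and BMI should not sit in the same model, here is a rough sketch using invented patients. It computes the Pearson correlation between the two variables and, assuming they were the only two predictors in the model, the corresponding variance inflation factor (VIF = 1 / (1 - r²)). It only illustrates the idea and is not the check EasyMedStat runs.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical patients: weight in kg, height in m
weights = [55, 62, 70, 78, 85, 95]
heights = [1.60, 1.75, 1.68, 1.82, 1.70, 1.78]
bmis = [w / h ** 2 for w, h in zip(weights, heights)]

r = pearson(weights, bmis)
vif = 1 / (1 - r ** 2)  # valid as written only for a two-predictor model

print(round(r, 2), round(vif, 1))
```

With these made-up values the correlation is strongly positive, so the two variables carry largely the same information. In practice you would keep only one of them.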
5. Check the regression assumptions
The statistical validity of your results depends on compliance with the assumptions of the model you are using. If you violate these assumptions, your results may be wrong.
These assumptions depend on the type of model you are using. They can include, among other things, linearity, absence of heteroskedasticity, normality of residuals, etc.
However, these advanced concepts are checked automatically when you perform multivariate analysis with EasyMedStat. In the event of a violation of one of the assumptions, you are automatically informed and a solution is offered to you if possible.
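As an idea of what such a check involves, the sketch below fits a one-predictor linear regression on invented data and compares the spread of the residuals in the lower and upper halves of X. This is a crude illustration of heteroskedasticity, not a formal statistical test and not EasyMedStat's procedure. All names and values are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_square(values):
    """Average of the squared values."""
    return sum(v ** 2 for v in values) / len(values)

# Invented data: the scatter around the line grows with x
xs = list(range(1, 11))
ys = [2 * x + (0.2 * x if x % 2 == 0 else -0.2 * x) for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# xs is sorted, so the first half of the residuals belongs to the small x values
half = len(xs) // 2
spread_ratio = mean_square(residuals[half:]) / mean_square(residuals[:half])

print(round(spread_ratio, 2))
```

A ratio well above 1 means that the residuals are more scattered for large values of X than for small ones, which is the pattern that the constant-variance assumption rules out.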
Start your multivariate analysis now!
As you have seen, multivariate analysis is an advanced statistical technique, but suitable software makes it much easier to use.
This is precisely the case with EasyMedStat. You are guided throughout your analysis and you avoid the classic pitfalls into which you might otherwise fall.