R Language

Our research group uses R for analyzing and visualizing data and results. Although most of our work is basic analyses to first understand the data and simple associations, we often do some interesting things that may be of interest to others. This part is for anyone with an interest in general and targeted data analysis who needs a starting point and an introduction to basic R code for testing linear associations, comparing groups, and visualizing data. Thanks to ggplot, Data Visualization – Practical Introduction, and Drs Batubo and Fuller for their contributions.

First, R must be installed on your computer. This can be done by downloading and installing the latest version from https://www.r-project.org. Once installed, you can also have a look at RStudio, which can make using R easier. RStudio can be downloaded from https://posit.co/download/rstudio-desktop/. Once installed, you are ready to do some basic functions. For in-depth learning, the R community is easily accessible online, with most questions and problems already answered somewhere. For a more guided approach, I encourage you to consider the books by Julian Faraway, “Linear Models with R” and “Extending the Linear Model with R”. They are exceptional, first got me started in R many years ago, and I still come back to them.

Linear Models

  1. Create dataset for analysis
  2. Linear Regression
  3. Logistic Regression

Comparing Groups

  1. Create second dataset for comparing groups
  2. t test
  3. Wilcoxon Mann–Whitney U test
  4. Chi-squared Test

Data Visualization

  1. Create a new dataset for visualization
  2. Radar Plot / Spider Plot
  3. Ridgeline Plot
  4. Circular Bar Plot

Linear Models

Create dataset for analysis

First, we will create a dataset to use for these examples. The dataset will have 100 people with six columns of data: age (yrs), height (m), body mass index (BMI, kg/m2), glucose (mmol/L), mass (kg), and diabetes status (0 or 1). For these examples, we have written the code in a way that ensures we find some significant associations. Also, be aware that because of the random number generation, your results are likely to differ from mine. Finally, since this is random and artificial data, any results we do find may not reflect what we would see in real life.
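If you would like your numbers to be repeatable from run to run, you can set a seed before generating the data (a minimal example; any integer will do):

# Make the random draws below reproducible
set.seed(42)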

# Number of people
n_people <- 100

# Generate random ages
age <- sample(20:60, n_people, replace = TRUE)

# Generate height, then BMI and glucose as functions of age (plus random noise)
height <- 1.75 + rnorm(n_people, mean = 0, sd = 0.15)
BMI <- 31.36 - (0.06 * age) + rnorm(n_people, mean = 0, sd = 2)
glucose <- (0.154 * age) + 4.72 + rnorm(n_people, mean = 0, sd = 2)

# Derive mass from BMI and height (mass = BMI x height^2, plus noise)
mass <- BMI * (height * height) + rnorm(n_people, mean = 0, sd = 2)

# Generate diabetes based on glucose, age, and BMI (positively associated)
prob_diabetes <- plogis(0.1 * glucose + 0.05 * age + 0.02 * BMI)
diabetes <- rbinom(n_people, size = 1, prob = prob_diabetes)

# Create the dataframe
df <- data.frame(BMI = BMI, age = age, glucose = glucose, diabetes = diabetes, mass = mass, height = height)

Now, with the dataset created, the code below can be run. We will look to see whether there are associations between variables within our study population (df).

Linear Regression

To test for an association (i.e., a link) between two continuous variables (such as height and body mass), a linear model is often the best place to start. For this, we use the command ‘lm’, which stands for linear model, and ask it to test for an association between height and mass… or, for every 1 metre increase in height, how much more mass (kg) does someone typically carry? This makes ‘height’ the exposure (or x) variable and mass the outcome (or y) variable.

We first run the lm model and store the result in a new variable called ‘MassVsHeight’, and then look at the results by asking R to summarize them for us.

> MassVsHeight <- lm(mass ~ height)
> summary(MassVsHeight)
Call:
lm(formula = mass ~ height)
Residuals:
     Min       1Q   Median       3Q      Max
 -16.2064  -4.4799  -0.8653   5.6583  17.0684
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -87.020      8.026  -10.84   <2e-16 ***
height       100.287      4.525   22.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.613 on 98 degrees of freedom
Multiple R-squared:  0.8337,	Adjusted R-squared:  0.832
F-statistic: 491.3 on 1 and 98 DF,  p-value: < 2.2e-16

The summary tells us that the link between height and mass is highly significant (i.e., p < 2×10⁻¹⁶) and that for every 1 unit (1 metre) increase in height there is an approximate 100 kg increase in body mass. This is a very big number, and we can make it more meaningful by using a smaller, more realistic change in the exposure variable. So rather than a change of 1 metre, let’s divide the estimate by 100 (100 cm/metre) to get the average increase in mass (kg) per 1 cm increase in height: approximately 1 kg. We can also see that the model fit is quite strong, with an R-squared of 0.8337, meaning that 83.4% of the variability in mass is explained by height (e.g., taller people tending to be heavier).
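As a quick sketch, you can also pull that per-centimetre estimate straight from the fitted model rather than working it out by hand (coef and confint are base R functions):

# Height coefficient is in kg per metre; divide by 100 for kg per cm
coef(MassVsHeight)["height"] / 100

# 95% confidence interval for the height coefficient
confint(MassVsHeight, "height")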

Logistic Regression

To test for an association between a continuous variable (such as BMI) and a categorical or binomial variable (i.e., true or false, yes or no), a logistic regression model is often the best place to start. For this, we use the command ‘glm’, which stands for generalized linear model, and ask it to test for an association between BMI and diabetes (which a person either has or doesn’t). Here, all non-diabetics have a ‘0’ in the ‘diabetes’ variable, and all diabetics have a ‘1’. We will use glm to test for the change in likelihood (or odds) of having diabetes for every 1 unit (kg/m2) increase in someone’s BMI (i.e., larger and typically more obese). This makes ‘BMI’ the exposure (or x) variable and diabetes the outcome (or y) variable.

We first run the glm model and store the result in a new variable called ‘DiabetesVsBMI’, then look at the results by asking R to summarize them for us, and then use another R command to make the result more interpretable.

> DiabetesVsBMI <- glm(diabetes ~ BMI, family = binomial)
> summary(DiabetesVsBMI)
Call:
glm(formula = diabetes ~ BMI, family = binomial)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  16.2129    11.3503   1.428    0.153
BMI          -0.4121     0.3691  -1.117    0.264
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 19.608  on 99  degrees of freedom
Residual deviance: 18.263  on 98  degrees of freedom
AIC: 22.263
Number of Fisher Scoring iterations: 7
# install the package to convert log odds to odds ratio for you
> install.packages("questionr")
> library(questionr)
> odds.ratio(DiabetesVsBMI)
Waiting for profiling to be done...
                    OR      2.5 %     97.5 %      p
(Intercept) 1.0994e+07 1.2650e-02 8.6189e+18 0.1532
BMI         6.6223e-01 2.7979e-01 1.3331e+00 0.2642

The summary tells us that there is a non-significant (p = 0.264) association between BMI and risk of diabetes in our study population, whereby as BMI goes up we can’t be confident that diabetes risk goes up, down, or stays the same. Also, the effect size is in log odds (-0.4121), which is not very helpful, so we need to exponentiate it (i.e., e to the power of the coefficient, here e^(-0.4121)) to turn it into the easier-to-interpret odds ratio (OR). Rather than doing this by hand, we’ve used the command odds.ratio, which does the work for us and even provides the 95% confidence interval. This tells us that the OR is 0.66 (i.e., 6.62×10⁻¹) and the confidence interval ranges from 0.28 to 1.33, which makes it easier to appreciate why the association is not significant. It tells us that, if we were to recruit even more people from this random population that we created, we could be 95% confident that a 1 unit increase in someone’s BMI could make them anywhere from 72% (i.e., 1 - 0.28) less likely to 33% (i.e., 1.33 - 1) more likely to be diabetic.
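If you would rather not install a package, base R can do the same conversion; a minimal equivalent using exp() on the model coefficients:

# Exponentiate the log odds to get odds ratios
exp(coef(DiabetesVsBMI))

# And the matching (profile-based) 95% confidence intervals
exp(confint(DiabetesVsBMI))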

Comparing Groups

Create second dataset for comparing groups

Now, suppose we want to compare differences between populations recruited from different places or treated differently (i.e., intervention vs control/placebo). Let’s create a second dataset (df2). Note: if you jumped straight to this step, be sure to go back and create the first dataset above too, so that you have two datasets to compare.

# Number of people
n_people2 <- 100

# Generate random ages
age2 <- sample(20:60, n_people2, replace = TRUE)

# Generate height, then BMI and glucose as functions of age (plus random noise)
# Group 2 is deliberately shorter on average so that mass clearly differs between
# the groups in the comparisons below (the 1.60 m mean is our choice)
height2 <- 1.60 + rnorm(n_people2, mean = 0, sd = 0.15)
BMI2 <- 31.36 - (0.06 * age2) + rnorm(n_people2, mean = 0, sd = 2)
glucose2 <- (0.154 * age2) + 4.72 + rnorm(n_people2, mean = 0, sd = 2)

# Derive mass from BMI and height (mass = BMI x height^2, plus noise)
mass2 <- BMI2 * (height2 * height2) + rnorm(n_people2, mean = 0, sd = 2)

# Generate diabetes based on glucose, age, and BMI (positively associated)
prob_diabetes2 <- plogis(0.1 * glucose2 + 0.05 * age2 + 0.02 * BMI2)
diabetes2 <- rbinom(n_people2, size = 1, prob = prob_diabetes2)

# Create the dataframe
df2 <- data.frame(BMI2 = BMI2, age2 = age2, glucose2 = glucose2, diabetes2 = diabetes2, mass2 = mass2, height2 = height2)

t test

The first thing you may want to know is whether the two groups differ on a certain variable. For example, do they differ by age? This can be done with a t-test. It will look at the average age in each group, see how closely all of the age values sit around each group’s average, and then tell you whether you can be confident that the groups are different (p < 0.05). In this case, we are going to use the t.test command to test whether the ages of the people in group 1 (df$age) differ from the ages of the people in group 2 (df2$age2).

> t.test(df$age, df2$age2)
	Welch Two Sample t-test
data:  df$age and df2$age2
t = -0.55333, df = 198, p-value = 0.5807
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.879307  2.179307
sample estimates:
mean of x mean of y
     41.12     41.97 

We see that the p-value is 0.5807, which tells us that the ages of the two groups are not very different at all. In fact, if we were to go back and recruit another group of people from the same populations for df and df2, the difference in mean age between df2 and df could be anywhere between -3.88 years and +2.18 years, and everything in between. In other words, we have no evidence to confidently say that one group is older than the other.
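If you want those numbers for a table or report, you can store the test result and extract them directly; for example:

# Store the test and pull out the confidence interval and p-value
age_test <- t.test(df$age, df2$age2)
age_test$conf.int
age_test$p.value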

Wilcoxon Mann–Whitney U test

The t-test is great when we are looking at things that are normally distributed (i.e., have a bell-shaped distribution), but in some cases we don’t know or aren’t confident that the data are normally distributed. In these cases, we need an alternative to the t-test, and this is often the Wilcoxon Mann–Whitney U test. Very simply, a normal distribution is bell shaped, with most values hovering around the mean, some a little less and some a little more.

For my data, I see a more normal distribution in group 2 when I look at mass (df2$mass2), with the majority of values around the middle and a relatively even distribution on either side of the average, compared to how mass is distributed in group 1 (df$mass). I can even go a step further and run a Shapiro–Wilk test on df$mass to test for a normal distribution, where the p-value of 0.009789 (i.e., <0.05) confirms my suspicion of a non-normal distribution.

> hist(df$mass)
> hist(df2$mass2)
> shapiro.test(df$mass)
	Shapiro-Wilk normality test
data:  df$mass
W = 0.96531, p-value = 0.009789

This is good evidence to confirm my suspicion of a non-normal distribution for df$mass, and it supports my hesitation to use a t-test to compare ‘mass’ between these two groups and my decision to use the Wilcoxon Mann–Whitney test (or Mann–Whitney test) instead.

> wilcox.test(df$mass, df2$mass2)
	Wilcoxon rank sum test with continuity correction
data:  df$mass and df2$mass2
W = 10000, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

We see that the p-value is very significant (<2.2×10⁻¹⁶), which tells us that the mass of the two groups is very different. If we were to go back and recruit another group of people from the same populations for df and df2, we can be confident that we would again see a difference in their body masses.

Chi-squared Test

Finally, what if we want to compare two binomial or categorical variables? They can’t easily be plotted on a graph and, with only ‘0’s and ‘1’s, don’t provide a very good distribution. This is where the Chi-squared test is ideal. In this case, we want to know whether the frequency of obesity cases is related to the frequency of diabetes cases in our first dataset.
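Note that df does not yet contain an ‘Obese’ variable, so we first create one; here we assume the conventional cutoff of BMI ≥ 30 kg/m2:

# Create a binary obesity indicator (1 = obese) using the BMI >= 30 cutoff
df$Obese <- ifelse(df$BMI >= 30, 1, 0)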

> chisq.test(df$Obese, df$diabetes)
	Pearson's Chi-squared test with Yates' continuity correction
data:  df$Obese and df$diabetes
X-squared = 2.0973, df = 1, p-value = 0.1476
Warning message:
In chisq.test(df$Obese, df$diabetes) : 
 Chi-squared approximation may be incorrect

The resulting non-significant p-value (p=0.1476) tells us that the two categorical variables are not strongly related, and we cannot confidently say that obesity and diabetes status are linked in this dataset.
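If you want to see the counts behind the test (and why R warned that the approximation may be incorrect, which happens when some cells contain very few people), you can cross-tabulate the two variables; Fisher’s exact test is a common fallback in that situation:

# Cross-tabulate obesity and diabetes status to inspect the cell counts
table(df$Obese, df$diabetes)

# Fisher's exact test avoids the chi-squared approximation for small cells
fisher.test(df$Obese, df$diabetes)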

Data Visualization

Create a new dataset for visualization

We will create two new datasets for data visualization: (1) “combined_data” has data on the quantity of 23 different types of fruit consumed by two groups of people (Group 1 and Group 2), each with 100 people; (2) “df” takes all of the individual data for each fruit in “combined_data” and simplifies it into 4 rows (maximum value in the population, minimum value in the population, average of Group 1, and average of Group 2) for each fruit. Note that this overwrites the ‘df’ from the earlier examples.

# Load necessary library
library(dplyr)

# Set seed for reproducibility
set.seed(123)

# Define fruit names
fruits <- c("Apple", "Banana", "Cherry", "Date", "Blueberry", "Fig", "Grape", "Honeydew", "Peach", "Jackfruit", "Kiwi", "Lemon", "Mango", "Pineapple", "Orange", "Papaya", "Apricot", "Raspberry", "Strawberry", "Tangerine", "Guava", "Nectarine", "Watermelon")

# Generate data for two groups
group1 <- data.frame(matrix(rnorm(2300, mean = 50, sd = 10), nrow = 100, ncol = 23))
group2 <- data.frame(matrix(rnorm(2300, mean = 55, sd = 15), nrow = 100, ncol = 23))

# Assign column names
colnames(group1) <- fruits
colnames(group2) <- fruits

# Combine groups into one data frame with an indicator column
group1$Group <- "Group1"
group2$Group <- "Group2"
combined_data <- bind_rows(group1, group2)

# Initialize summary table
df <- data.frame(matrix(ncol = 23, nrow = 4))
colnames(df) <- fruits
rownames(df) <- c("Max", "Min", "Mean_Group1", "Mean_Group2")

# Calculate summary statistics
for (fruit in fruits) {
  df[1, fruit] <- max(combined_data[[fruit]])
  df[2, fruit] <- min(combined_data[[fruit]])
  df[3, fruit] <- mean(group1[[fruit]])
  df[4, fruit] <- mean(group2[[fruit]])
}

Spider/Radar Plot

This plot is effective for offering a perspective on how the quantities of numerous items differ between two groups, or on their pattern of levels/intake. As an example, by organizing your dataset, you can put all of the “healthy” variables on one side of your table and the “unhealthy” variables on the other side, and this is how they will appear in the radar plot. When completed, it offers an easy way to highlight that one group has higher levels of the “healthy” markers while the other group has higher levels of the “unhealthy” markers. The downside of this plot is that it’s not conducive to showing statistical detail (e.g., error bars) because it relies solely on summary-level data. Thanks to ggplot for the guidance and introduction.

#Load libraries
library(fmsb)
library(ggplot2)
library(tidyr)

# Create function
create_beautiful_radarchart <- function(data, color = c("dodgerblue", "yellow", "#FC4E07"),
                                         vlabels = colnames(data), vlcex = 0.7,
                                        caxislabels = NULL, title = NULL, ...) {
  # Select only numeric columns for radar chart
  numeric_data <- data[, sapply(data, is.numeric)]

  radarchart(
    numeric_data, axistype = 1,

    # Customize the polygon
    pcol = color, pfcol = scales::alpha(color, 0.4), plwd = 2, plty = 1,
    # Customize the grid
    cglcol = "grey", cglty = 1, cglwd = 0.8,
    # Customize the axis
    axislabcol = "grey",
    # Variable labels
    vlcex = vlcex, vlabels = vlabels,
    caxislabels = caxislabels, title = title, ...
  )
}

# Reduce plot margin
op <- par(mar = c(1, 1, 1, 1))

# (Row names were already set to Max, Min, Mean_Group1, and Mean_Group2 when df was created above)

# Set axis labels
create_beautiful_radarchart(df, caxislabels = c(0, 25, 50, 75, 100))

# Add a legend at the bottom
legend(
  x = "bottom", legend = c("Group 1", "Group 2"), horiz = TRUE,
  bty = "n", pch = 20, col = c("dodgerblue", "yellow"),
  text.col = "black", cex = 1, pt.cex = 1.5)

# Restore the previous plot margins
par(op)

For more examples of how to present the data, check out https://r-graph-gallery.com/143-spider-chart-with-saveral-individuals.html; it’s a great starting point for getting your data to look and be presented in the best way possible.

Ridgeline Plot

The Ridgeline plot is a great way to show how the distribution of data (e.g., average, spread, etc.) differs between variables. In this case, although not overly exciting, we can quickly see that “Watermelon” appears to be the fruit most highly consumed in this population, that “Orange” is left-skewed and may need transformation before analysis, and that “Apricot” is bi-modal. In some instances, such as survey data, ridgeline plots can very quickly show tendencies and trends within a population.


# Load Libraries
library(ggridges)
library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)

# Convert example data from wide to long format
combined_data_long <- gather(combined_data, fruit, value, Apple:Watermelon, factor_key = TRUE)
combined_data_long$fruit <- as.character(combined_data_long$fruit)

# Plot using ggplot, reordering the fruits by their values
combined_data_long %>%
  mutate(fruit = fct_reorder(fruit, value)) %>%
  ggplot(aes(y = fruit, x = value, fill = fruit)) +
  geom_density_ridges(scale = 0.7, alpha = 0.6) +
  theme_ridges() +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  scale_x_continuous(breaks = seq(0, 100, by = 25), labels = seq(0, 100, by = 25)) +
  xlab("Quantity Consumed") +
  ylab("Types of Fruit")

Now try changing some of the settings, one at a time (e.g., ggplot styling options such as the geom_density_ridges scale, panel.spacing, etc.), to get the desired look. To get a good idea of what is possible with ridgeline plots, have a look at https://r-charts.com/distribution/ggridges and https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html for some examples and ideas.
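For instance, one small tweak (a sketch, reusing combined_data_long from above) is to increase the scale argument so neighbouring ridges overlap more:

# A larger scale makes each ridge taller so it overlaps its neighbours
combined_data_long %>%
  ggplot(aes(y = fruit, x = value, fill = fruit)) +
  geom_density_ridges(scale = 3, alpha = 0.6) +
  theme_ridges() +
  theme(legend.position = "none")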

Circular Bar Plot

A circular bar plot is an easy way to show the levels of lots of different related variables in an efficient way. It can be used for molecular data (e.g., metabolites) or epidemiological data (e.g., average birth weight in each country around the world) to offer an overall perspective of the data and highlight differences between groups within the data. In this case, although not overly exciting, we can use a circular bar plot to show the average level of intake for each fruit in each group and see how the levels differ (or not, in this case) between Group 1 and Group 2. In many ways, it is similar to a radar plot, but when you have lots of variables the circular bar plot is more effective and cleaner because each bar is independent (i.e., not connected to the bars next to it) and you can group and sort the bars to highlight key features.

However, the first thing we have to do is rearrange the data table from a wide format to a long format with 3 columns (“Fruit”, “Mean”, and “Group”); this will make it easier to work with. There are all sorts of visualizations that can be added to circular bar plots, and I encourage you to look around and find some examples that will work best for your data (e.g., https://r-graph-gallery.com/297-circular-barplot-with-groups.html).

# Make long data frame
# (Note: this reuses the names df1 and df2, overwriting the df2 created earlier)

df1 <- data.frame(matrix(ncol = 23, nrow = 3))
df2 <- data.frame(matrix(ncol = 23, nrow = 3))

colnames(df1) <- fruits
rownames(df1) <- c("Fruit", "Mean", "group")

colnames(df2) <- fruits
rownames(df2) <- c("Fruit", "Mean", "group")

# Row 1 holds the fruit names, row 2 the group means, row 3 the group label
df1[1, ] <- fruits
df2[1, ] <- fruits

for (fruit in fruits) {
  df1[2, fruit] <- mean(group1[[fruit]])
  df2[2, fruit] <- mean(group2[[fruit]])
}

df1[3, ] <- "Group1"
df2[3, ] <- "Group2"

# Combine the two groups side by side, then transpose so each fruit/group pair is a row
df12 <- cbind(df1, df2)

long_fruit <- t(df12)
long_fruit <- as.data.frame(long_fruit)
long_fruit$Mean <- as.numeric(long_fruit$Mean)
long_fruit$Mean <- round(long_fruit$Mean, 1)

# Convert group to a factor (needed for nlevels() in the plotting code below)
long_fruit$group <- as.factor(long_fruit$group)

# long_fruit now has the fruit names in "Fruit", each group's mean in "Mean", and a "group" column designating Group 1 or 2.

Now, with the data in a more convenient format, we can make the circular bar plot.


library(tidyverse)

# Insert a gap between groups

empty_bar <- 4
to_add <- data.frame( matrix(NA, empty_bar*nlevels(long_fruit$group), ncol(long_fruit)) )
colnames(to_add) <- colnames(long_fruit)
to_add$group <- rep(levels(long_fruit$group), each=empty_bar)
long_fruit <- rbind(long_fruit, to_add)
long_fruit <- long_fruit %>% arrange(group)
long_fruit$id <- seq(1, nrow(long_fruit))

# Create the labels for each column
label_data <- long_fruit
number_of_bar <- nrow(label_data)
angle <- 90 - 360 * (label_data$id-0.5) /number_of_bar
label_data$hjust <- ifelse( angle < -90, 1, 0)
label_data$angle <- ifelse(angle < -90, angle+180, angle)

# Make the plot (this where things can be customized further)
p <- ggplot(long_fruit, aes(x=as.factor(id), y=Mean, fill=group)) +
  geom_bar(stat="identity", alpha=0.5) +
  ylim(-100,120) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    plot.margin = unit(rep(-1,4), "cm") 
  ) +
  coord_polar() + 
  geom_text(data=label_data, aes(x=id, y=Mean+10, label=Fruit, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=2.5, angle= label_data$angle, inherit.aes = FALSE ) 


p

Now try changing some of the settings, one at a time; the R Graph Gallery has lots of recommendations to get you going, and you can even add a legend, group lines, and error bars. To get a good idea of what is possible with circular bar plots, have a look at https://r-graph-gallery.com/circular-barplot.html.
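For example, one simple starting tweak is to bring the legend back by overriding the legend.position setting on the plot object we already built:

# Re-display the plot with a legend instead of suppressing it
p + theme(legend.position = "bottom")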