🐛Biostatistics Unit 13 Review

13.3 Statistical analysis and modeling with R

🐛Biostatistics
Unit 13 Review

13.3 Statistical analysis and modeling with R

Written by the Fiveable Content Team • Last updated September 2025

🐛Biostatistics

Unit & Topic Study Guides

13.1 Introduction to R and RStudio for biological data analysis

13.2 Data manipulation and visualization using R packages

13.3 Statistical analysis and modeling with R

R is a powerful tool for statistical analysis and modeling in biostatistics. It offers a wide range of functions for descriptive statistics, hypothesis testing, and linear regression. These tools allow students to explore biological data, test hypotheses, and build predictive models.

Generalized Linear Models (GLMs) extend linear regression to handle non-normal data common in biology. GLMs can analyze binary outcomes, count data, and proportions, making them versatile for various biological research questions. R's GLM functions enable easy fitting and interpretation of these complex models.

Descriptive Statistics in R

Calculating Descriptive Statistics with R Functions

R provides a wide range of functions for calculating descriptive statistics
- mean() calculates the arithmetic mean
- median() calculates the median value
- sd() calculates the standard deviation
- var() calculates the variance
- min() and max() calculate the minimum and maximum values
- summary() provides a comprehensive overview of a dataset, including measures of central tendency (mean, median), dispersion (standard deviation, interquartile range), and distribution shape (skewness, kurtosis)

Graphical Methods for Descriptive Statistics

Graphical methods for descriptive statistics help visualize the distribution and characteristics of a dataset
- hist() creates a histogram to display the frequency distribution of a continuous variable
- boxplot() creates a box plot to visualize the five-number summary (minimum, first quartile, median, third quartile, maximum) and identify outliers
- plot() creates a scatter plot to visualize the relationship between two continuous variables
- barplot() creates a bar chart to display the frequencies or proportions of categorical variables

Frequency Tables and Proportions

The table() function is used to create frequency tables for categorical variables
- Frequency tables show the count of observations for each category
- Example: table(iris$Species) creates a frequency table for the "Species" variable in the iris dataset
The prop.table() function calculates proportions from a frequency table
- Proportions represent the relative frequencies of each category
- Example: prop.table(table(iris$Species)) calculates the proportions of each species in the iris dataset

Descriptive Statistics for Subgroups

The aggregate() function allows for descriptive statistics to be calculated for subgroups defined by one or more categorical variables
- Example: aggregate(Sepal.Length ~ Species, data = iris, FUN = mean) calculates the mean sepal length for each species in the iris dataset
The by() function is another way to apply a function, such as mean() or sd(), to subgroups of a dataset defined by one or more factors
- Example: by(iris$Sepal.Length, iris$Species, mean) calculates the mean sepal length for each species in the iris dataset
The tapply() function is used to apply a function across subgroups of a vector, defined by one or more factors
- Example: tapply(iris$Sepal.Length, iris$Species, mean) calculates the mean sepal length for each species in the iris dataset

Hypothesis Testing and Confidence Intervals in R

Common Hypothesis Tests in R

R provides functions for common hypothesis tests
- t.test() performs t-tests for comparing means between one or two groups
  - One-sample t-test compares a sample mean to a hypothesized population mean
  - Two-sample t-test compares means between two independent groups
  - Paired t-test compares means between two related or paired groups
- prop.test() compares proportions between two groups or against a hypothesized value
- chisq.test() performs chi-square tests for independence (relationship between two categorical variables) and goodness-of-fit (comparing observed frequencies to expected frequencies)

Confidence Intervals in R

Confidence intervals provide a range of plausible values for a population parameter with a specified level of confidence
The conf.int() function obtains confidence intervals for various models, such as t-tests and linear regression
- Example: conf.int(t.test(iris$Sepal.Length)) calculates the confidence interval for the mean sepal length in the iris dataset
The binom.test() function performs hypothesis tests and obtains confidence intervals for binomial proportions
- Example: binom.test(sum(iris$Species == "setosa"), nrow(iris)) tests the proportion of setosa species in the iris dataset and provides a confidence interval

Non-parametric Alternatives to t-tests

The wilcox.test() function performs non-parametric Wilcoxon rank-sum (for independent groups) and signed-rank (for paired groups) tests as alternatives to t-tests when normality assumptions are violated
- Wilcoxon tests compare the medians or distributions between groups
- Example: wilcox.test(Sepal.Length ~ Species, data = iris, subset = Species %in% c("setosa", "versicolor")) compares sepal lengths between setosa and versicolor species using the Wilcoxon rank-sum test

Linear Regression Modeling in R

Fitting Linear Regression Models

The lm() function fits linear regression models in R
- The formula argument specifies the dependent variable and independent variables using the syntax dependent_variable ~ independent_variable1 + independent_variable2 + ...
- Example: lm(Sepal.Length ~ Petal.Length, data = iris) fits a linear regression model with sepal length as the dependent variable and petal length as the independent variable
The summary() function applied to an lm object provides a comprehensive overview of the model
- Coefficients, standard errors, t-values, and p-values for each predictor
- Residual standard error, R-squared, and adjusted R-squared
- F-statistic and p-value for the overall model significance

Interpreting Linear Regression Coefficients

The coefficients of a linear regression model represent the change in the dependent variable for a one-unit increase in the corresponding independent variable, holding other variables constant
- The intercept represents the expected value of the dependent variable when all independent variables are zero
- Example: In the model lm(Sepal.Length ~ Petal.Length, data = iris), the coefficient for petal length represents the change in sepal length for a one-unit increase in petal length

Model Diagnostics and Predictions

The residuals() function extracts the residuals from a fitted linear model
- Residuals can be used to assess model assumptions, such as normality (using a Q-Q plot or histogram) and homoscedasticity (using a residual plot)
The predict() function obtains predicted values from a fitted linear model
- Predictions can be made for the original data or for new data points
- Example: predict(lm(Sepal.Length ~ Petal.Length, data = iris), newdata = data.frame(Petal.Length = 5)) predicts the sepal length for a flower with a petal length of 5 cm

Model Comparison and ANOVA

The anova() function performs analysis of variance (ANOVA) for comparing nested linear models and assessing the significance of individual predictors
- Nested models are models where one model is a subset of the other
- Example: anova(lm(Sepal.Length ~ Petal.Length, data = iris), lm(Sepal.Length ~ Petal.Length + Species, data = iris)) compares a model with petal length as the only predictor to a model with both petal length and species as predictors
The confint() function calculates confidence intervals for the coefficients of a linear regression model
- Example: confint(lm(Sepal.Length ~ Petal.Length, data = iris)) provides confidence intervals for the intercept and slope coefficients

Generalized Linear Models for Biological Data

Introduction to Generalized Linear Models (GLMs)

Generalized linear models extend linear regression to accommodate response variables with non-normal distributions
- Binary data (e.g., presence/absence, success/failure)
- Count data (e.g., number of species, number of mutations)
- Proportion data (e.g., proportion of infected individuals, proportion of cells expressing a gene)
The glm() function fits GLMs in R
- The family argument specifies the distribution and link function for the response variable
- Common families include binomial(link = "logit") for binary data, poisson(link = "log") for count data, and Gamma(link = "inverse") for strictly positive continuous data

Interpreting GLM Coefficients

The coefficients of a GLM are interpreted based on the link function used
- For logistic regression (binomial family with logit link), coefficients represent the change in the log-odds of the response variable for a one-unit increase in the corresponding predictor
- For Poisson regression (Poisson family with log link), coefficients represent the change in the log-mean of the response variable for a one-unit increase in the corresponding predictor
- For Gamma regression (Gamma family with inverse link), coefficients represent the change in the mean of the response variable for a one-unit increase in the corresponding predictor
The summary() function applied to a glm object provides model coefficients, standard errors, z-values, and p-values, as well as deviance and AIC for model comparison

Predictions and Confidence Intervals

The predict() function obtains predicted values from a fitted GLM
- The type = "response" argument returns predictions on the response scale (e.g., probabilities for logistic regression, counts for Poisson regression)
- The type = "link" argument returns predictions on the linear predictor scale (e.g., log-odds for logistic regression, log-mean for Poisson regression)
The confint() function calculates confidence intervals for the coefficients of a GLM
- Confidence intervals are provided on the log-odds scale for logistic regression, log-mean scale for Poisson regression, or mean scale for Gamma regression

Model Comparison and Hypothesis Testing

The anova() function can be used to compare nested GLMs using likelihood ratio tests or to assess the significance of individual predictors using Wald tests
- Likelihood ratio tests compare the goodness-of-fit of two nested models
- Wald tests assess the significance of individual coefficients by comparing their estimates to their standard errors
Example: anova(glm(Species == "setosa" ~ Petal.Length, data = iris, family = binomial), glm(Species == "setosa" ~ Petal.Length + Sepal.Width, data = iris, family = binomial), test = "LRT") compares a logistic regression model with petal length as the only predictor to a model with both petal length and sepal width as predictors using a likelihood ratio test

🐛Biostatistics Unit 13 Review

13.3 Statistical analysis and modeling with R

🐛Biostatistics Unit 13 Review

13.3 Statistical analysis and modeling with R

Unit & Topic Study Guides

Descriptive Statistics in R

Calculating Descriptive Statistics with R Functions

Graphical Methods for Descriptive Statistics

Frequency Tables and Proportions

Descriptive Statistics for Subgroups

Hypothesis Testing and Confidence Intervals in R

Common Hypothesis Tests in R

Confidence Intervals in R

Non-parametric Alternatives to t-tests

Linear Regression Modeling in R

Fitting Linear Regression Models

Interpreting Linear Regression Coefficients

Model Diagnostics and Predictions

Model Comparison and ANOVA

Generalized Linear Models for Biological Data

Introduction to Generalized Linear Models (GLMs)

Interpreting GLM Coefficients

Predictions and Confidence Intervals

Model Comparison and Hypothesis Testing

history

social science

english & capstone

arts

science

math & computer science

world languages

high school exams

honors classes

college classes

🐛Biostatistics
Unit 13 Review