R is a powerful tool for statistical analysis and modeling in biostatistics. It offers a wide range of functions for descriptive statistics, hypothesis testing, and linear regression. These tools allow students to explore biological data, test hypotheses, and build predictive models.
Generalized Linear Models (GLMs) extend linear regression to handle non-normal data common in biology. GLMs can analyze binary outcomes, count data, and proportions, making them versatile for various biological research questions. R's GLM functions enable easy fitting and interpretation of these complex models.
Descriptive Statistics in R
Calculating Descriptive Statistics with R Functions
- R provides a wide range of functions for calculating descriptive statistics
mean()
calculates the arithmetic meanmedian()
calculates the median valuesd()
calculates the standard deviationvar()
calculates the variancemin()
andmax()
calculate the minimum and maximum valuessummary()
provides a comprehensive overview of a dataset, including measures of central tendency (mean, median), dispersion (standard deviation, interquartile range), and distribution shape (skewness, kurtosis)
Graphical Methods for Descriptive Statistics
- Graphical methods for descriptive statistics help visualize the distribution and characteristics of a dataset
hist()
creates a histogram to display the frequency distribution of a continuous variableboxplot()
creates a box plot to visualize the five-number summary (minimum, first quartile, median, third quartile, maximum) and identify outliersplot()
creates a scatter plot to visualize the relationship between two continuous variablesbarplot()
creates a bar chart to display the frequencies or proportions of categorical variables
Frequency Tables and Proportions
- The
table()
function is used to create frequency tables for categorical variables- Frequency tables show the count of observations for each category
- Example:
table(iris$Species)
creates a frequency table for the "Species" variable in the iris dataset
- The
prop.table()
function calculates proportions from a frequency table- Proportions represent the relative frequencies of each category
- Example:
prop.table(table(iris$Species))
calculates the proportions of each species in the iris dataset
Descriptive Statistics for Subgroups
- The
aggregate()
function allows for descriptive statistics to be calculated for subgroups defined by one or more categorical variables- Example:
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
calculates the mean sepal length for each species in the iris dataset
- Example:
- The
by()
function is another way to apply a function, such asmean()
orsd()
, to subgroups of a dataset defined by one or more factors- Example:
by(iris$Sepal.Length, iris$Species, mean)
calculates the mean sepal length for each species in the iris dataset
- Example:
- The
tapply()
function is used to apply a function across subgroups of a vector, defined by one or more factors- Example:
tapply(iris$Sepal.Length, iris$Species, mean)
calculates the mean sepal length for each species in the iris dataset
- Example:
Hypothesis Testing and Confidence Intervals in R
Common Hypothesis Tests in R
- R provides functions for common hypothesis tests
t.test()
performs t-tests for comparing means between one or two groups- One-sample t-test compares a sample mean to a hypothesized population mean
- Two-sample t-test compares means between two independent groups
- Paired t-test compares means between two related or paired groups
prop.test()
compares proportions between two groups or against a hypothesized valuechisq.test()
performs chi-square tests for independence (relationship between two categorical variables) and goodness-of-fit (comparing observed frequencies to expected frequencies)
Confidence Intervals in R
- Confidence intervals provide a range of plausible values for a population parameter with a specified level of confidence
- The
conf.int()
function obtains confidence intervals for various models, such as t-tests and linear regression- Example:
conf.int(t.test(iris$Sepal.Length))
calculates the confidence interval for the mean sepal length in the iris dataset
- Example:
- The
binom.test()
function performs hypothesis tests and obtains confidence intervals for binomial proportions- Example:
binom.test(sum(iris$Species == "setosa"), nrow(iris))
tests the proportion of setosa species in the iris dataset and provides a confidence interval
- Example:
Non-parametric Alternatives to t-tests
- The
wilcox.test()
function performs non-parametric Wilcoxon rank-sum (for independent groups) and signed-rank (for paired groups) tests as alternatives to t-tests when normality assumptions are violated- Wilcoxon tests compare the medians or distributions between groups
- Example:
wilcox.test(Sepal.Length ~ Species, data = iris, subset = Species %in% c("setosa", "versicolor"))
compares sepal lengths between setosa and versicolor species using the Wilcoxon rank-sum test
Linear Regression Modeling in R
Fitting Linear Regression Models
- The
lm()
function fits linear regression models in R- The formula argument specifies the dependent variable and independent variables using the syntax
dependent_variable ~ independent_variable1 + independent_variable2 + ...
- Example:
lm(Sepal.Length ~ Petal.Length, data = iris)
fits a linear regression model with sepal length as the dependent variable and petal length as the independent variable
- The formula argument specifies the dependent variable and independent variables using the syntax
- The
summary()
function applied to anlm
object provides a comprehensive overview of the model- Coefficients, standard errors, t-values, and p-values for each predictor
- Residual standard error, R-squared, and adjusted R-squared
- F-statistic and p-value for the overall model significance
Interpreting Linear Regression Coefficients
- The coefficients of a linear regression model represent the change in the dependent variable for a one-unit increase in the corresponding independent variable, holding other variables constant
- The intercept represents the expected value of the dependent variable when all independent variables are zero
- Example: In the model
lm(Sepal.Length ~ Petal.Length, data = iris)
, the coefficient for petal length represents the change in sepal length for a one-unit increase in petal length
Model Diagnostics and Predictions
- The
residuals()
function extracts the residuals from a fitted linear model- Residuals can be used to assess model assumptions, such as normality (using a Q-Q plot or histogram) and homoscedasticity (using a residual plot)
- The
predict()
function obtains predicted values from a fitted linear model- Predictions can be made for the original data or for new data points
- Example:
predict(lm(Sepal.Length ~ Petal.Length, data = iris), newdata = data.frame(Petal.Length = 5))
predicts the sepal length for a flower with a petal length of 5 cm
Model Comparison and ANOVA
- The
anova()
function performs analysis of variance (ANOVA) for comparing nested linear models and assessing the significance of individual predictors- Nested models are models where one model is a subset of the other
- Example:
anova(lm(Sepal.Length ~ Petal.Length, data = iris), lm(Sepal.Length ~ Petal.Length + Species, data = iris))
compares a model with petal length as the only predictor to a model with both petal length and species as predictors
- The
confint()
function calculates confidence intervals for the coefficients of a linear regression model- Example:
confint(lm(Sepal.Length ~ Petal.Length, data = iris))
provides confidence intervals for the intercept and slope coefficients
- Example:
Generalized Linear Models for Biological Data
Introduction to Generalized Linear Models (GLMs)
- Generalized linear models extend linear regression to accommodate response variables with non-normal distributions
- Binary data (e.g., presence/absence, success/failure)
- Count data (e.g., number of species, number of mutations)
- Proportion data (e.g., proportion of infected individuals, proportion of cells expressing a gene)
- The
glm()
function fits GLMs in R- The family argument specifies the distribution and link function for the response variable
- Common families include
binomial(link = "logit")
for binary data,poisson(link = "log")
for count data, andGamma(link = "inverse")
for strictly positive continuous data
Interpreting GLM Coefficients
- The coefficients of a GLM are interpreted based on the link function used
- For logistic regression (binomial family with logit link), coefficients represent the change in the log-odds of the response variable for a one-unit increase in the corresponding predictor
- For Poisson regression (Poisson family with log link), coefficients represent the change in the log-mean of the response variable for a one-unit increase in the corresponding predictor
- For Gamma regression (Gamma family with inverse link), coefficients represent the change in the mean of the response variable for a one-unit increase in the corresponding predictor
- The
summary()
function applied to aglm
object provides model coefficients, standard errors, z-values, and p-values, as well as deviance and AIC for model comparison
Predictions and Confidence Intervals
- The
predict()
function obtains predicted values from a fitted GLM- The
type = "response"
argument returns predictions on the response scale (e.g., probabilities for logistic regression, counts for Poisson regression) - The
type = "link"
argument returns predictions on the linear predictor scale (e.g., log-odds for logistic regression, log-mean for Poisson regression)
- The
- The
confint()
function calculates confidence intervals for the coefficients of a GLM- Confidence intervals are provided on the log-odds scale for logistic regression, log-mean scale for Poisson regression, or mean scale for Gamma regression
Model Comparison and Hypothesis Testing
- The
anova()
function can be used to compare nested GLMs using likelihood ratio tests or to assess the significance of individual predictors using Wald tests- Likelihood ratio tests compare the goodness-of-fit of two nested models
- Wald tests assess the significance of individual coefficients by comparing their estimates to their standard errors
- Example:
anova(glm(Species == "setosa" ~ Petal.Length, data = iris, family = binomial), glm(Species == "setosa" ~ Petal.Length + Sepal.Width, data = iris, family = binomial), test = "LRT")
compares a logistic regression model with petal length as the only predictor to a model with both petal length and sepal width as predictors using a likelihood ratio test