🫁Intro to Biostatistics Unit 2 Review

2.2 Probability distributions

Written by the Fiveable Content Team • Last updated September 2025

Probability distributions are essential tools in biostatistics, enabling researchers to model and analyze various biological phenomena. Understanding different types of distributions helps in selecting appropriate statistical tests and interpreting results in medical research and clinical trials.

From discrete to continuous, univariate to multivariate, each distribution type serves specific purposes in biostatistical analysis. Properties like central tendency, variability, and shape provide crucial insights into data behavior, guiding researchers in study design and statistical interpretation.

Types of probability distributions

Probability distributions form the foundation of statistical inference in biostatistics
They are commonly classified by the kind of values a variable can take (discrete or continuous) and by the number of variables modeled jointly (univariate or multivariate)

Discrete vs continuous distributions

Discrete distributions deal with countable outcomes (whole numbers) common in biostatistical studies (number of patients, disease occurrences)
Continuous distributions represent variables that can take any value within a range, often used for measurements in medical research (blood pressure, drug concentration)
Discrete distributions use probability mass functions while continuous distributions employ probability density functions
Examples of discrete distributions include Binomial (success/failure in clinical trials) and Poisson (rare disease occurrences)
Continuous distributions encompass Normal (height, weight) and Exponential (waiting times between events in healthcare)

Univariate vs multivariate distributions

Univariate distributions describe the probability of a single random variable, frequently used in basic biostatistical analyses
Multivariate distributions model the joint probability of two or more variables, essential for complex medical studies
Univariate distributions help analyze individual patient characteristics (age, BMI)
Multivariate distributions enable the study of relationships between multiple health factors (blood pressure and cholesterol levels)
Covariance and correlation play crucial roles in understanding multivariate distributions in biostatistical research

Properties of distributions

Distribution properties provide insights into data behavior, guiding statistical analysis and interpretation in biomedical research
Understanding these properties helps researchers choose appropriate statistical methods and make informed decisions in study design

Measures of central tendency

Mean represents the average value, widely used in biostatistics to summarize data (average patient age in a clinical trial)
Median indicates the middle value, useful for skewed distributions (median survival time in cancer studies)
Mode shows the most frequent value, applicable in discrete data analysis (most common side effect in drug trials)
Geometric mean calculates the central tendency for data with multiplicative relationships (bacterial growth rates)
Harmonic mean is suited to averaging rates and speeds in physiological studies, and is sometimes applied to reaction times in neuroscience experiments
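
As a minimal sketch of the measures listed above, Python's standard `statistics` module can compute each one. All data values below are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical ages (years) of ten trial participants; one high value skews the data.
ages = [34, 41, 41, 45, 52, 58, 60, 63, 67, 99]

mean_age = statistics.mean(ages)      # arithmetic mean, pulled upward by the 99
median_age = statistics.median(ages)  # middle value, less affected by the 99
mode_age = statistics.mode(ages)      # most frequent value

# Hypothetical hourly fold changes in a bacterial count (multiplicative data).
fold_changes = [2.0, 4.0, 8.0]
geo_mean = statistics.geometric_mean(fold_changes)

# Hypothetical infusion rates (mL/h) applied to two equal volumes.
rates = [40.0, 60.0]
harm_mean = statistics.harmonic_mean(rates)

print(mean_age, median_age, mode_age, geo_mean, harm_mean)
```

The harmonic mean of 40 and 60 is 48, which is the overall rate when the same volume is delivered at each rate; the arithmetic mean (50) would overstate it.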

Measures of variability

Standard deviation quantifies the spread of data around the mean, crucial for assessing variability in medical measurements
Variance, the squared standard deviation, used in statistical tests and ANOVA in biomedical research
Range provides a simple measure of spread, indicating the difference between the highest and lowest values in a dataset
Interquartile range (IQR) measures spread in the middle 50% of data, robust to outliers in clinical data
Coefficient of variation (CV) allows comparison of variability between different scales, useful in comparing lab test precision
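
The spread measures above can be sketched with the standard `statistics` module. The glucose readings are hypothetical, and the quartiles use the module's default ("exclusive") method; other software may use a different quartile convention and give a slightly different IQR.

```python
import statistics

# Hypothetical repeated fasting glucose readings (mmol/L) from one assay.
values = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2, 5.0, 5.1]

sd = statistics.stdev(values)          # sample standard deviation (n - 1 denominator)
var = statistics.variance(values)      # sample variance, equal to sd squared
data_range = max(values) - min(values) # highest minus lowest value

q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1                          # spread of the middle 50% of the data

cv = sd / statistics.mean(values)      # unitless, so comparable across scales

print(f"SD={sd:.3f} var={var:.3f} range={data_range:.1f} IQR={iqr:.3f} CV={cv:.1%}")
```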

Skewness and kurtosis

Skewness measures the asymmetry of a distribution, important for identifying non-normal data in biostatistics
Positive skew indicates a long right tail (rare but extreme high values in drug response studies)
Negative skew shows a long left tail (occasional very low values in physiological measurements)
Kurtosis quantifies the "tailedness" of a distribution, affecting the reliability of statistical tests
Leptokurtic distributions have heavier tails, often seen in gene expression data
Platykurtic distributions have lighter tails, sometimes observed in anthropometric measurements
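
One common way to quantify these shape properties is through standardized central moments. The sketch below uses the simple population (biased) moment formulas; statistical packages often apply small-sample corrections, so their values can differ slightly. The function names and data are illustrative assumptions.

```python
import statistics

def central_moment(xs, order):
    """Average of (x - mean)**order over the data."""
    m = statistics.fmean(xs)
    return sum((x - m) ** order for x in xs) / len(xs)

def skewness(xs):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]   # one extreme high value

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

Positive excess kurtosis corresponds to the leptokurtic case and negative excess kurtosis to the platykurtic case described above.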

Discrete probability distributions

Discrete distributions model countable outcomes in biostatistical research, essential for analyzing categorical data and event counts
These distributions play a crucial role in designing and interpreting results from clinical trials and epidemiological studies

Bernoulli distribution

Models a single trial with two possible outcomes (success or failure)
Probability mass function given by $P(X=x) = p^x(1-p)^{1-x}$ where x is 0 or 1
Used in modeling presence/absence of a disease or treatment response in individual patients
Mean of Bernoulli distribution equals p, the probability of success
Variance calculated as $p(1-p)$ , important for determining sample size in clinical trials

Binomial distribution

Represents the number of successes in a fixed number of independent Bernoulli trials
Probability mass function $P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$
Widely used in clinical trials to model the number of patients responding to a treatment
Mean of binomial distribution is np, where n is the number of trials
Variance given by $np(1-p)$ , crucial for power calculations in study design
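
The binomial formulas above can be checked numerically. This is a small sketch with a hypothetical trial of n = 20 patients and response probability p = 0.3; setting n = 1 recovers the Bernoulli case.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.3  # hypothetical trial size and response probability
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                               # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))                 # should equal n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # n * p * (1 - p)

print(total, mean, variance)
```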

Poisson distribution

Models the number of events occurring in a fixed interval of time or space
Probability mass function $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$
Applied in rare disease occurrence studies and modeling adverse events in drug safety
Mean and variance both equal to λ, the rate parameter
Approximates binomial distribution when n is large and p is small, with λ = np
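
To illustrate the approximation, the sketch below compares binomial and Poisson probabilities for a hypothetical rare event (n = 10,000 exposures, p = 0.0003, so λ = np = 3). The specific numbers are assumptions for illustration.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10_000, 0.0003  # hypothetical: many patients, rare adverse event
lam = n * p            # matching Poisson rate

# Largest pointwise gap between the two PMFs over k = 0..20.
max_diff = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))

print(f"largest difference in P(X = k): {max_diff:.2e}")
```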

Negative binomial distribution

Describes the number of failures before a specified number of successes occur
Used in modeling the number of disease-free days before a relapse in chronic conditions
Probability mass function $P(X=k) = \binom{k+r-1}{k}p^r(1-p)^k$
Mean given by $\frac{r(1-p)}{p}$ , where r is the number of successes
Variance calculated as $\frac{r(1-p)}{p^2}$ , often used in overdispersed count data analysis
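
A quick numerical check of the negative binomial mean and variance formulas, using the failures-before-r-successes parameterization given above. The values r = 3 and p = 0.4 are arbitrary, and the infinite sum is truncated at 400 terms, where the remaining probability is negligible.

```python
from math import comb

def neg_binomial_pmf(k, r, p):
    """P(X = k): k failures before the r-th success, success probability p."""
    return comb(k + r - 1, k) * p ** r * (1 - p) ** k

r, p = 3, 0.4  # arbitrary illustrative parameters
support = range(400)  # truncation of the infinite support

total = sum(neg_binomial_pmf(k, r, p) for k in support)
mean = sum(k * neg_binomial_pmf(k, r, p) for k in support)
variance = sum((k - mean) ** 2 * neg_binomial_pmf(k, r, p) for k in support)

print(mean, r * (1 - p) / p)            # both about 4.5
print(variance, r * (1 - p) / p ** 2)   # both about 11.25
```

The variance exceeds the mean, which is why this distribution is a common choice for overdispersed counts where a Poisson model (variance equal to mean) fits poorly.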

Continuous probability distributions

Continuous distributions model variables that can take any value within a range, crucial for analyzing measurements in biomedical research
These distributions underpin many statistical tests and models used in biostatistics, from t-tests to regression analysis

Normal distribution

Symmetric, bell-shaped distribution fundamental to many statistical methods in biostatistics
Probability density function $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Characterized by mean (μ) and standard deviation (σ)
Central to the Central Limit Theorem, justifying many parametric tests in large samples
Z-scores derived from normal distribution used for standardizing and comparing different scales
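
As a brief sketch, the standard library's `statistics.NormalDist` can evaluate normal probabilities and z-scores. The systolic blood pressure parameters (mean 120, SD 15 mmHg) are hypothetical.

```python
from statistics import NormalDist

# Hypothetical population: systolic blood pressure ~ Normal(mean=120, sd=15) in mmHg.
sbp = NormalDist(mu=120, sigma=15)

# Z-score: how many standard deviations a reading of 150 lies above the mean.
z = (150 - sbp.mean) / sbp.stdev

# Probability of a reading above 150, from the cumulative distribution function.
p_above_150 = 1 - sbp.cdf(150)

# Probability of falling within one standard deviation of the mean.
p_within_1sd = sbp.cdf(135) - sbp.cdf(105)

print(z, round(p_above_150, 4), round(p_within_1sd, 4))
```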

Student's t-distribution

Similar to normal distribution but with heavier tails, crucial for small sample inference
Used in t-tests and confidence intervals for means in biomedical studies
Shape determined by degrees of freedom, approaching normal distribution as df increases
Probability density function involves gamma functions and is more complex than normal
Critical in analyzing small sample sizes common in early-phase clinical trials

Chi-square distribution

Arises from the sum of squared independent standard normal variables
Degrees of freedom determine the shape of the distribution
Used in goodness-of-fit tests and analysis of categorical data in epidemiology
Forms the basis for tests of independence in contingency tables (e.g., case-control studies)
Plays a role in confidence intervals for population variance in laboratory studies

F-distribution

Ratio of two independent chi-square variables, each divided by its degrees of freedom
Fundamental to Analysis of Variance (ANOVA), widely used in comparing multiple groups
Shape determined by two parameters: degrees of freedom for numerator and denominator
Critical in assessing the significance of added variables in multiple regression models
Used in testing equality of variances (e.g., assessing homogeneity in multi-center trials)

Sampling distributions

Sampling distributions describe the behavior of sample statistics, crucial for inferential statistics in biomedical research
Understanding these distributions enables researchers to make inferences about population parameters from sample data

Distribution of sample mean

Describes the variability of sample means across different samples from the same population
For normal populations, sample mean follows a normal distribution regardless of sample size
Standard error of the mean (SEM) is the standard deviation of this sampling distribution
SEM decreases as sample size increases, improving precision of estimates in larger studies
Forms the basis for constructing confidence intervals for population means in clinical research

Central limit theorem

States that the sampling distribution of the mean approaches a normal distribution as sample size increases
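Applies to independent observations from essentially any population distribution with finite variance; heavy-tailed distributions without a finite variance (such as the Cauchy) are exceptions
Crucial for justifying the use of parametric tests in large samples, even for non-normal data
A sample size above 30 is a common rule of thumb, though strongly skewed or heavy-tailed data may need considerably larger samples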
Enables the use of z-scores and normal probabilities in inferential statistics
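
The theorem can be illustrated with a small simulation. This sketch draws repeated samples from a strongly right-skewed exponential population (mean 1, standard deviation 1) and looks at the distribution of their means; the sample size, number of repetitions, and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

def sample_mean(n):
    """Mean of n draws from an exponential population with mean 1 and SD 1."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

n = 50             # size of each sample
repetitions = 2000 # number of simulated studies
means = [sample_mean(n) for _ in range(repetitions)]

center = statistics.fmean(means)  # expected to be near the population mean, 1
spread = statistics.stdev(means)  # expected to be near 1 / sqrt(50), about 0.141

print(f"mean of sample means: {center:.3f}, SD of sample means: {spread:.3f}")
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the individual observations are far from normal.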

Standard error

Measures the variability of a sample statistic (e.g., mean, proportion) across different samples
Calculated as the standard deviation of the sampling distribution
For means, standard error = $\frac{\sigma}{\sqrt{n}}$ , where σ is population standard deviation
Decreases as sample size increases, reflecting increased precision in larger studies
Used in constructing confidence intervals and conducting hypothesis tests in biostatistics
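
A short worked example of the standard error formula and the resulting confidence interval. The sample mean, σ, and n are hypothetical, and σ is treated as known; when it is estimated from the sample, a t-based interval would normally be used instead.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical study: n = 36 patients, sample mean 128 mmHg, known sigma = 12 mmHg.
sigma, n, sample_mean = 12.0, 36, 128.0

sem = sigma / sqrt(n)             # standard error of the mean
z = NormalDist().inv_cdf(0.975)   # about 1.96 for a two-sided 95% interval

ci_lower = sample_mean - z * sem
ci_upper = sample_mean + z * sem

print(f"SEM = {sem:.2f}, 95% CI = ({ci_lower:.2f}, {ci_upper:.2f})")
```

Quadrupling the sample size to 144 would halve the standard error, which is the precision gain described above.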

Probability density functions

Probability density functions (PDFs) and their discrete counterparts are fundamental tools for describing and analyzing probability distributions in biostatistics
These functions enable the calculation of probabilities and form the basis for many statistical inference techniques

Probability mass function

Describes the probability distribution for discrete random variables
Gives the probability that a discrete random variable equals a specific value
Sum of probabilities over all possible values equals 1
Used in modeling count data (number of adverse events, disease occurrences)
Forms the basis for likelihood calculations in discrete data analysis

Cumulative distribution function

Represents the probability that a random variable takes a value less than or equal to a given value
Applies to both discrete and continuous distributions
For continuous distributions, CDF is the integral of the probability density function
Used in calculating percentiles and quantiles in biostatistical analyses
Critical in survival analysis for estimating probabilities of events occurring by certain times
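
For a discrete variable, the CDF is a running sum of the probability mass function, and a percentile is the smallest value at which the CDF reaches the target probability. The sketch below assumes a Poisson count with a hypothetical rate of 2 events per interval; the helper names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k): cumulative sum of the PMF from 0 through k."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))

def poisson_percentile(q, lam):
    """Smallest k with P(X <= k) >= q."""
    k = 0
    while poisson_cdf(k, lam) < q:
        k += 1
    return k

lam = 2.0  # hypothetical mean number of adverse events per interval
p_at_most_4 = poisson_cdf(4, lam)
pct_95 = poisson_percentile(0.95, lam)

print(round(p_at_most_4, 4), pct_95)
```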

Applications in biostatistics

Probability distributions find extensive applications in various areas of biostatistics, from epidemiology to clinical trials
Understanding these applications helps researchers choose appropriate statistical methods and interpret results accurately

Disease prevalence estimation

Binomial distribution used to model the number of disease cases in a population sample
Normal approximation to binomial enables confidence interval calculation for large samples, provided the expected numbers of cases and non-cases are both reasonably large
Beta distribution often used as a prior in Bayesian estimation of disease prevalence
Poisson distribution applied in rare disease prevalence studies
Negative binomial distribution useful for overdispersed count data in disease mapping
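
As an illustration of the normal approximation, here is a sketch of a simple (Wald) confidence interval for prevalence. The survey counts are hypothetical, and the Wald interval is only one of several methods; it can perform poorly for small samples or prevalences near 0 or 1.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical survey: 48 cases found among 400 sampled individuals.
cases, n = 48, 400

prevalence = cases / n                          # point estimate
se = sqrt(prevalence * (1 - prevalence) / n)    # binomial standard error
z = NormalDist().inv_cdf(0.975)                 # about 1.96 for 95% confidence

lower = prevalence - z * se
upper = prevalence + z * se

print(f"prevalence = {prevalence:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```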

Clinical trial outcomes

Bernoulli trials model individual patient outcomes (success/failure) in clinical trials
Binomial distribution describes the number of successes in fixed-size trials
Normal distribution approximates treatment effects in large randomized controlled trials
Student's t-distribution used for small sample inference in early-phase trials
Survival distributions (Weibull, exponential) model time-to-event outcomes in long-term studies

Survival analysis

Exponential distribution models constant hazard rates in survival studies
Weibull distribution allows for hazard rates that increase or decrease monotonically over time, depending on its shape parameter
Log-normal distribution used for modeling survival times whose hazard rises to an early peak and then declines
Gamma distribution provides flexible modeling of survival times in complex scenarios
Cox proportional hazards model is semi-parametric: it estimates covariate effects by partial likelihood and leaves the baseline hazard distribution unspecified
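
The difference between constant and time-varying hazards can be made concrete with the Weibull hazard function, written here in the common shape/scale parameterization (an assumption, since the text does not fix one). A shape of 1 reduces to the exponential distribution.

```python
from math import exp

def weibull_hazard(t, shape, scale):
    """Instantaneous event rate at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale):
    """Probability of remaining event-free beyond time t."""
    return exp(-((t / scale) ** shape))

scale = 5.0  # hypothetical characteristic time, e.g. in years
for shape in (0.5, 1.0, 2.0):
    early = weibull_hazard(1.0, shape, scale)
    late = weibull_hazard(10.0, shape, scale)
    print(f"shape={shape}: hazard at t=1 is {early:.3f}, at t=10 is {late:.3f}")
```

With shape 0.5 the hazard falls over time, with shape 1 it stays constant at 1/scale, and with shape 2 it rises.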

Transformations of distributions

Transformations help in dealing with non-normal data, enabling the use of parametric methods and improving model fit in biostatistical analyses
Understanding these transformations is crucial for handling skewed or heteroscedastic data common in biomedical research

Log-normal distribution

Arises when the logarithm of a variable follows a normal distribution
Often used for modeling biological variables with positive skew (drug concentrations, antibody levels)
Probability density function involves the natural logarithm of the variable
Geometric mean and geometric standard deviation are key parameters
Useful in pharmacokinetic studies and modeling growth rates in microbiology

Box-Cox transformation

Family of power transformations to approximate normal distribution
Includes logarithmic transformation as a special case
Formula: $y(\lambda) = \frac{y^\lambda - 1}{\lambda}$ for λ ≠ 0, and $\ln(y)$ for λ = 0, defined for positive y
Optimal λ chosen to maximize normality, often through maximum likelihood estimation
Applied in regression analysis to stabilize variance and improve model fit in biomedical data
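
The transformation formula can be written directly as a small function. This sketch only applies the transform for a given λ; it does not estimate the optimal λ, and the function name and data are illustrative.

```python
from math import log

def box_cox(y, lam):
    """Box-Cox transform of a single positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return log(y)              # limiting case as lambda approaches 0
    return (y ** lam - 1) / lam

# Hypothetical right-skewed measurements (e.g. drug concentrations).
concentrations = [1.0, 2.0, 4.0, 8.0, 64.0]

log_scale = [box_cox(y, 0) for y in concentrations]     # log transform
sqrt_scale = [box_cox(y, 0.5) for y in concentrations]  # square-root-type transform

print(log_scale)
print(sqrt_scale)
```

For λ close to 0 the power formula approaches the logarithm, which is why the log transform is treated as the λ = 0 member of the family.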

Goodness-of-fit tests

Goodness-of-fit tests assess how well observed data conform to a theoretical probability distribution
These tests are crucial in validating distributional assumptions underlying many statistical methods in biostatistics

Kolmogorov-Smirnov test

Non-parametric test comparing the cumulative distribution of sample data to a reference distribution
Calculates the maximum distance between empirical and theoretical cumulative distributions
Sensitive to differences in both location and shape of the distributions
Used for testing normality and other continuous distributions in biomedical data
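Limitations include reduced sensitivity to differences in the tails, an assumption that the reference distribution is continuous, and standard critical values that no longer hold when the distribution's parameters are estimated from the same sample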
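
The test statistic described above, the maximum distance between the empirical and theoretical CDFs, can be sketched as follows. This computes only the one-sample statistic D against a fully specified reference distribution; turning D into a p-value requires the Kolmogorov distribution, which is omitted here. The function name and sample are illustrative.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and the reference cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

reference = NormalDist(mu=0, sigma=1)

# Artificial sample placed at evenly spaced normal quantiles (a near-perfect fit).
n = 10
sample = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]

d_good = ks_statistic(sample, reference.cdf)
d_shifted = ks_statistic([x + 1.0 for x in sample], reference.cdf)  # shifted by one SD

print(round(d_good, 3), round(d_shifted, 3))
```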

Anderson-Darling test

Modification of the Kolmogorov-Smirnov test with greater sensitivity to tail differences
Gives more weight to the tails of the distribution in the test statistic calculation
Often preferred for testing normality in biostatistical applications
Generally more powerful than Kolmogorov-Smirnov for detecting departures from normality, particularly in the tails
Critical values depend on the specific distribution being tested

Multivariate distributions

Multivariate distributions model the joint behavior of two or more random variables, essential for analyzing complex relationships in biomedical data
Understanding these distributions is crucial for advanced statistical techniques like multivariate regression and factor analysis

Bivariate normal distribution

Extension of univariate normal distribution to two dimensions
Characterized by means, standard deviations, and correlation coefficient of two variables
Probability density function is the exponential of a quadratic form in the two variables, determined by the inverse of the covariance matrix
Contours of equal probability form ellipses in the two-dimensional plane
Used in modeling paired measurements (systolic and diastolic blood pressure)
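
For reference, the bivariate normal density can be written out in terms of the parameters listed above. The symbols $\mu_x, \mu_y$ (means), $\sigma_x, \sigma_y$ (standard deviations), and $\rho$ (correlation coefficient) are notation introduced here for illustration:

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right)$$

Setting the bracketed quadratic expression equal to a constant gives the elliptical contours of equal density, and with $\rho = 0$ the density factors into the product of two univariate normal densities.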

Multinomial distribution

Generalization of the binomial distribution to multiple categories
Models the probability of counts in several categories with fixed total count
Probability mass function involves multinomial coefficients and category probabilities
Applied in analyzing multi-category outcomes in clinical trials
Forms the basis for multinomial logistic regression in biostatistical modeling
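
As a minimal sketch, the multinomial probability mass function combines a multinomial coefficient with the product of category probabilities raised to the observed counts. The category labels and probabilities below are hypothetical.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given category counts in sum(counts) trials."""
    n = sum(counts)
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)   # n! / (c1! * c2! * ... * ck!)
    return coefficient * prod(p ** c for p, c in zip(probs, counts))

# Hypothetical outcome probabilities: improved, unchanged, worsened.
probs = [0.5, 0.3, 0.2]

# Probability that three patients fall one into each category.
p_one_each = multinomial_pmf([1, 1, 1], probs)

# With two categories the formula reduces to the binomial distribution.
p_binomial_case = multinomial_pmf([2, 1], [0.3, 0.7])

print(round(p_one_each, 4), round(p_binomial_case, 4))
```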

Probability distribution selection

Selecting the appropriate probability distribution is a critical step in statistical analysis, impacting the validity and power of statistical inferences
Proper distribution selection ensures accurate modeling of biological phenomena and reliable interpretation of research results

Criteria for distribution choice

Nature of the data (discrete vs continuous, bounded vs unbounded) guides initial selection
Theoretical considerations based on the underlying biological process
Empirical assessment through visualization (histograms, Q-Q plots) and summary statistics
Goodness-of-fit tests to formally evaluate distributional assumptions
Practical considerations including ease of interpretation and computational feasibility

Common pitfalls in selection

Automatically assuming normality without proper verification
Overlooking the impact of sample size on distribution appearance
Ignoring the presence of outliers or influential observations
Failing to consider domain-specific knowledge in distribution selection
Overreliance on a single criterion (e.g., p-value from a goodness-of-fit test) for distribution choice
