🫁Intro to Biostatistics Unit 2 Review

2.2 Probability distributions

Written by the Fiveable Content Team • Last updated September 2025

Probability distributions are essential tools in biostatistics, enabling researchers to model and analyze various biological phenomena. Understanding different types of distributions helps in selecting appropriate statistical tests and interpreting results in medical research and clinical trials.

From discrete to continuous, univariate to multivariate, each distribution type serves specific purposes in biostatistical analysis. Properties like central tendency, variability, and shape provide crucial insights into data behavior, guiding researchers in study design and statistical interpretation.

Types of probability distributions

  • Probability distributions form the foundation of statistical inference in biostatistics, enabling researchers to model and analyze various biological phenomena
  • Understanding different types of distributions helps in selecting appropriate statistical tests and interpreting results in medical research and clinical trials

Discrete vs continuous distributions

  • Discrete distributions deal with countable outcomes (whole numbers) common in biostatistical studies (number of patients, disease occurrences)
  • Continuous distributions represent variables that can take any value within a range, often used for measurements in medical research (blood pressure, drug concentration)
  • Discrete distributions use probability mass functions while continuous distributions employ probability density functions
  • Examples of discrete distributions include Binomial (success/failure in clinical trials) and Poisson (rare disease occurrences)
  • Continuous distributions encompass Normal (height, weight) and Exponential (waiting times between events in healthcare)

Univariate vs multivariate distributions

  • Univariate distributions describe the probability of a single random variable, frequently used in basic biostatistical analyses
  • Multivariate distributions model the joint probability of two or more variables, essential for complex medical studies
  • Univariate distributions help analyze individual patient characteristics (age, BMI)
  • Multivariate distributions enable the study of relationships between multiple health factors (blood pressure and cholesterol levels)
  • Covariance and correlation play crucial roles in understanding multivariate distributions in biostatistical research

Properties of distributions

  • Distribution properties provide insights into data behavior, guiding statistical analysis and interpretation in biomedical research
  • Understanding these properties helps researchers choose appropriate statistical methods and make informed decisions in study design

Measures of central tendency

  • Mean represents the average value, widely used in biostatistics to summarize data (average patient age in a clinical trial)
  • Median indicates the middle value, useful for skewed distributions (median survival time in cancer studies)
  • Mode shows the most frequent value, applicable in discrete data analysis (most common side effect in drug trials)
  • Geometric mean calculates the central tendency for data with multiplicative relationships (bacterial growth rates)
  • Harmonic mean is used for averaging rates and ratios in physiological studies (average flow rate when equal volumes are delivered at different rates)
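
The measures above can be compared on small made-up samples. The sketch below uses Python's standard `statistics` module; the length-of-stay values, fold changes, and flow rates are hypothetical and chosen only so the results are easy to check by hand.

```python
import statistics

# Hypothetical right-skewed sample: hospital length of stay in days.
stay_days = [2, 3, 3, 4, 5, 6, 21]

mean_stay = statistics.mean(stay_days)      # pulled upward by the 21-day stay
median_stay = statistics.median(stay_days)  # middle value, less affected by the outlier
mode_stay = statistics.mode(stay_days)      # most frequent value

# Hypothetical fold changes in bacterial count over three periods.
fold_changes = [2.0, 4.0, 8.0]
geo_mean = statistics.geometric_mean(fold_changes)

# Hypothetical flow rates (mL/min) for two equal volumes.
rates = [40.0, 60.0]
harm_mean = statistics.harmonic_mean(rates)

print(mean_stay, median_stay, mode_stay, geo_mean, harm_mean)
```

In this sample the mean (about 6.3 days) sits well above the median (4 days), which is the usual pattern for right-skewed data.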

Measures of variability

  • Standard deviation quantifies the spread of data around the mean, crucial for assessing variability in medical measurements
  • Variance, the squared standard deviation, used in statistical tests and ANOVA in biomedical research
  • Range provides a simple measure of spread, indicating the difference between the highest and lowest values in a dataset
  • Interquartile range (IQR) measures spread in the middle 50% of data, robust to outliers in clinical data
  • Coefficient of variation (CV) allows comparison of variability between different scales, useful in comparing lab test precision
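
As a rough illustration of these measures, the sketch below computes each one for a hypothetical set of glucose readings. It assumes the sample (n - 1) versions of the standard deviation and variance, and the "inclusive" quartile method from the standard library; other software may use different quartile conventions.

```python
import statistics

# Hypothetical fasting glucose readings in mmol/L.
glucose = [4.0, 5.0, 5.0, 6.0, 10.0]

sd = statistics.stdev(glucose)        # sample standard deviation (n - 1 denominator)
var = statistics.variance(glucose)    # sample variance, equal to sd squared
data_range = max(glucose) - min(glucose)

# Quartiles; the IQR is the distance between the first and third quartile.
q1, q2, q3 = statistics.quantiles(glucose, n=4, method="inclusive")
iqr = q3 - q1

# Coefficient of variation: SD relative to the mean (unitless).
cv = sd / statistics.mean(glucose)

print(sd, var, data_range, iqr, cv)
```

The single high reading of 10.0 widens the range to 6.0 while the IQR stays at 1.0, which shows why the IQR is described as robust to outliers.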

Skewness and kurtosis

  • Skewness measures the asymmetry of a distribution, important for identifying non-normal data in biostatistics
  • Positive skew indicates a long right tail (rare but extreme high values in drug response studies)
  • Negative skew shows a long left tail (occasional very low values in physiological measurements)
  • Kurtosis quantifies the "tailedness" of a distribution, affecting the reliability of statistical tests
  • Leptokurtic distributions have heavier tails than the normal distribution, often seen in gene expression data
  • Platykurtic distributions have lighter tails than the normal distribution, sometimes observed in anthropometric measurements
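
One way to make skewness and kurtosis concrete is to compute them from central moments. The sketch below assumes the simple moment-based (population) definitions, with kurtosis reported as excess kurtosis so that a normal distribution scores 0; statistical packages often apply small-sample corrections that this sketch omits. The data are hypothetical.

```python
def central_moment(xs, k):
    """k-th central moment using the population (divide by n) convention."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)

def skewness(xs):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Moment-based excess kurtosis: m4 / m2^2 - 3."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric sample
right_skewed = [1, 1, 1, 2, 10]  # hypothetical sample with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

The symmetric sample has zero skewness and negative excess kurtosis (platykurtic), while the second sample has positive skewness.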

Discrete probability distributions

  • Discrete distributions model countable outcomes in biostatistical research, essential for analyzing categorical data and event counts
  • These distributions play a crucial role in designing and interpreting results from clinical trials and epidemiological studies

Bernoulli distribution

  • Models a single trial with two possible outcomes (success or failure)
  • Probability mass function given by $P(X=x) = p^x(1-p)^{1-x}$, where $x$ is 0 or 1
  • Used in modeling presence/absence of a disease or treatment response in individual patients
  • Mean of Bernoulli distribution equals p, the probability of success
  • Variance calculated as $p(1-p)$, important for determining sample size in clinical trials

Binomial distribution

  • Represents the number of successes in a fixed number of independent Bernoulli trials
  • Probability mass function $P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$
  • Widely used in clinical trials to model the number of patients responding to a treatment
  • Mean of binomial distribution is np, where n is the number of trials
  • Variance given by $np(1-p)$, crucial for power calculations in study design
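
The binomial formulas above can be checked numerically. This is a minimal sketch using only the standard library; the trial size (10 patients) and response probability (0.3) are hypothetical, and the helper name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical: 10 patients, 30% response probability

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)                                    # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))      # should match n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # n * p * (1 - p)

print(total, mean, variance)
```

Setting `n=1` recovers the Bernoulli case: `binomial_pmf(1, 1, p)` is just `p`.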

Poisson distribution

  • Models the number of events occurring in a fixed interval of time or space
  • Probability mass function $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$
  • Applied in rare disease occurrence studies and modeling adverse events in drug safety
  • Mean and variance both equal to λ, the rate parameter
  • Approximates the binomial distribution when n is large and p is small, with λ = np
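
To see the Poisson approximation at work, the sketch below compares the two probability mass functions for a hypothetical rare event (n = 1000 people, p = 0.002 per person, so λ = np = 2). The helper names are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k * exp(-lam) / factorial(k)

n, p = 1000, 0.002  # hypothetical: large n, small p
lam = n * p         # matching Poisson rate

# Largest pointwise gap between the two PMFs over k = 0..10.
max_diff = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))

print(poisson_pmf(0, lam), binomial_pmf(0, n, p), max_diff)
```

The gap between the two PMFs is small here; it grows if p is made larger while np is held fixed.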

Negative binomial distribution

  • Describes the number of failures before a specified number of successes occur
  • Used in modeling the number of disease-free days before a relapse in chronic conditions
  • Probability mass function $P(X=k) = \binom{k+r-1}{k}p^r(1-p)^k$
  • Mean given by $\frac{r(1-p)}{p}$, where r is the required number of successes and p is the success probability
  • Variance calculated as $\frac{r(1-p)}{p^2}$, often used in overdispersed count data analysis
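
The mean and variance formulas can be verified by summing over the PMF. The sketch below uses the "failures before the r-th success" parameterization given above, with hypothetical values r = 3 and p = 0.4. The sum is cut off at 400 terms, which is an assumption that the remaining tail is negligible for these values.

```python
from math import comb

def neg_binomial_pmf(k, r, p):
    """P(X = k): probability of k failures before the r-th success."""
    return comb(k + r - 1, k) * p ** r * (1 - p) ** k

r, p = 3, 0.4  # hypothetical parameters

# Truncate the infinite support; terms beyond k = 400 are vanishingly small here.
pmf = [neg_binomial_pmf(k, r, p) for k in range(400)]
total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))                    # r(1-p)/p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # r(1-p)/p^2

print(total, mean, variance)
```

The variance (11.25) exceeds the mean (4.5), which is the overdispersion that a Poisson model, with equal mean and variance, cannot capture.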

Continuous probability distributions

  • Continuous distributions model variables that can take any value within a range, crucial for analyzing measurements in biomedical research
  • These distributions underpin many statistical tests and models used in biostatistics, from t-tests to regression analysis

Normal distribution

  • Symmetric, bell-shaped distribution fundamental to many statistical methods in biostatistics
  • Probability density function $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • Characterized by mean (μ) and standard deviation (σ)
  • Central to the Central Limit Theorem, justifying many parametric tests in large samples
  • Z-scores derived from normal distribution used for standardizing and comparing different scales
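
A short sketch of z-scores and normal probabilities, using `statistics.NormalDist` from the standard library. The systolic blood pressure parameters (mean 120, SD 15 mmHg) are hypothetical and used only for illustration.

```python
from statistics import NormalDist

# Hypothetical population: systolic blood pressure ~ Normal(120, 15) in mmHg.
sbp = NormalDist(mu=120, sigma=15)

reading = 150
z = (reading - sbp.mean) / sbp.stdev  # standardized distance from the mean

p_above = 1 - sbp.cdf(reading)             # P(X > 150)
within_1sd = sbp.cdf(135) - sbp.cdf(105)   # P(105 < X < 135)

print(z, p_above, within_1sd)
```

A reading of 150 is two standard deviations above the mean, so only about 2.3% of this hypothetical population would exceed it, and roughly 68% fall within one standard deviation of the mean.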

Student's t-distribution

  • Similar to normal distribution but with heavier tails, crucial for small sample inference
  • Used in t-tests and confidence intervals for means in biomedical studies
  • Shape determined by degrees of freedom, approaching normal distribution as df increases
  • Probability density function involves gamma functions and is more complex than normal
  • Critical in analyzing small sample sizes common in early-phase clinical trials

Chi-square distribution

  • Arises from the sum of squared independent standard normal variables
  • Degrees of freedom, equal to the number of variables summed, determine the shape of the distribution
  • Used in goodness-of-fit tests and analysis of categorical data in epidemiology
  • Forms the basis for tests of independence in contingency tables (e.g., case-control studies)
  • Plays a role in confidence intervals for population variance in laboratory studies

F-distribution

  • Ratio of two independent chi-square variables, each divided by its degrees of freedom
  • Fundamental to Analysis of Variance (ANOVA), widely used in comparing multiple groups
  • Shape determined by two parameters: degrees of freedom for numerator and denominator
  • Critical in assessing the significance of added variables in multiple regression models
  • Used in testing equality of variances (e.g., assessing homogeneity in multi-center trials)

Sampling distributions

  • Sampling distributions describe the behavior of sample statistics, crucial for inferential statistics in biomedical research
  • Understanding these distributions enables researchers to make inferences about population parameters from sample data

Distribution of sample mean

  • Describes the variability of sample means across different samples from the same population
  • For normal populations, sample mean follows a normal distribution regardless of sample size
  • Standard error of the mean (SEM) quantifies the standard deviation of the sampling distribution
  • SEM decreases as sample size increases, improving precision of estimates in larger studies
  • Forms the basis for constructing confidence intervals for population means in clinical research

Central limit theorem

  • States that the sampling distribution of the mean approaches a normal distribution as sample size increases
  • Applies regardless of the shape of the underlying population distribution, provided observations are independent and the population variance is finite
  • Crucial for justifying the use of parametric tests in large samples, even for non-normal data
  • A common rule of thumb treats a sample size of about 30 or more as adequate, though strongly skewed or heavy-tailed data may need larger samples
  • Enables the use of z-scores and normal probabilities in inferential statistics
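
The theorem can be illustrated with a small simulation. The sketch below repeatedly samples from a strongly right-skewed exponential population (mean 1, SD 1) and looks at how the sample means behave. The seed, sample size, and number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(2025)  # fixed seed so the sketch is reproducible
n = 40                     # observations per sample
n_samples = 2000           # number of repeated samples

# Draw repeated samples from an exponential population and record each mean.
sample_means = []
for _ in range(n_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)  # expected to be near 1
sd_of_means = statistics.stdev(sample_means)    # expected to be near 1 / sqrt(n)

print(mean_of_means, sd_of_means, 1 / n ** 0.5)
```

Even though individual observations are far from normal, the sample means cluster around the population mean with a spread close to σ/√n, about 0.158 here.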

Standard error

  • Measures the variability of a sample statistic (e.g., mean, proportion) across different samples
  • Calculated as the standard deviation of the sampling distribution
  • For means, standard error = $\frac{\sigma}{\sqrt{n}}$, where σ is the population standard deviation and n is the sample size
  • Decreases as sample size increases, reflecting increased precision in larger studies
  • Used in constructing confidence intervals and conducting hypothesis tests in biostatistics
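
The sketch below shows how the standard error shrinks with sample size and how it feeds into a confidence interval. The SD of 10 and mean of 200 are hypothetical. The interval uses the normal critical value, which assumes σ is known or the sample is large; with a small sample and an estimated SD, a t critical value would be the usual choice.

```python
import math
from statistics import NormalDist

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sem_small = standard_error(10.0, 25)   # n = 25
sem_large = standard_error(10.0, 100)  # n = 100: four times the data, half the SEM

# 95% confidence interval for a hypothetical sample mean of 200 with n = 100.
z = NormalDist().inv_cdf(0.975)  # about 1.96
mean = 200.0
ci = (mean - z * sem_large, mean + z * sem_large)

print(sem_small, sem_large, ci)
```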

Probability density functions

  • Probability density functions (PDFs) and their discrete counterparts are fundamental tools for describing and analyzing probability distributions in biostatistics
  • These functions enable the calculation of probabilities and form the basis for many statistical inference techniques

Probability mass function

  • Describes the probability distribution for discrete random variables
  • Gives the probability that a discrete random variable equals a specific value
  • Sum of probabilities over all possible values equals 1
  • Used in modeling count data (number of adverse events, disease occurrences)
  • Forms the basis for likelihood calculations in discrete data analysis

Cumulative distribution function

  • Represents the probability that a random variable takes a value less than or equal to a given value
  • Applies to both discrete and continuous distributions
  • For continuous distributions, CDF is the integral of the probability density function
  • Used in calculating percentiles and quantiles in biostatistical analyses
  • Critical in survival analysis for estimating probabilities of events occurring by certain times
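
As a sketch of how a CDF is used in practice, the example below builds a discrete CDF by accumulating a binomial PMF, then uses a continuous normal CDF and its inverse to find a probability and a percentile. The binomial parameters and the BMI distribution (mean 27, SD 4) are hypothetical.

```python
from itertools import accumulate
from math import comb
from statistics import NormalDist

# Discrete case: the CDF is the running sum of the PMF.
n, p = 10, 0.3
pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
cdf = list(accumulate(pmf))  # cdf[k] = P(X <= k)
p_at_most_2 = cdf[2]

# Continuous case: hypothetical BMI ~ Normal(27, 4).
bmi = NormalDist(mu=27, sigma=4)
p_below_30 = bmi.cdf(30)           # P(BMI <= 30)
percentile_90 = bmi.inv_cdf(0.90)  # value below which 90% of the distribution lies

print(p_at_most_2, p_below_30, percentile_90)
```

The inverse CDF (quantile function) answers the reverse question: given a probability, which value cuts off that share of the distribution.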

Applications in biostatistics

  • Probability distributions find extensive applications in various areas of biostatistics, from epidemiology to clinical trials
  • Understanding these applications helps researchers choose appropriate statistical methods and interpret results accurately

Disease prevalence estimation

  • Binomial distribution used to model the number of disease cases in a population sample
  • Normal approximation to binomial enables confidence interval calculation for large samples
  • Beta distribution often used as a prior in Bayesian estimation of disease prevalence
  • Poisson distribution applied in rare disease prevalence studies
  • Negative binomial distribution useful for overdispersed count data in disease mapping
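
The normal approximation to the binomial leads to the simple (Wald) confidence interval for a prevalence. The sketch below is a minimal version with hypothetical survey numbers (30 cases among 400 people); the function name `wald_interval` is illustrative. This interval can perform poorly when the sample is small or the prevalence is close to 0 or 1, so it should be read as a large-sample approximation only.

```python
import math
from statistics import NormalDist

def wald_interval(cases, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = cases / n
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical survey: 30 cases among 400 sampled individuals.
lower, upper = wald_interval(cases=30, n=400)

print(lower, upper)
```

For these numbers the estimated prevalence is 7.5%, with an approximate 95% interval of about 4.9% to 10.1%.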

Clinical trial outcomes

  • Bernoulli trials model individual patient outcomes (success/failure) in clinical trials
  • Binomial distribution describes the number of successes in fixed-size trials
  • Normal distribution approximates treatment effects in large randomized controlled trials
  • Student's t-distribution used for small sample inference in early-phase trials
  • Survival distributions (Weibull, exponential) model time-to-event outcomes in long-term studies

Survival analysis

  • Exponential distribution models constant hazard rates in survival studies
  • Weibull distribution allows for increasing or decreasing hazard rates over time
  • Log-normal distribution used for modeling survival times with early peak hazard
  • Gamma distribution provides flexible modeling of survival times in complex scenarios
  • Cox proportional hazards model estimates covariate effects through a partial likelihood, leaving the baseline hazard distribution unspecified
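
The difference between constant, increasing, and decreasing hazards can be seen directly from the Weibull hazard function. The sketch below assumes the common shape/scale parameterization, $h(t) = \frac{k}{s}\left(\frac{t}{s}\right)^{k-1}$, with hypothetical parameter values; a shape of 1 reduces to the exponential case.

```python
import math

def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape/scale) * (t/scale)^(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))

times = [1.0, 2.0, 4.0]  # hypothetical follow-up times (years)

increasing = [weibull_hazard(t, shape=1.5, scale=2.0) for t in times]  # shape > 1
decreasing = [weibull_hazard(t, shape=0.5, scale=2.0) for t in times]  # shape < 1
constant = [weibull_hazard(t, shape=1.0, scale=2.0) for t in times]    # exponential

print(increasing, decreasing, constant)
print(weibull_survival(2.0, shape=1.0, scale=2.0))
```

With shape 1 and scale 2 the hazard is a constant 0.5 per year, matching an exponential distribution with rate 0.5.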

Transformations of distributions

  • Transformations help in dealing with non-normal data, enabling the use of parametric methods and improving model fit in biostatistical analyses
  • Understanding these transformations is crucial for handling skewed or heteroscedastic data common in biomedical research

Log-normal distribution

  • Arises when the logarithm of a variable follows a normal distribution
  • Often used for modeling biological variables with positive skew (drug concentrations, antibody levels)
  • Probability density function involves the natural logarithm of the variable
  • Geometric mean and geometric standard deviation are key parameters
  • Useful in pharmacokinetic studies and modeling growth rates in microbiology
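
The geometric mean and geometric standard deviation come from working on the log scale and transforming back. The sketch below uses three hypothetical drug concentrations spaced by factors of 10 so the results are easy to verify.

```python
import math
import statistics

# Hypothetical drug concentrations (ng/mL) with strong positive skew.
conc = [1.0, 10.0, 100.0]

# Work on the log scale, where log-normal data look normal.
log_conc = [math.log(x) for x in conc]

geo_mean = math.exp(statistics.fmean(log_conc))  # back-transformed mean of logs
geo_sd = math.exp(statistics.stdev(log_conc))    # back-transformed SD of logs
arith_mean = statistics.fmean(conc)

print(geo_mean, geo_sd, arith_mean)
```

The geometric mean (10) sits far below the arithmetic mean (37), which is dominated by the largest value. The geometric SD is a multiplicative factor rather than an additive distance.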

Box-Cox transformation

  • Family of power transformations to approximate normal distribution
  • Includes logarithmic transformation as a special case
  • Formula: $y(\lambda) = \frac{y^\lambda - 1}{\lambda}$ for λ ≠ 0, and $\log(y)$ for λ = 0 (defined for y > 0)
  • Optimal λ chosen to maximize normality, often through maximum likelihood estimation
  • Applied in regression analysis to stabilize variance and improve model fit in biomedical data
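
A minimal sketch of the transformation itself is shown below. It covers only the formula, not the estimation of λ, and the function name `box_cox` is illustrative.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a single positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)         # limiting case as lambda approaches 0
    return (y ** lam - 1.0) / lam  # power transformation

print(box_cox(5.0, 1.0))  # lambda = 1: a shift only, shape unchanged
print(box_cox(9.0, 0.5))  # lambda = 0.5: square-root-type transformation
print(box_cox(5.0, 0.0))  # lambda = 0: natural log
```

The λ = 0 case is the limit of the power formula, so values of λ very close to 0 give results very close to the log transform.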

Goodness-of-fit tests

  • Goodness-of-fit tests assess how well observed data conform to a theoretical probability distribution
  • These tests are crucial in validating distributional assumptions underlying many statistical methods in biostatistics

Kolmogorov-Smirnov test

  • Non-parametric test comparing the cumulative distribution of sample data to a reference distribution
  • Calculates the maximum distance between empirical and theoretical cumulative distributions
  • Sensitive to differences in both location and shape of the distributions
  • Used for testing normality and other continuous distributions in biomedical data
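  • Limitations include low sensitivity to differences in the tails, poor suitability for discrete distributions, and standard critical values that are not valid when the reference distribution's parameters are estimated from the same sample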
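
The test statistic, the maximum distance between the empirical and theoretical CDFs, can be computed in a few lines. The sketch below computes only the statistic D against a fully specified reference distribution; it does not compute a p-value. The function name `ks_statistic` and the sample values are illustrative.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Maximum distance between the empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

reference = NormalDist(mu=0, sigma=1)

# Hypothetical sample placed exactly at evenly spaced normal quantiles.
n = 20
well_fitting = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
shifted = [x + 1.0 for x in well_fitting]  # same shape, location shifted by 1

print(ks_statistic(well_fitting, reference.cdf))
print(ks_statistic(shifted, reference.cdf))
```

The well-fitting sample gives the smallest possible D for 20 points (0.025), while shifting the location by one unit produces a much larger D.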

Anderson-Darling test

  • Modification of the Kolmogorov-Smirnov test with greater sensitivity to tail differences
  • Gives more weight to the tails of the distribution in the test statistic calculation
  • Often preferred for testing normality in biostatistical applications
  • Typically more powerful than Kolmogorov-Smirnov for detecting departures from normality, especially in the tails
  • Critical values depend on the specific distribution being tested

Multivariate distributions

  • Multivariate distributions model the joint behavior of two or more random variables, essential for analyzing complex relationships in biomedical data
  • Understanding these distributions is crucial for advanced statistical techniques like multivariate regression and factor analysis

Bivariate normal distribution

  • Extension of univariate normal distribution to two dimensions
  • Characterized by means, standard deviations, and correlation coefficient of two variables
  • Probability density function involves a complex exponential term with covariance matrix
  • Contours of equal probability form ellipses in the two-dimensional plane
  • Used in modeling paired measurements (systolic and diastolic blood pressure)
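
The bivariate normal density can be written out directly in terms of the two means, two standard deviations, and the correlation. The sketch below assumes the standard form of the density; the blood pressure parameters are hypothetical and the function name is illustrative.

```python
import math
from statistics import NormalDist

def bivariate_normal_pdf(x, y, mu_x, mu_y, sd_x, sd_y, rho):
    """Density of a bivariate normal distribution at (x, y), for |rho| < 1."""
    zx = (x - mu_x) / sd_x
    zy = (y - mu_y) / sd_y
    one_minus_rho2 = 1.0 - rho ** 2
    norm_const = 2.0 * math.pi * sd_x * sd_y * math.sqrt(one_minus_rho2)
    quad = (zx ** 2 - 2.0 * rho * zx * zy + zy ** 2) / (2.0 * one_minus_rho2)
    return math.exp(-quad) / norm_const

# Hypothetical systolic/diastolic pressures with correlation 0.6.
density_at_means = bivariate_normal_pdf(120, 80, 120, 80, 15, 10, 0.6)

# With rho = 0 the joint density factors into the two univariate densities.
joint_independent = bivariate_normal_pdf(130, 85, 120, 80, 15, 10, 0.0)
product = NormalDist(120, 15).pdf(130) * NormalDist(80, 10).pdf(85)

print(density_at_means, joint_independent, product)
```

The quadratic form in the exponent is what produces elliptical contours: points with the same value of that expression have the same density.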

Multinomial distribution

  • Generalization of the binomial distribution to multiple categories
  • Models the probability of counts in several categories with fixed total count
  • Probability mass function involves multinomial coefficients and category probabilities
  • Applied in analyzing multi-category outcomes in clinical trials
  • Forms the basis for multinomial logistic regression in biostatistical modeling
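
The multinomial PMF combines a multinomial coefficient with the category probabilities, $P(X_1=x_1,\dots,X_k=x_k) = \frac{n!}{x_1!\cdots x_k!}\,p_1^{x_1}\cdots p_k^{x_k}$. The sketch below implements it with the standard library; the category counts and probabilities are hypothetical and the function name is illustrative.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given category counts in sum(counts) trials."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # exact integer multinomial coefficient
    return coef * prod(p ** c for p, c in zip(probs, counts))

# Hypothetical trial outcome: 4 patients classified as improved / unchanged / worse.
p_outcome = multinomial_pmf((2, 1, 1), (0.5, 0.3, 0.2))

# With two categories the multinomial reduces to the binomial.
two_category = multinomial_pmf((3, 7), (0.3, 0.7))
binomial = comb(10, 3) * 0.3 ** 3 * 0.7 ** 7

print(p_outcome, two_category, binomial)
```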

Probability distribution selection

  • Selecting the appropriate probability distribution is a critical step in statistical analysis, impacting the validity and power of statistical inferences
  • Proper distribution selection ensures accurate modeling of biological phenomena and reliable interpretation of research results

Criteria for distribution choice

  • Nature of the data (discrete vs continuous, bounded vs unbounded) guides initial selection
  • Theoretical considerations based on the underlying biological process
  • Empirical assessment through visualization (histograms, Q-Q plots) and summary statistics
  • Goodness-of-fit tests to formally evaluate distributional assumptions
  • Practical considerations including ease of interpretation and computational feasibility

Common pitfalls in selection

  • Automatically assuming normality without proper verification
  • Overlooking the impact of sample size on distribution appearance
  • Ignoring the presence of outliers or influential observations
  • Failing to consider domain-specific knowledge in distribution selection
  • Overreliance on a single criterion (e.g., p-value from a goodness-of-fit test) for distribution choice