Probability theory forms the backbone of causal inference, providing tools to quantify uncertainty and make informed decisions. It introduces key concepts like probability distributions, independence, and conditional probability, which are essential for understanding cause-and-effect relationships.
Mastering probability theory enables researchers to model complex scenarios, estimate causal effects, and assess the strength of evidence. From basic axioms to advanced concepts like Bayes' theorem and limit theorems, probability theory equips us with the necessary framework to tackle causal inference challenges.
Basics of probability
- Probability is a fundamental concept in statistics and causal inference that quantifies the likelihood of an event occurring
- Understanding probability is crucial for making inferences about cause-and-effect relationships and assessing the strength of evidence for causal claims
Probability axioms
- Non-negativity: Probability of an event is always greater than or equal to 0, $P(A) \geq 0$
- Normalization: Probability of the entire sample space is equal to 1, $P(S) = 1$
- Additivity: If events A and B are mutually exclusive, then $P(A \cup B) = P(A) + P(B)$
- Complementary events: The probabilities of an event A and its complement A' sum to 1, $P(A) + P(A') = 1$ (a consequence of the axioms above)
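As a quick numerical check, the sketch below verifies these properties for a fair six-sided die, assigning each outcome probability 1/6 (the events `A` and `B` are illustrative choices):

```python
# Fair six-sided die: each outcome has probability 1/6.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
P = {outcome: Fraction(1, 6) for outcome in S}

def prob(event):
    """Probability of an event (a subset of the sample space)."""
    return sum(P[outcome] for outcome in event)

A = {2, 4, 6}   # rolling an even number
B = {1, 3}      # rolling a 1 or a 3 (disjoint from A)

assert prob(A) >= 0                      # non-negativity
assert prob(S) == 1                      # normalization
assert prob(A | B) == prob(A) + prob(B)  # additivity for mutually exclusive events
assert prob(A) + prob(S - A) == 1        # complement rule
```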
Sample spaces and events
- Sample space (S) is the set of all possible outcomes of a random experiment (coin toss, rolling a die)
- An event (A) is a subset of the sample space, representing a specific outcome or group of outcomes (getting heads, rolling an even number)
- Events can be simple (a single outcome) or compound (a combination of outcomes)
- Mutually exclusive events cannot occur simultaneously (rolling a 1 and rolling a 6 on a single die roll)
Conditional probability
- Conditional probability $P(A|B)$ is the probability of event A occurring given that event B has already occurred
- Calculated as $P(A|B) = \frac{P(A \cap B)}{P(B)}$ (defined when $P(B) > 0$), where $P(A \cap B)$ is the probability of both A and B occurring; see the worked example after this list
- Allows for updating probabilities based on new information or evidence
- Helps in understanding the dependence between events and is crucial for causal inference
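A minimal worked example, reusing the fair-die setup from the sketch above: conditioning on B = "the roll is greater than 3" changes the probability that the roll is even.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
P = {outcome: Fraction(1, 6) for outcome in S}

def prob(event):
    return sum(P[outcome] for outcome in event)

A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is greater than 3

# P(A | B) = P(A and B) / P(B), defined when P(B) > 0
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)   # 2/3: of the outcomes {4, 5, 6}, two are even
print(prob(A))       # 1/2: knowing that B occurred changed the probability of A
```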
Probability distributions
- A probability distribution is a function that describes the likelihood of the different possible values of a random variable
- Probability distributions are essential for modeling uncertainty and variability in causal inference
Discrete probability distributions
- Discrete random variables have a countable number of possible outcomes (number of defective items in a batch)
- Probability mass function (PMF) assigns probabilities to each possible outcome
- Examples include Bernoulli, binomial, and Poisson distributions
Continuous probability distributions
- Continuous random variables can take on any value within a specified range (height, weight)
- Probability density function (PDF) describes the relative likelihood of different values
- Examples include normal, exponential, and uniform distributions
- Probabilities are calculated using integrals of the PDF over a given range
Joint probability distributions
- Joint probability distribution gives the probability of two or more random variables taking particular values together
- Denoted as $P(X, Y)$ for random variables X and Y
- Allows for modeling the dependence between multiple variables
- Marginal and conditional probabilities can be derived from the joint distribution
Marginal probability distributions
- Marginal probability distribution is the probability distribution of a single random variable, ignoring the others
- Obtained by summing (discrete) or integrating (continuous) the joint distribution over the other variables
- Provides information about the individual behavior of a random variable
- Useful for simplifying complex joint distributions and focusing on specific variables of interest
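The short sketch below illustrates a joint distribution for two binary variables and how marginal and conditional distributions are recovered from it; the joint probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) for binary X (rows) and Y (columns).
joint = np.array([[0.30, 0.10],    # X = 0
                  [0.20, 0.40]])   # X = 1

marginal_x = joint.sum(axis=1)     # P(X): sum over Y  -> [0.40, 0.60]
marginal_y = joint.sum(axis=0)     # P(Y): sum over X  -> [0.50, 0.50]

# Conditional distribution P(Y | X = 1) = P(X = 1, Y) / P(X = 1)
cond_y_given_x1 = joint[1] / marginal_x[1]   # -> [1/3, 2/3]

print(marginal_x, marginal_y, cond_y_given_x1)
```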
Independence and dependence
- Independence and dependence describe the relationship between events or random variables
- Understanding these concepts is crucial for correctly modeling and interpreting causal relationships
Independent events
- Events A and B are independent if the occurrence of one does not affect the probability of the other
- Mathematically, $P(A|B) = P(A)$ and $P(B|A) = P(B)$
- For independent events, the joint probability is the product of the individual probabilities, $P(A \cap B) = P(A) \times P(B)$
- Example: Flipping a fair coin twice, the outcome of the second flip is independent of the first
Dependent events
- Events A and B are dependent if the occurrence of one affects the probability of the other
- Mathematically, $P(A|B) \neq P(A)$ or $P(B|A) \neq P(B)$
- The joint probability of dependent events is not equal to the product of their individual probabilities
- Example: Drawing cards from a deck without replacement, the probability of drawing a specific card changes after each draw
Conditional independence
- Events A and B are conditionally independent given event C if $P(A|B,C) = P(A|C)$ and $P(B|A,C) = P(B|C)$
- Conditional independence implies that once we know the outcome of C, the occurrence of A does not provide any additional information about B, and vice versa
- Plays a crucial role in causal inference, as it helps in identifying confounding factors and estimating causal effects
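The simulation below sketches these three ideas at once, using a made-up common-cause structure: C influences both A and B, so A and B are marginally dependent, yet within each level of C they are (approximately) independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Common cause C influences both A and B; A and B have no direct link.
C = rng.random(n) < 0.5
A = rng.random(n) < np.where(C, 0.8, 0.2)
B = rng.random(n) < np.where(C, 0.7, 0.3)

# Marginal dependence: P(A | B) differs from P(A)
print(A.mean(), A[B].mean())

# Conditional independence: within each level of C, P(A | B, C) ~ P(A | C)
for c in (False, True):
    mask = C == c
    print(A[mask].mean(), A[mask & B].mean())
```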
Bayes' theorem
- Bayes' theorem is a fundamental rule in probability theory that describes how to update probabilities based on new evidence
- It is named after the Reverend Thomas Bayes, an 18th-century British statistician and Presbyterian minister
Bayes' rule
- Bayes' rule states that the probability of an event A given event B is equal to the probability of event B given A, multiplied by the probability of A, divided by the probability of B
- Mathematically, $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
- Allows for updating prior probabilities (before observing evidence) to posterior probabilities (after observing evidence)
- Example: In medical testing, Bayes' rule can be used to calculate the probability of a patient having a disease given a positive test result
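A minimal worked version of the medical-testing example, with hypothetical numbers for prevalence, sensitivity, and specificity:

```python
# Hypothetical medical test: how likely is disease given a positive result?
prevalence = 0.01      # P(disease)
sensitivity = 0.95     # P(positive | disease)
specificity = 0.90     # P(negative | no disease)

# Total probability of a positive test
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' rule: P(disease | positive)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(round(p_disease_given_positive, 3))   # ~0.088: still unlikely despite the positive test
```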
Prior vs posterior probabilities
- Prior probability $P(A)$ is the initial probability of an event A before observing any evidence
- Posterior probability $P(A|B)$ is the updated probability of event A after observing evidence B
- Bayes' rule provides a way to calculate the posterior probability by combining the prior probability with the likelihood of the evidence
- Example: Prior probability of a patient having a disease based on population prevalence, updated to a posterior probability after a positive test result
Bayesian inference
- Bayesian inference is a method of statistical inference that uses Bayes' theorem to update probabilities as more evidence becomes available
- Involves specifying a prior distribution for the parameters of interest, then updating it with observed data to obtain a posterior distribution
- Allows for incorporating prior knowledge and beliefs into the analysis
- Widely used in causal inference for estimating causal effects, handling missing data, and assessing the sensitivity of results to assumptions
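A small sketch of a Bayesian update using the conjugate Beta-Binomial model (the prior parameters and data are illustrative): a Beta(a, b) prior on a success probability becomes a Beta(a + successes, b + failures) posterior after observing the data.

```python
from scipy import stats

# Prior belief about a success probability p: Beta(2, 2), centered at 0.5
a_prior, b_prior = 2, 2

# Observed data: 7 successes in 10 trials
successes, trials = 7, 10

# Conjugate update: posterior is Beta(a + successes, b + failures)
a_post = a_prior + successes
b_post = b_prior + (trials - successes)

posterior = stats.beta(a_post, b_post)
print(posterior.mean())          # posterior mean = 9/14, about 0.643
print(posterior.interval(0.95))  # 95% credible interval
```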
Expectation and variance
- Expectation and variance are two fundamental concepts in probability theory that describe the central tendency and variability of a random variable
- They are essential for summarizing and comparing probability distributions in causal inference
Expected value
- The expected value (or mean) of a random variable X, denoted as $E(X)$, is the probability-weighted average of its possible values
- For a discrete random variable, $E(X) = \sum_{x} x \times P(X=x)$, where $x$ are the possible values of X
- For a continuous random variable, $E(X) = \int_{-\infty}^{\infty} x \times f(x) dx$, where $f(x)$ is the probability density function
- Represents the long-run average value of the random variable if the experiment is repeated many times
Variance and standard deviation
- Variance, denoted as $Var(X)$ or $\sigma^2$, measures the average squared deviation of a random variable X from its expected value
- Calculated as $Var(X) = E[(X - E(X))^2]$
- Standard deviation, denoted as $\sigma$, is the square root of the variance and expresses the typical deviation from the mean in the same units as X
- Both variance and standard deviation quantify the spread or dispersion of a probability distribution
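A quick sketch computing the expected value, variance, and standard deviation of a fair die roll directly from the definitions:

```python
import numpy as np

# Fair six-sided die: outcomes and their probabilities
x = np.array([1, 2, 3, 4, 5, 6])
p = np.full(6, 1 / 6)

mean = np.sum(x * p)                    # E(X) = 3.5
variance = np.sum((x - mean) ** 2 * p)  # Var(X) = E[(X - E(X))^2], about 2.917
std_dev = np.sqrt(variance)             # sigma, about 1.708

print(mean, variance, std_dev)
```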
Covariance and correlation
- Covariance, denoted as $Cov(X,Y)$, measures the joint variability of two random variables X and Y
- Calculated as $Cov(X,Y) = E[(X - E(X))(Y - E(Y))]$
- A positive covariance indicates that X and Y tend to increase or decrease together, while a negative covariance suggests an inverse relationship
- Correlation, denoted as $\rho(X,Y)$, is a standardized version of covariance that ranges from -1 to 1
- Calculated as $\rho(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}$, where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y
- Correlation measures the strength and direction of the linear relationship between two variables
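A brief sketch estimating covariance and correlation from simulated data; the data-generating model (Y as a noisy linear function of X) is chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X and a noisy linear function of X, so they are positively related
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(scale=0.5, size=10_000)

cov_xy = np.cov(x, y)[0, 1]        # sample covariance, close to 2
corr_xy = np.corrcoef(x, y)[0, 1]  # sample correlation, close to 0.97

print(cov_xy, corr_xy)
```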
Common probability distributions
- Probability distributions are mathematical functions that describe the likelihood of different outcomes for a random variable
- Understanding common probability distributions is essential for modeling and analyzing data in causal inference
Bernoulli and binomial distributions
- Bernoulli distribution models a single trial with two possible outcomes (success or failure), with a fixed probability of success $p$
- Probability mass function: $P(X=1) = p$ and $P(X=0) = 1-p$
- Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials
- Probability mass function: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$, where $n$ is the number of trials and $k$ is the number of successes
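A short sketch evaluating the Bernoulli and binomial PMFs with `scipy.stats`; the values of n, p, and k are arbitrary.

```python
from scipy import stats

n, p = 10, 0.3                    # 10 independent trials, success probability 0.3

print(stats.bernoulli.pmf(1, p))  # P(X = 1) for a single trial = 0.3
print(stats.binom.pmf(3, n, p))   # P(exactly 3 successes in 10 trials), about 0.267
print(stats.binom.cdf(3, n, p))   # P(at most 3 successes), about 0.650
```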
Poisson distribution
- Models the number of events occurring in a fixed interval of time or space, given a constant average rate of occurrence
- Probability mass function: $P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}$, where $\lambda$ is the average rate of occurrence
- Often used to model rare events, such as the number of defects in a manufacturing process or the number of accidents in a given time period
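A quick sketch of the Poisson PMF for an illustrative average rate of 2 events per interval:

```python
from scipy import stats

lam = 2.0                             # average rate of occurrence per interval

print(stats.poisson.pmf(0, lam))      # P(no events) = e^-2, about 0.135
print(stats.poisson.pmf(3, lam))      # P(exactly 3 events), about 0.180
print(1 - stats.poisson.cdf(4, lam))  # P(more than 4 events), about 0.053
```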
Normal distribution
- Also known as the Gaussian distribution, it is a continuous probability distribution that is symmetric and bell-shaped
- Probability density function: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean and $\sigma$ is the standard deviation
- Many natural phenomena and measurement errors follow a normal distribution
- Central Limit Theorem states that the sum or average of a large number of independent and identically distributed random variables with finite variance will be approximately normally distributed
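A short sketch evaluating the normal PDF and CDF with `scipy.stats`; the mean and standard deviation are illustrative.

```python
from scipy import stats

mu, sigma = 100, 15                  # e.g. a test-score scale

dist = stats.norm(loc=mu, scale=sigma)
print(dist.pdf(100))                 # density at the mean, about 0.0266
print(dist.cdf(130) - dist.cdf(70))  # P(70 < X < 130), about 0.954 (within 2 sigma)
```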
Exponential distribution
- Models the time between events in a Poisson process, or the time until a specific event occurs
- Probability density function: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$, where $\lambda$ is the rate parameter
- Memoryless property: The probability of an event occurring in the next time interval does not depend on how much time has already passed
- Often used to model waiting times, such as the time between customer arrivals or the time until a machine failure
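The simulation below sketches the memoryless property for an arbitrarily chosen rate: among waiting times that have already exceeded s, the chance of lasting another t units matches the unconditional chance of exceeding t.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                                # rate parameter

waits = rng.exponential(scale=1 / lam, size=1_000_000)

# Memoryless property: P(X > s + t | X > s) = P(X > t)
s, t = 1.0, 2.0
lhs = np.mean(waits[waits > s] > s + t)  # conditional survival
rhs = np.mean(waits > t)                 # unconditional survival, = e^(-lam * t), about 0.368
print(lhs, rhs)
```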
Limit theorems
- Limit theorems are fundamental results in probability theory that describe the behavior of random variables and their distributions as the sample size increases
- They are crucial for making inferences and justifying statistical methods in causal inference
Law of large numbers
- States that the average of a large number of independent and identically distributed (i.i.d.) random variables will converge to their expected value as the sample size increases
- Weak law of large numbers: The sample mean converges in probability to the expected value
- Strong law of large numbers: The sample mean converges almost surely to the expected value
- Provides a theoretical justification for using sample averages to estimate population means
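A brief simulation of the law of large numbers for a biased coin with an arbitrary success probability of 0.3: the running sample mean drifts toward the expected value as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                # expected value of each Bernoulli draw

flips = rng.random(100_000) < p
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(n, running_mean[n - 1])      # approaches 0.3 as n grows
```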
Central limit theorem
- States that the sum or average of a large number of i.i.d. random variables with finite variance will be approximately normally distributed, regardless of the shape of the underlying distribution
- More precisely, if $X_1, X_2, ..., X_n$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$, then $\frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}}$ converges in distribution to a standard normal random variable as $n \to \infty$
- Allows for using normal-based inference methods, such as confidence intervals and hypothesis tests, for non-normal data when the sample size is large
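A small simulation of the central limit theorem using a heavily skewed exponential distribution (sample size and repetition count are arbitrary): the standardized sample means behave approximately like a standard normal variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 20_000
mu, sigma = 1.0, 1.0                 # mean and std of an Exponential(1) variable

# Draw many samples of size n from a skewed (exponential) distribution
samples = rng.exponential(scale=1.0, size=(reps, n))

# Standardize each sample sum: (sum - n*mu) / (sigma * sqrt(n))
z = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

# Compare to standard normal behavior: roughly 95% should lie within +/-1.96
print(np.mean(np.abs(z) < 1.96))     # close to 0.95
print(stats.skew(z))                 # skewness shrinks toward 0 as n grows
```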
Convergence in probability vs distribution
- Convergence in probability: A sequence of random variables $X_n$ converges in probability to a random variable $X$ if, for any $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$
- Convergence in distribution: A sequence of random variables $X_n$ converges in distribution to a random variable $X$ if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ for all continuity points $x$ of $F_X$, where $F_{X_n}$ and $F_X$ are the cumulative distribution functions of $X_n$ and $X$, respectively
- Convergence in probability is a stronger notion than convergence in distribution: it implies convergence in distribution, while the converse holds only in special cases (such as convergence to a constant)
- Both types of convergence are important in causal inference for establishing the asymptotic properties of estimators and test statistics
Probability in causal inference
- Probability plays a crucial role in causal inference by quantifying the uncertainty associated with cause-and-effect relationships
- It provides a framework for defining and estimating causal effects, assessing the strength of evidence, and making predictions under different scenarios
Probability of causation
- The probability of causation (PC) is the probability that an observed outcome would not have occurred in the absence of a particular cause, given that the cause was present and the outcome did occur
- Formally, $PC = P(Y_0 = 0 \mid X = 1, Y = 1)$, where $X = 1$ indicates that the cause was present, $Y = 1$ that the outcome was observed, and $Y_0$ is the counterfactual outcome under the absence of the cause
- Quantifies the extent to which a cause is responsible for an observed effect
- Helps in attributing outcomes to specific causes and making causal attributions
Probability of necessity and sufficiency
- The probability of necessity (PN) is the probability that the outcome would not have occurred had the cause been absent, given that both the cause and the outcome were observed
- Formally, $PN = P(Y_0 = 0 \mid X = 1, Y = 1)$, which coincides with the probability of causation above
- The probability of sufficiency (PS) is the probability that the outcome would have occurred had the cause been present, given that both the cause and the outcome were absent
- Formally, $PS = P(Y_1 = 1 \mid X = 0, Y = 0)$
- PN and PS provide information about the causal relationship between a cause and an effect
- High PN suggests that the cause is necessary for the effect, while high PS suggests that the cause is sufficient for the effect
Probability and counterfactuals
- Counterfactuals are hypothetical scenarios that describe what would have happened under different causal conditions
- In causal inference, counterfactuals are used to define causal effects and reason about cause-and-effect relationships
- Probability is used to express the uncertainty associated with counterfactual outcomes
- For example, the average causal effect (ACE) can be defined as $ACE = E[Y_1 - Y_0] = E[Y_1] - E[Y_0]$, where $Y_1$ and $Y_0$ are the potential outcomes under treatment and control, respectively
- Counterfactual probabilities, such as $P(Y_1 = 1)$ and $P(Y_0 = 1)$, are used to estimate causal effects from observational data
- Probability and counterfactuals provide a unified framework for causal reasoning and inference
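As a closing sketch, the simulation below generates both potential outcomes for every unit under a made-up outcome model, computes the true ACE, and shows that a randomized treatment assignment recovers it from observed data alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical binary potential outcomes: treatment raises P(Y = 1) from 0.3 to 0.5
y0 = (rng.random(n) < 0.3).astype(float)  # outcome each unit would have without treatment
y1 = (rng.random(n) < 0.5).astype(float)  # outcome each unit would have with treatment

# True average causal effect, computable only because both outcomes are simulated
ace_true = y1.mean() - y0.mean()          # about 0.2

# In practice each unit reveals only one potential outcome; randomization
# makes the treated/control comparison an unbiased estimate of the ACE
treated = rng.random(n) < 0.5
y_obs = np.where(treated, y1, y0)
ace_est = y_obs[treated].mean() - y_obs[~treated].mean()

print(ace_true, ace_est)                  # both close to 0.2
```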