Chi-square tests are powerful tools for analyzing categorical data. They come in three flavors: goodness-of-fit, independence, and homogeneity tests. Each type helps us understand different aspects of categorical data distributions and relationships.
These tests use observed and expected frequencies to calculate a chi-square statistic. This statistic, along with degrees of freedom, determines the p-value or critical value for making statistical decisions about our data.
Chi-Square Tests
Goodness-of-fit vs independence vs homogeneity tests
- Goodness-of-fit test assesses whether a sample of categorical data matches a hypothesized distribution by comparing observed frequencies to expected frequencies based on the hypothesized distribution (colors of M&Ms in a bag vs claimed proportions)
- Independence test evaluates if two categorical variables are independent of each other by comparing observed frequencies to expected frequencies assuming independence (relationship between gender and product preference)
- Homogeneity test determines if the distribution of a categorical variable remains consistent across multiple populations by comparing observed frequencies to expected frequencies assuming homogeneity (proportion of voters supporting a candidate across different age groups)
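The comparison of observed and expected frequencies in a goodness-of-fit test can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names `expected_counts` and `chi_square_statistic` are hypothetical, and the color counts and claimed proportions are made-up numbers chosen to keep the arithmetic simple.

```python
def expected_counts(observed, proportions):
    """Expected count per category: sample size times hypothesized proportion."""
    n = sum(observed)
    return [n * p for p in proportions]


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for four candy colors (illustrative, not real data).
observed = [30, 20, 25, 25]
# Hypothetical claimed proportions; they must sum to 1.
claimed = [0.25, 0.25, 0.25, 0.25]

expected = expected_counts(observed, claimed)            # 25 per color
statistic = chi_square_statistic(observed, expected)     # 2.0
```

Each category contributes its own squared gap between observed and expected counts, scaled by the expected count, so categories with small expected counts can dominate the total.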
Null and alternative hypotheses for chi-square
- Goodness-of-fit test
  - $H_0$: The population follows the hypothesized distribution
  - $H_a$: The population does not follow the hypothesized distribution
- Independence test
  - $H_0$: The two categorical variables are independent
  - $H_a$: The two categorical variables are not independent (associated)
- Homogeneity test
  - $H_0$: The distribution of the categorical variable is the same across all populations
  - $H_a$: The distribution of the categorical variable differs for at least one population
Populations and variables in chi-square tests
- Goodness-of-fit test involves one population and one categorical variable, comparing the observed distribution to a hypothesized distribution (colors of M&Ms in a single bag vs expected proportions)
- Independence test examines one population with two categorical variables, investigating the relationship between the two variables (gender and product preference within a single market)
  - Data for independence tests is typically organized in a contingency table
- Homogeneity test compares two or more populations using one categorical variable, assessing if the variable's distribution remains consistent across the populations (voter support for a candidate across different age groups or geographic regions)
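For independence and homogeneity tests, the expected count in each cell of a contingency table is usually computed as (row total × column total) / grand total, with degrees of freedom $(r - 1)(c - 1)$ for $r$ rows and $c$ columns. The sketch below shows that arithmetic under stated assumptions: the function names are hypothetical and the 2×3 table is invented for illustration.

```python
def expected_under_independence(table):
    """Expected cell counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1) for a rectangular contingency table."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical 2x3 table (illustrative, not real data):
# rows are two groups, columns are three product choices.
table = [
    [20, 30, 50],
    [30, 30, 40],
]

expected = expected_under_independence(table)
df = degrees_of_freedom(table)
```

The same calculation serves both tests; what differs is the sampling design (one sample classified on two variables versus separate samples from each population).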
Chi-Square Test Statistics and Interpretation
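- The chi-square statistic measures the overall difference between observed and expected frequencies: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ and $E$ are the observed and expected frequencies in each category or cell
- Degrees of freedom depend on the test: $k - 1$ for a goodness-of-fit test with $k$ categories (when no parameters are estimated from the data), and $(r - 1)(c - 1)$ for independence and homogeneity tests on a table with $r$ rows and $c$ columns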
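- The p-value is the right-tail area of the chi-square distribution, with the test's degrees of freedom, beyond the calculated statistic; a p-value below the significance level $\alpha$ leads to rejecting $H_0$
- Alternatively, a critical value for the chosen $\alpha$ can be read from a chi-square distribution table; $H_0$ is rejected when the calculated chi-square statistic exceeds it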
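The step from a chi-square statistic to a decision can be sketched with the Python standard library alone. The helper `chi_square_sf` is a hypothetical name; it uses the closed-form right-tail probability that holds for whole-number degrees of freedom. The statistic, degrees of freedom, and significance level below are assumed values for illustration. In practice a statistics library would normally handle this step.

```python
import math


def chi_square_sf(x, df):
    """Right-tail probability P(X >= x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum
    # involving gamma at half-integer arguments.
    total = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return total


# Assumed inputs: a goodness-of-fit test with 4 categories (df = 4 - 1 = 3).
statistic = 2.0
df = 3
alpha = 0.05

p_value = chi_square_sf(statistic, df)   # roughly 0.57
reject_null = p_value < alpha            # False: no evidence against H0
```

The critical-value approach gives the same decision: the commonly tabulated critical value for 3 degrees of freedom at $\alpha = 0.05$ is about 7.815, and a statistic of 2.0 falls well below it.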