12. G-Square Test
12.1. Summary
The G-square test (likelihood ratio chi-square) is a test of (conditional) independence for discrete variables. It compares the likelihood of a model in which two variables are independent given a conditioning set S with that of a model in which they are allowed to depend on each other given S.
12.2. When to use
Variables are discrete (categorical).
Sample sizes per cell are reasonably large.
You want a standard, likelihood-based CI test for PC, CPC, FCI, RFCI, or other constraint-based algorithms on discrete data.
12.3. Assumptions
Variables are discrete with a manageable number of categories.
Expected cell counts in the contingency tables are not too small (as for chi-square-type tests generally).
The multinomial model for counts is a reasonable approximation.
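The cell-count assumption can be checked before trusting a p-value. The following is a minimal sketch, not the implementation used by any particular tool: it computes expected counts under independence for a single two-way table of observed counts and reports the share of cells below a threshold. The function names and the threshold of 5 (a common rule of thumb for chi-square-type tests) are assumptions made for illustration, and the table is assumed to be non-empty.

```python
def expected_counts(table):
    """Expected counts under independence for a 2-D table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Under independence, E[i][j] = row_total[i] * col_total[j] / total.
    return [[r * c / total for c in col_totals] for r in row_totals]


def fraction_small(table, threshold=5.0):
    """Share of cells whose expected count is below `threshold`.

    The default of 5 is an illustrative rule of thumb, not a fixed standard.
    """
    cells = [e for row in expected_counts(table) for e in row]
    return sum(e < threshold for e in cells) / len(cells)


# Hypothetical observed counts: the second row is very sparse.
sparse = [[20, 30], [1, 2]]
balanced = [[25, 25], [25, 25]]
print(fraction_small(sparse), fraction_small(balanced))
```

For a conditional test, the same check would be applied to the X-by-Y table within each configuration of the conditioning set.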
12.4. Test details (conceptual)
For each candidate independence X ⟂ Y | S, the G-square test:
Constructs a contingency table of X by Y for each configuration of the variables in S.
Compares the log-likelihood of the full model (X and Y possibly dependent given S) to the restricted model (X and Y independent given S).
Forms the test statistic G² = 2 * (logL_full − logL_restricted).
Computes a p-value from the approximate chi-square distribution of G² under the null hypothesis, with degrees of freedom equal to the difference in the number of free parameters between the two models.
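The steps above can be sketched in a few lines of Python. For multinomial counts, 2 * (logL_full − logL_restricted) reduces to 2 * Σ O * ln(O / E), where O is an observed cell count and E = N(x, s) * N(y, s) / N(s) is the expected count under conditional independence. This is a minimal illustration under stated assumptions, not the code of any particular library: the function name `g_square` and the row-of-dicts data layout are hypothetical, and the degrees of freedom are taken as (|X| − 1)(|Y| − 1) times the number of configurations of S that actually occur in the data, which is only one of several conventions.

```python
import math
from collections import Counter


def g_square(rows, x, y, s=()):
    """Return (G2, df) for the hypothesis x independent of y given s.

    rows: list of dicts mapping variable name -> category.
    x, y: variable names; s: tuple of conditioning variable names.
    """
    n_xys = Counter()  # counts of (x value, y value, s-configuration)
    n_xs = Counter()   # counts of (x value, s-configuration)
    n_ys = Counter()   # counts of (y value, s-configuration)
    n_s = Counter()    # counts of each s-configuration
    for row in rows:
        cfg = tuple(row[v] for v in s)
        n_xys[(row[x], row[y], cfg)] += 1
        n_xs[(row[x], cfg)] += 1
        n_ys[(row[y], cfg)] += 1
        n_s[cfg] += 1

    # Sum over observed cells only; empty cells contribute zero.
    g2 = 0.0
    for (xv, yv, cfg), observed in n_xys.items():
        expected = n_xs[(xv, cfg)] * n_ys[(yv, cfg)] / n_s[cfg]
        g2 += 2.0 * observed * math.log(observed / expected)

    # Assumption: categories and configurations are those seen in the data.
    x_levels = len({row[x] for row in rows})
    y_levels = len({row[y] for row in rows})
    df = (x_levels - 1) * (y_levels - 1) * len(n_s)
    return g2, df


# Hypothetical data in which X and Y always agree (strong dependence).
rows = [{"X": 0, "Y": 0}] * 10 + [{"X": 1, "Y": 1}] * 10
g2, df = g_square(rows, "X", "Y")
print(g2, df)
```

A p-value can then be read from the upper tail of the chi-square distribution with `df` degrees of freedom, for example with `scipy.stats.chi2.sf(g2, df)` if SciPy is available; independence is rejected when that p-value falls below the chosen significance level.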
12.5. Parameters
| Parameter (camelCase) | Description |
|---|---|
| | Significance level (p-value cutoff) for the G² likelihood-ratio test of (conditional) independence. The null hypothesis is that the variables are independent given the conditioning set. P-values below |
| | Minimum allowed count in each cell of the contingency table. If some cells fall below this threshold, the asymptotic chi-square approximation for the G² statistic becomes less reliable. Increasing this value can improve accuracy but may reduce power when sample size is small. Default is 1; minimum is 1; maximum is 1,000,000. |
| | Optimization choice for how to build contingency tables: |
12.6. Strengths
Standard likelihood-based test for discrete contingency tables.
Works naturally with multinomial models used in discrete Bayes nets.
Symmetric in X and Y and straightforward to interpret.
12.7. Limitations
Can be unreliable when sample sizes per cell are small.
Complexity can grow quickly with the number of categories and conditioning variables.
Not suitable for continuous variables without discretization.
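To make the growth in complexity concrete, the sketch below counts the cells and degrees of freedom of the stratified table for a test of X ⟂ Y | S, given only the number of categories of each variable. The function name `table_size`, the category counts, and the sample size of 1,000 are illustrative assumptions.

```python
from math import prod


def table_size(x_levels, y_levels, s_levels=()):
    """Return (cells, df) for a test of X vs. Y given S from category counts."""
    configs = prod(s_levels)  # product of an empty sequence is 1
    cells = x_levels * y_levels * configs
    df = (x_levels - 1) * (y_levels - 1) * configs
    return cells, df


n = 1000  # hypothetical sample size
for k in range(4):
    # X and Y have 3 categories each; S holds k variables with 4 categories each.
    cells, df = table_size(3, 3, (4,) * k)
    print(f"|S|={k}: cells={cells}, df={df}, avg per cell={n / cells:.1f}")
```

Each added conditioning variable multiplies the number of cells by its number of categories, so with three 4-category conditioning variables the 1,000 samples are spread over 576 cells, fewer than two per cell on average.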
12.8. References
Agresti, A. (2002). Categorical Data Analysis (2nd ed.). Wiley.