# Chi-Square Test

## Summary

The Chi-square test of independence is a standard contingency-table test for
**discrete** variables. In Tetrad, it is used as a CI test for categorical
variables by comparing observed counts to expected counts under independence.

## When to use

- Data are **discrete** (categorical).
- You want a classical Pearson chi-square test instead of the likelihood ratio
  (G-square) test.
- Sample sizes per cell are moderately large.

## Assumptions

- Multinomial sampling with fixed margins is approximately valid.
- Expected cell counts are not too small (a common rule of thumb is at least 5
  in most cells).
- Variables and conditioning sets are discrete with moderate arity.

## Test details (conceptual)

For each candidate independence X ⟂ Y | S:

1. Form contingency tables of counts for X and Y given each configuration of S.
2. Compute expected counts under the assumption that X and Y are independent
   given S.
3. Compute Pearson’s chi-square statistic as the sum over cells of
   (observed − expected)² / expected.
4. Use a chi-square distribution with appropriate degrees of freedom to obtain
   a p-value.

## Parameters

| Parameter (camelCase)   | Description |
|-------------------------|-------------|
| `alpha`                 | Significance level (p-value cutoff) for the chi-square test of (conditional) independence. The null hypothesis is that the variables are independent given the conditioning set. P-values below `alpha` lead to rejection. Smaller values make the test more conservative (fewer edges); larger values make the graph denser. Typical range: 0.0–1.0. |
| `minCountPerCell`       | Minimum allowed count in each cell of the contingency table. If some cells fall below this threshold, the chi-square approximation becomes less reliable. Increasing this value can improve accuracy but may reduce power when sample size is small. Default is 1; minimum is 1; maximum is 1,000,000. |
| `cellTableType`         | Optimization choice for how to build contingency tables: `1 = AD Tree`, `2 = Count Sample`. This affects how counts are computed internally (data structure / performance), but should not change the numerical results. Default is 1 (AD Tree). |
| `effectiveSampleSize`   | The effective sample size to use in computing p-values. If set to `-1` (the default), the actual data sample size is used. If set to a positive integer, the test behaves as if that were the sample size, which can be useful for reweighted or subsampled data. |

## Strengths

- Widely known and understood.
- Easy to implement and interpret.
- Works well when cell counts are sufficiently large.

## Limitations

- Performs poorly with **sparse tables** (many small expected counts).
- Not appropriate for continuous data without discretization.
- As conditioning sets grow, tables can become very large and sparse.

## References

- Agresti, A. (2002). *Categorical Data Analysis* (2nd ed.). Wiley.