3. BPC — Build Pure Clusters

Type: Latent cluster discovery (measurement model)
Output: Disjoint clusters of observed variables interpreted as indicators of latent factors
Reference: Silva, Scheines, Glymour, Spirtes (JMLR, 2006)

BPC (Build Pure Clusters) is a tetrad-based latent clustering procedure inspired by the JMLR paper by Silva et al. It searches for groups of observed variables that:

Are mutually dependent (correlated enough to plausibly share a latent cause),
Satisfy tetrad constraints consistent with a single latent factor, and
Have at least three indicators (the paper’s rule to avoid fragile latents).

The implementation in Tetrad follows the spirit of the original paper but uses a practical, deterministic set of rules for growing, merging, and purifying clusters.

3.1. Basic Assumptions

BPC is intended for:

Continuous, approximately Gaussian indicators
Linear latent factor models
Pure clusters: each observed variable ideally loads on a single latent (no cross-loadings)

It operates on a covariance or correlation matrix, not on raw data, and uses Wilks-based rank tests for tetrads plus Fisher-Z tests for pairwise dependence.
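
For reference, the two kinds of test rest on the following standard quantities. The notation is introduced here for illustration only and is not taken from the Tetrad implementation: \sigma_{ij} is the covariance of indicators i and j, \lambda_i is the loading of indicator i on the latent L, r is a sample correlation, and n is the sample size. The sketch assumes independent error terms.

```latex
% One-factor (pure) measurement model
x_i = \lambda_i L + \varepsilon_i, \qquad
\sigma_{ij} = \lambda_i \lambda_j \operatorname{Var}(L) \quad (i \neq j)

% Consequence: all three tetrad differences of any four indicators vanish
\sigma_{ij}\sigma_{kl} - \sigma_{ik}\sigma_{jl} = 0, \qquad
\sigma_{ij}\sigma_{kl} - \sigma_{il}\sigma_{jk} = 0, \qquad
\sigma_{ik}\sigma_{jl} - \sigma_{il}\sigma_{jk} = 0

% Fisher z statistic for H_0: \rho = 0 (approximately standard normal under H_0)
z = \sqrt{n - 3}\,\operatorname{artanh}(r)
  = \frac{\sqrt{n - 3}}{2} \ln\frac{1 + r}{1 - r}
```

Each tetrad difference is the determinant of a 2×2 cross-covariance submatrix, which is why a vanishing tetrad can be tested as a rank-1 constraint. Only two of the three differences are algebraically independent.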

3.2. High-Level Algorithm

BPC proceeds in several stages:

3.2.1. Step 1: Build a dependence pattern

Compute a pairwise dependence mask canLink[i][j] using Fisher-Z tests on correlations with a relatively loose alpha (the internal alphaPairs setting; see Section 3.4):
- canLink[i][j] = true if variables i and j are significantly dependent.
This acts as a screen: tetrad checks only consider variables that are pairwise dependent.

Result: a graph-like pattern of variables that can plausibly belong to the same latent cluster.
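
The screening step can be sketched in a few lines of Python. This is a minimal illustration, not the Tetrad code: the function names, the default `alpha_pairs=0.1`, and the toy correlation matrix are all assumptions made for the example.

```python
import math


def fisher_z_pvalue(r, n):
    """Two-sided p-value for H0: correlation = 0, via Fisher's z-transform."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def build_can_link(corr, n, alpha_pairs=0.1):
    """Return a symmetric boolean mask; True means the pair is significantly dependent.

    alpha_pairs is a hypothetical default, not the value used in Tetrad.
    """
    p = len(corr)
    can_link = [[False] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            dependent = fisher_z_pvalue(corr[i][j], n) < alpha_pairs
            can_link[i][j] = can_link[j][i] = dependent
    return can_link


# Toy correlation matrix: variables 0 and 1 are strongly correlated,
# variable 2 is nearly uncorrelated with both.
corr = [
    [1.00, 0.60, 0.02],
    [0.60, 1.00, 0.01],
    [0.02, 0.01, 1.00],
]
can_link = build_can_link(corr, n=500)
```

With these numbers only the pair (0, 1) passes the screen, so variable 2 would never enter a tetrad check.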

3.2.2. Step 2: Enumerate tetrad seeds and grow local groups

For every 4-tuple of variables that passes the dependence screen:
- Test whether all three tetrad constraints of that 4-tuple hold, i.e. every 2×2 cross-covariance submatrix has rank 1 (compatible with a single latent factor).
- If so, that 4-tuple is a pure seed.
For each pure seed, BPC grows a locally maximal pure group:
- Repeatedly try adding any other variable x:
  - If the enlarged set still passes all tetrad tests, keep x.
- Stop when no more variables can be added without breaking purity.

Important: at this stage variables are not marked as “used”. Different seeds can grow into overlapping candidate groups.
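
The seed-and-grow logic can be illustrated as follows. This sketch makes two simplifying assumptions: it replaces the Wilks-based rank test with a fixed tolerance `tol` on the raw tetrad differences, and it treats every pair as having already passed the dependence screen. All names and the toy matrix are hypothetical.

```python
from itertools import combinations


def quartet_is_pure(c, i, j, k, l, tol=0.05):
    """True if all three tetrad differences are near zero.

    A fixed tolerance stands in for the statistical rank test (assumption).
    """
    t1 = c[i][j] * c[k][l] - c[i][k] * c[j][l]
    t2 = c[i][j] * c[k][l] - c[i][l] * c[j][k]
    t3 = c[i][k] * c[j][l] - c[i][l] * c[j][k]
    return max(abs(t1), abs(t2), abs(t3)) < tol


def group_is_pure(c, group, tol=0.05):
    """True if every 4-subset of the group passes the tetrad check."""
    return all(quartet_is_pure(c, *q, tol) for q in combinations(sorted(group), 4))


def grow(c, seed, candidates, tol=0.05):
    """Greedily add variables while the enlarged set stays pure."""
    group = set(seed)
    changed = True
    while changed:
        changed = False
        for x in candidates:
            if x not in group and group_is_pure(c, group | {x}, tol):
                group.add(x)
                changed = True
    return group


# Toy one-factor correlation matrix; variables 0 and 4 additionally share
# correlated error, which breaks purity for any group containing both.
loadings = [0.9, 0.8, 0.7, 0.6, 0.7, 0.8]
p = len(loadings)
c = [[1.0 if i == j else loadings[i] * loadings[j] for j in range(p)]
     for i in range(p)]
c[0][4] = c[4][0] = c[0][4] + 0.25

seeds = [q for q in combinations(range(p), 4) if quartet_is_pure(c, *q)]
groups = {frozenset(grow(c, s, range(p))) for s in seeds}
```

Here the seeds grow into two overlapping candidate groups, one containing variable 0 and one containing variable 4, which is exactly the situation the later merging and overlap-resolution stages have to handle.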

3.2.3. Step 3: Global merging of compatible groups

The locally grown candidate groups are then merged wherever possible.

Merge rule:
Repeatedly consider pairs of groups A and B:
- Let U = A ∪ B.
- If U is still tetrad-pure and its average absolute correlation drops by no more than a small threshold (deltaMerge) relative to A and B, then merge:
  - Replace A and B by U.
This merging step is applied iteratively until no more compatible merges are found.

This stage encourages larger, more coherent clusters when a single latent can plausibly explain them.
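
A possible reading of the merge rule is sketched below. Several details are assumptions rather than documented behaviour: the union is compared against the weaker of the two groups, `delta_merge=0.05` is an arbitrary illustrative value, and purity is again checked with a fixed tolerance instead of a statistical test.

```python
from itertools import combinations


def is_pure(c, group, tol=0.05):
    """Tolerance-based stand-in for the tetrad purity test (assumption)."""
    for i, j, k, l in combinations(sorted(group), 4):
        a = c[i][j] * c[k][l]
        b = c[i][k] * c[j][l]
        d = c[i][l] * c[j][k]
        if max(abs(a - b), abs(a - d), abs(b - d)) >= tol:
            return False
    return True


def avg_abs_corr(c, group):
    """Mean absolute correlation over all pairs in the group."""
    pairs = list(combinations(sorted(group), 2))
    return sum(abs(c[i][j]) for i, j in pairs) / len(pairs)


def merge_groups(c, groups, delta_merge=0.05, tol=0.05):
    """Merge pairs of groups until no compatible pair remains."""
    groups = [set(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(groups)), 2):
            union = groups[a] | groups[b]
            # Assumption: compare against the weaker of the two groups.
            baseline = min(avg_abs_corr(c, groups[a]), avg_abs_corr(c, groups[b]))
            if is_pure(c, union, tol) and avg_abs_corr(c, union) >= baseline - delta_merge:
                groups = [g for idx, g in enumerate(groups) if idx not in (a, b)]
                groups.append(union)
                merged = True
                break  # restart the scan with the updated list
    return groups


# Toy two-factor model: variables 0-5 load on one latent, 6-9 on another,
# and the two latents have correlation rho.
loadings = [0.9, 0.8, 0.7, 0.6, 0.7, 0.8, 0.8, 0.7, 0.9, 0.6]
factor = [0] * 6 + [1] * 4
rho = 0.3
p = len(loadings)
c = [[1.0 if i == j else
      loadings[i] * loadings[j] * (1.0 if factor[i] == factor[j] else rho)
      for j in range(p)] for i in range(p)]

result = merge_groups(c, [{0, 1, 2, 3}, {2, 3, 4, 5}, {6, 7, 8, 9}])
```

The two overlapping groups from the first latent merge into one, while the group from the second latent stays separate because the combined set fails the tetrad check.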

3.2.4. Step 4: Resolve overlaps (variable assignment)

After merging, some variables may still appear in multiple groups. BPC resolves these overlaps globally:

For each variable that belongs to more than one group:
- Compute its average absolute correlation with the other variables in each group.
- Optionally also look at how many tetrads involving that variable pass within the group.
- Assign the variable to the best-fitting group:
  - The group with the highest average absolute correlation, with ties broken by the number of passing tetrads.
- Remove it from all other groups.

At the end of this step, the groups become disjoint: each variable is assigned to at most one cluster.
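
The assignment step might look like the following sketch. It uses only the average absolute correlation as the compatibility score and omits the optional tetrad-based tie-breaking; the function name and the toy matrix are hypothetical.

```python
def resolve_overlaps(c, groups):
    """Assign each shared variable to the group it correlates with most strongly.

    Scores are computed against current membership, so earlier removals can
    affect later decisions (a simplification assumed for this sketch).
    """
    groups = [set(g) for g in groups]

    def score(v, group):
        others = [u for u in group if u != v]
        if not others:
            return 0.0
        return sum(abs(c[v][u]) for u in others) / len(others)

    all_vars = set().union(*groups)
    shared = sorted(v for v in all_vars if sum(v in g for g in groups) > 1)
    for v in shared:
        owners = [g for g in groups if v in g]
        best = max(owners, key=lambda g: score(v, g))
        for g in owners:
            if g is not best:
                g.discard(v)
    return groups


# Toy correlations: variable 2 is strongly tied to {0, 1} and only weakly to {3, 4}.
c = [
    [1.0, 0.6, 0.7, 0.1, 0.1],
    [0.6, 1.0, 0.7, 0.1, 0.1],
    [0.7, 0.7, 1.0, 0.2, 0.2],
    [0.1, 0.1, 0.2, 1.0, 0.5],
    [0.1, 0.1, 0.2, 0.5, 1.0],
]
disjoint = resolve_overlaps(c, [{0, 1, 2}, {2, 3, 4}])
```

Variable 2 stays with the first group, leaving the second group with only two members, which the final filtering stage would then drop.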

3.2.5. Step 5: Filter and finalize clusters

Finally, BPC enforces two conditions:

Cluster size:
- Drop any cluster with fewer than 3 indicators (as in the JMLR paper: latents with fewer than 3 children are discarded).
Purity check:
- Keep a remaining group only if it:
  - Has exactly 3 variables (too few to form a tetrad), or
  - Has 4 or more variables and still passes the tetrad purity checks.

The output is a list of disjoint pure clusters, each intended to represent one latent variable.
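
The two final conditions can be combined into a single filter, sketched below. As in the description, a group of exactly three variables passes automatically because it contains no 4-subsets to test. The fixed tolerance and the toy matrix are assumptions made for illustration.

```python
from itertools import combinations


def is_pure(c, group, tol=0.05):
    """Tolerance-based stand-in for the tetrad purity test (assumption).

    Returns True for groups with fewer than 4 members, since no tetrads exist.
    """
    for i, j, k, l in combinations(sorted(group), 4):
        a = c[i][j] * c[k][l]
        b = c[i][k] * c[j][l]
        d = c[i][l] * c[j][k]
        if max(abs(a - b), abs(a - d), abs(b - d)) >= tol:
            return False
    return True


def finalize(c, groups, min_size=3, tol=0.05):
    """Keep only groups that are large enough and still pure."""
    return [set(g) for g in groups if len(g) >= min_size and is_pure(c, g, tol)]


# Toy matrix: every pair has correlation 0.49, except that variables 4 and 5
# share extra covariance, which makes any 4-group containing both impure.
p = 13
c = [[1.0 if i == j else 0.49 for j in range(p)] for i in range(p)]
c[4][5] = c[5][4] = 0.74

clusters = finalize(c, [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9}, {10, 11, 12}])
```

In this example the second group is rejected as impure and the third as too small, so two clusters remain.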

3.3. Output and Interpretation

The algorithm returns a list of clusters: each cluster is a set of observed variables believed to share a single latent parent.
Clusters are disjoint: no observed variable belongs to more than one latent factor in the BPC solution.
These clusters can be used to:
- Build a measurement model (one latent per cluster), then
- Run a structural discovery algorithm (e.g., PC, GFCI, BOSS-FCI) on the latents.

3.4. Parameters in Tetrad

Parameter	Description
`alpha`	Significance level for tetrad-based rank tests (purity checks). The null hypothesis of each test is that the tetrad constraint holds, so smaller values reject purity less readily.
`ess`	Effective sample size used in statistical tests; `-1` means use the actual sample size.
`verbose`	If true, logs details: seeds found, merges, overlap resolutions, and tetrad statistics.

Some internal “knobs” are fixed to practical defaults in the current implementation (not exposed as GUI parameters), such as:

`alphaPairs`: looser alpha for pairwise dependence screening,
`deltaMerge`: small allowable drop in average absolute correlation when merging two groups.

3.5. Strengths

Follows the spirit of Silva et al. (2006):
- Find pure measurement clusters using tetrads.
- Apply global purification instead of greedy, once-through clustering.
Allows flexible resolution of overlaps and merges, which can help in noisy or borderline cases.
Works entirely from the covariance structure, without fitting full SEMs.

3.6. Limitations

Requires reasonably large sample sizes for reliable tetrad tests.
Assumes approximately Gaussian, linear, and pure measurement relations.
Discards latent candidates with fewer than 3 indicators, even if the corresponding latents genuinely exist.
Not designed to model cross-loadings; a variable that loads on more than one latent will be forced into a single cluster or may cause that cluster to be rejected.

3.7. Reference

Silva, R., Scheines, R., Glymour, C., & Spirtes, P. (2006). Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7, 191–246.

3.8. Summary

BPC builds latent clusters by:

Screening for pairwise dependence,
Finding tetrad-pure seeds,
Growing them to locally maximal groups,
Globally merging compatible groups,
Resolving overlaps by assigning each variable to its best-fitting group, and
Dropping any small or impure groups.

The result is a set of pure, disjoint indicator clusters that can be interpreted as latent factors and used as input to further causal structure learning.