3. BPC β Build Pure Clustersο
Type: Latent cluster discovery (measurement model)
Output: Disjoint clusters of observed variables interpreted as indicators of latent factors
Reference: Silva, Scheines, Glymour, Spirtes (JMLR, 2006)
BPC (Build Pure Clusters) is a tetrad-based latent clustering procedure inspired by the JMLR paper by Silva et al. It searches for groups of observed variables that:
Are mutually dependent (correlated enough to plausibly share a latent cause), and
Satisfy tetrad constraints consistent with a single latent factor,
Have at least three indicators (the paperβs rule to avoid fragile latents).
The implementation in Tetrad follows the spirit of the original paper but uses a practical, deterministic set of rules for growing, merging, and purifying clusters.
3.1. Basic Assumptionsο
BPC is intended for:
Continuous, approximately Gaussian indicators
Linear latent factor models
Pure clusters: each observed variable ideally loads on a single latent (no cross-loadings)
It operates on a covariance or correlation matrix, not on raw data, and uses Wilks-based rank tests for tetrads plus Fisher-Z tests for pairwise dependence.
3.2. High-Level Algorithmο
BPC proceeds in several stages:
3.2.1. 1. Build a dependence patternο
Compute a pairwise dependence mask
canLink[i][j]using Fisher-Z tests on correlations with a relatively loose alpha:canLink[i][j] = trueif variables i and j are significantly dependent.
This acts as a screen: tetrad checks only consider variables that are pairwise dependent.
Result: a graph-like pattern of variables that can plausibly belong to the same latent cluster.
3.2.2. 2. Enumerate tetrad seeds and grow local groupsο
For every 4-tuple of variables that passes the dependence screen:
Test whether all 3 tetrads in that quartet correspond to rank 1 (latent-factor compatible).
If so, that 4-tuple is a pure seed.
For each pure seed, BPC grows a locally maximal pure group:
Repeatedly try adding any other variable x:
If the enlarged set still passes all tetrad tests, keep x.
Stop when no more variables can be added without breaking purity.
Important: at this stage variables are not marked as βusedβ. Different seeds can grow into overlapping candidate groups.
3.2.3. 3. Global merging of compatible groupsο
From all locally grown candidate groups:
Merge rule:
Repeatedly consider pairs of groups A and B:Let U = A βͺ B.
If U is still tetrad-pure, and its average absolute correlation does not drop too much compared to A and B (by more than a small threshold), then merge:
Replace A and B by U.
This merging step is applied iteratively until no more compatible merges are found.
This stage encourages larger, more coherent clusters when a single latent can plausibly explain them.
3.2.4. 4. Resolve overlaps (variable assignment)ο
After merging, some variables may still appear in multiple groups. BPC resolves these overlaps globally:
For each variable that belongs to more than one group:
Compute its average absolute correlation with the other variables in each group.
Optionally also look at how many tetrads involving that variable pass within the group.
Assign the variable to the best-fitting group:
Highest compatibility score (average absolute correlation plus tie-breaking by tetrads).
Remove it from all other groups.
At the end of this step, the groups become disjoint: each variable is assigned to at most one cluster.
3.2.5. 5. Filter and finalize clustersο
Finally, BPC enforces two conditions:
Cluster size:
Drop any cluster with fewer than 3 indicators (as in the JMLR paper: latents with fewer than 3 children are discarded).
Purity check:
Make sure each remaining group:
Has at least 3 variables, and
Either is too small to form tetrads (size 3), or
Still passes the tetrad purity checks if size β₯ 4.
The output is a list of disjoint pure clusters, each intended to represent one latent variable.
3.3. Output and Interpretationο
The algorithm returns a list of clusters: each cluster is a set of observed variables believed to share a single latent parent.
Clusters are disjoint: no observed variable belongs to more than one latent factor in the BPC solution.
These clusters can be used to:
Build a measurement model (one latent per cluster), then
Run a structural discovery algorithm (e.g., PC, GFCI, BOSS-FCI) on the latents.
3.4. Parameters in Tetradο
Parameter |
Description |
|---|---|
|
Significance level for tetrad-based rank tests (purity checks). Smaller values demand stronger evidence for a latent structure. |
|
Effective sample size used in statistical tests; |
|
If true, logs details: seeds found, merges, overlap resolutions, and tetrad statistics. |
Some internal βknobsβ are fixed to practical defaults in the current implementation (not exposed as GUI parameters), such as:
alphaPairs: looser alpha for pairwise dependence screening,deltaMerge: small allowable drop in average absolute correlation when merging two groups.
3.5. Strengthsο
Follows the Silva et al. (2006) spirit:
Find pure measurement clusters using tetrads.
Apply global purification instead of greedy, once-through clustering.
Allows flexible resolution of overlaps and merges, which can help in noisy or borderline cases.
Works entirely from the covariance structure, without fitting full SEMs.
3.6. Limitationsο
Requires reasonably large sample sizes for reliable tetrad tests.
Assumes approximately Gaussian, linear, and pure measurement relations.
Will discard latent candidates with fewer than 3 indicators, even if they exist in the data.
Not designed to model cross-loadings; those variables will be forced into a single cluster or may cause that cluster to be rejected.
3.7. Referenceο
Silva, R., Scheines, R., Glymour, C., & Spirtes, P. (2006). Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7, 191β246.
3.8. Summaryο
BPC builds latent clusters by:
Screening for pairwise dependence,
Finding tetrad-pure seeds,
Growing them to locally maximal groups,
Globally merging compatible groups,
Resolving overlaps by assigning each variable to its best-fitting group, and
Dropping any small or impure groups.
The result is a set of pure, disjoint indicator clusters that can be interpreted as latent factors and used as input to further causal structure learning.