50. TSC — Trek Separation Clusters
Category: Latent clusters / measurement structure
Code: edu.cmu.tetrad.search.Tsc
TSC (“Trek Separation Clusters”) is a rank-based latent cluster finder. Given a covariance (or correlation) matrix over observed variables, it searches for disjoint groups of indicators that behave as if they share one or more latent parents.
The implementation in Tetrad is a NOLAC version (“no overlapping clusters”): each observed variable is assigned to at most one cluster.
50.1. Intended use
Use TSC when:
You believe your data come (approximately) from a linear-Gaussian latent variable model with pure indicators.
Each observed variable should load on one latent at most (no overlapping measurement clusters).
You want an automatic partition of variables into clusters that can then be fed into higher-level latent-structure algorithms (e.g., FOFC/FTFC/GFFC/BPC or manual SEM modeling).
TSC works directly on a covariance matrix (internally converted to a correlation matrix) and is therefore agnostic to the original sample size and data format, provided you give a correct effective sample size.
50.2. Model assumptions (NOLAC version)
TSC is designed to be (asymptotically) sound under the following stylized assumptions:
Linear-Gaussian SEM with a latent DAG
Latent variables form a DAG.
Observed indicators are linear functions of their latent parents plus independent Gaussian noise.
Pure measurement within clusters
Each observed variable has exactly one latent parent (within the latent structure you care about).
Unique errors of indicators belonging to different clusters are independent.
No overlapping clusters (NOLAC)
Each observed variable belongs to at most one true cluster.
Generic parameters
There are no exact parameter cancellations that artificially change matrix ranks.
Consistent rank test
Cross-block matrix rank is tested using Wilks’ likelihood-ratio test (or a similar rank test), with a significance level
alphathat decreases withn(for example,alpha = 1 / log n) or with an information-criterion cutoff.
Under these assumptions, the algorithm aims to recover, asymptotically, the true indicator clusters and their “boundary ranks” (typically 1 for a single latent).
50.3. High-level algorithm sketch
Let V be the index set of all observed variables, and let Sigma be the
correlation matrix. For a candidate cluster C ⊂ V, define D = V \ C.
TSC examines the rank of the cross-covariance block Sigma[C, D].
Very roughly:
Base clusters by rank
For each rank
r = 0, …, rMax:Let the base size be
|C| = r + 1.Enumerate all subsets
Cof that size.Keep those where
rank(Sigma[C, D]) = raccording to a Wilks rank test with parameters
(nEff, alpha).
Seed and grow clusters (union / extension)
Starting from unused base subsets at rank
r, pick a seed cluster.Try to union it with other overlapping base subsets at the same rank.
Accept a union
C' = C ∪ C_candidatewhen:rank(Sigma[C', V \ C'])is stillr, and|C'| >= r + 1 + minRedundancy.
This iteratively produces maximal rank-
rclusters that do not overlap previously committed variables.
Bifactor-style augmentation
For each new cluster
C1, look for a nearby variable that, when added, causes a one-step rank drop:rank(Sigma[C1, V \ C1]) = rrank(Sigma[C2, V \ C2]) = r - 1, whereC2 = C1 ∪ {x}and
|C2| = |C1| + 1.
This step is designed to capture bifactor-like or multi-dimensional boundaries in which adding a “spanning” indicator reduces cross-rank by 1.
Candidate augmentations that create internal rank-0 splits are rejected (see step 5).
Conditional-rank refinement (Rule 3)
For each candidate cluster
Cof intended rankr_C:Let
D = V \ C.Search for subsets
Z ⊂ Cwith|Z| >= r_Csuch thatrank(C \ Z, D | Z) = 0.If such a
Zis found, it is treated as an observed bottleneck or mediator; the variables inZare removed fromC.Repeat until no such
Zremains or the cluster becomes too small.
Intuition: in a purely observed DAG (no latents), a mediator can d-separate
C \ ZfromDwhen conditioned on. This pattern should not persist for a genuine latent cluster under the measurement assumptions above.Internal rank-0 check
Finally, for each refined cluster
C:Examine nontrivial splits
C = C1 ∪ C2and testrank(C1, C2) = 0.If a cluster hides such a rank-0 partition inconsistent with its assigned rank, it is rejected.
Output
The method
findClusters()returns a mapMap<Set<Integer>, Integer> // cluster -> rank
where each
Set<Integer>indexes a disjoint indicator cluster and the associatedIntegeris the estimated cross-rank of that cluster.
50.4. Inputs and outputs
50.4.1. Inputs
TSC is constructed from:
List<Node> variablesCovarianceMatrix cov
Internally, the covariance is converted to a correlation matrix and the
sample size n is read from cov.getSampleSize().
50.4.2. Output
Map<Set<Integer>, Integer> clusterToRank
Each entry:
Key: a cluster (set of integer indices into the original variable list).
Value: the estimated cross-rank at the boundary with the rest of the variables. For single-latent clusters this is typically
1.
In the GUI and wrapper algorithms, these indices are converted back to variable names and used to define latent-indicator blocks.
50.5. Key parameters
These are set on the Tsc object before calling findClusters():
Parameter |
Method |
Meaning |
|---|---|---|
|
|
Significance level for Wilks rank tests. Lower values make rank increases/decreases harder to accept. |
|
|
Effective sample size. Use |
|
|
Maximum rank to consider. The search runs ranks |
|
|
Minimum extra indicators beyond |
|
|
If |
50.6. Practical guidance
Alpha scheduling. For large samples, consider an
n-dependentalpha(for examplealpha = 1 / log n) or an information-criterion-based cutoff to control false rank changes.Redundancy. Set
minRedundancy >= 1to avoid unstable clusters of size exactlyr + 1, which cannot be cross-checked internally.Effective sample size. If your covariance comes from:
weighted data,
time-series with autocorrelation, or
a pre-averaged dataset,
adjust
nEffaccordingly to avoid overly optimistic rank tests.NOLAC limitation. This implementation enforces disjoint clusters. If your theory requires indicators to belong to multiple factors, TSC in its current form is not appropriate.
50.7. Limitations
Designed for linear-Gaussian models; non-Gaussian data can still work but the Wilks-based rank test may be less well calibrated.
Assumes pure measurement and no overlapping clusters. Violation of these assumptions can fragment or merge clusters in undesirable ways.
Sensitive to sampling noise when:
sample sizes are small, or
true cross-ranks are close to the decision boundary.
In such cases, adjust alpha, nEff, and minRedundancy, and inspfect clusters
visually or via domain knowledge.
50.9. Summary
TSC is a rank-based latent cluster finder designed for linear-Gaussian models with pure indicators.
It partitions observed variables into non-overlapping latent clusters by searching for subsets whose cross-covariance blocks exhibit stable low rank. Using Wilks rank tests together with refinement and internal-rank checks, TSC identifies clusters that behave as if generated by one or more latent variables. The output is a set of disjoint indicator groups and their estimated cross-ranks — a useful foundation for downstream latent-structure algorithms such as FOFC, FTFC, GFFC, and BPC.