50. TSC — Trek Separation Clusters

Category: Latent clusters / measurement structure
Code: edu.cmu.tetrad.search.Tsc

TSC (“Trek Separation Clusters”) is a rank-based latent cluster finder. Given a covariance (or correlation) matrix over observed variables, it searches for disjoint groups of indicators that behave as if they share one or more latent parents.

The implementation in Tetrad is a NOLAC version (“no overlapping clusters”): each observed variable is assigned to at most one cluster.


50.1. Intended use

Use TSC when:

  • You believe your data come (approximately) from a linear-Gaussian latent variable model with pure indicators.

  • Each observed variable should load on one latent at most (no overlapping measurement clusters).

  • You want an automatic partition of variables into clusters that can then be fed into higher-level latent-structure algorithms (e.g., FOFC/FTFC/GFFC/BPC or manual SEM modeling).

TSC works directly on a covariance matrix (internally converted to a correlation matrix) and is therefore agnostic to the original sample size and data format, provided you give a correct effective sample size.


50.2. Model assumptions (NOLAC version)

TSC is designed to be (asymptotically) sound under the following stylized assumptions:

  1. Linear-Gaussian SEM with a latent DAG

    • Latent variables form a DAG.

    • Observed indicators are linear functions of their latent parents plus independent Gaussian noise.

  2. Pure measurement within clusters

    • Each observed variable has exactly one latent parent (within the latent structure you care about).

    • Unique errors of indicators belonging to different clusters are independent.

  3. No overlapping clusters (NOLAC)

    • Each observed variable belongs to at most one true cluster.

  4. Generic parameters

    • There are no exact parameter cancellations that artificially change matrix ranks.

  5. Consistent rank test

    • Cross-block matrix rank is tested using Wilks’ likelihood-ratio test (or a similar rank test), with a significance level alpha that decreases with n (for example, alpha = 1 / log n) or with an information-criterion cutoff.

Under these assumptions, the algorithm aims to recover, asymptotically, the true indicator clusters and their “boundary ranks” (typically 1 for a single latent).


50.3. High-level algorithm sketch

Let V be the index set of all observed variables, and let Sigma be the correlation matrix. For a candidate cluster C V, define D = V \ C.
TSC examines the rank of the cross-covariance block Sigma[C, D].

Very roughly:

  1. Base clusters by rank

    For each rank r = 0, …, rMax:

    • Let the base size be |C| = r + 1.

    • Enumerate all subsets C of that size.

    • Keep those where

      rank(Sigma[C, D]) = r

      according to a Wilks rank test with parameters (nEff, alpha).

  2. Seed and grow clusters (union / extension)

    • Starting from unused base subsets at rank r, pick a seed cluster.

    • Try to union it with other overlapping base subsets at the same rank.

    • Accept a union C' = C C_candidate when:

      • rank(Sigma[C', V \ C']) is still r, and

      • |C'| >= r + 1 + minRedundancy.

    • This iteratively produces maximal rank-r clusters that do not overlap previously committed variables.

  3. Bifactor-style augmentation

    • For each new cluster C1, look for a nearby variable that, when added, causes a one-step rank drop:

      • rank(Sigma[C1, V \ C1]) = r

      • rank(Sigma[C2, V \ C2]) = r - 1, where C2 = C1 {x} and
        |C2| = |C1| + 1.

    • This step is designed to capture bifactor-like or multi-dimensional boundaries in which adding a “spanning” indicator reduces cross-rank by 1.

    • Candidate augmentations that create internal rank-0 splits are rejected (see step 5).

  4. Conditional-rank refinement (Rule 3)

    For each candidate cluster C of intended rank r_C:

    • Let D = V \ C.

    • Search for subsets Z C with |Z| >= r_C such that

      rank(C \ Z, D | Z) = 0.

    • If such a Z is found, it is treated as an observed bottleneck or mediator; the variables in Z are removed from C.

    • Repeat until no such Z remains or the cluster becomes too small.

    Intuition: in a purely observed DAG (no latents), a mediator can d-separate C \ Z from D when conditioned on. This pattern should not persist for a genuine latent cluster under the measurement assumptions above.

  5. Internal rank-0 check

    Finally, for each refined cluster C:

    • Examine nontrivial splits C = C1 C2 and test

      rank(C1, C2) = 0.

    • If a cluster hides such a rank-0 partition inconsistent with its assigned rank, it is rejected.

  6. Output

    • The method findClusters() returns a map

      Map<Set<Integer>, Integer>  // cluster -> rank
      

      where each Set<Integer> indexes a disjoint indicator cluster and the associated Integer is the estimated cross-rank of that cluster.


50.4. Inputs and outputs

50.4.1. Inputs

TSC is constructed from:

  • List<Node> variables

  • CovarianceMatrix cov

Internally, the covariance is converted to a correlation matrix and the sample size n is read from cov.getSampleSize().

50.4.2. Output

  • Map<Set<Integer>, Integer> clusterToRank

Each entry:

  • Key: a cluster (set of integer indices into the original variable list).

  • Value: the estimated cross-rank at the boundary with the rest of the variables. For single-latent clusters this is typically 1.

In the GUI and wrapper algorithms, these indices are converted back to variable names and used to define latent-indicator blocks.


50.5. Key parameters

These are set on the Tsc object before calling findClusters():

Parameter

Method

Meaning

alpha

setAlpha(double)

Significance level for Wilks rank tests. Lower values make rank increases/decreases harder to accept.

nEff

setEffectiveSampleSize(int)

Effective sample size. Use -1 (default) to fall back to cov.getSampleSize(). Important if your covariance comes from a transformed or weighted sample.

rMax

setRmax(int)

Maximum rank to consider. The search runs ranks 0 through rMax. Defaults to 3.

minRedundancy

setMinRedundancy(int)

Minimum extra indicators beyond r + 1 required for a cluster to be accepted. Clusters must satisfy `

verbose

setVerbose(boolean)

If true, logs detailed progress and rank decisions via TetradLogger.


50.6. Practical guidance

  • Alpha scheduling. For large samples, consider an n-dependent alpha (for example alpha = 1 / log n) or an information-criterion-based cutoff to control false rank changes.

  • Redundancy. Set minRedundancy >= 1 to avoid unstable clusters of size exactly r + 1, which cannot be cross-checked internally.

  • Effective sample size. If your covariance comes from:

    • weighted data,

    • time-series with autocorrelation, or

    • a pre-averaged dataset,

    adjust nEff accordingly to avoid overly optimistic rank tests.

  • NOLAC limitation. This implementation enforces disjoint clusters. If your theory requires indicators to belong to multiple factors, TSC in its current form is not appropriate.


50.7. Limitations

  • Designed for linear-Gaussian models; non-Gaussian data can still work but the Wilks-based rank test may be less well calibrated.

  • Assumes pure measurement and no overlapping clusters. Violation of these assumptions can fragment or merge clusters in undesirable ways.

  • Sensitive to sampling noise when:

    • sample sizes are small, or

    • true cross-ranks are close to the decision boundary.

In such cases, adjust alpha, nEff, and minRedundancy, and inspfect clusters visually or via domain knowledge.


50.9. Summary

  • TSC is a rank-based latent cluster finder designed for linear-Gaussian models with pure indicators.

  • It partitions observed variables into non-overlapping latent clusters by searching for subsets whose cross-covariance blocks exhibit stable low rank. Using Wilks rank tests together with refinement and internal-rank checks, TSC identifies clusters that behave as if generated by one or more latent variables. The output is a set of disjoint indicator groups and their estimated cross-ranks — a useful foundation for downstream latent-structure algorithms such as FOFC, FTFC, GFFC, and BPC.