# BPC — Build Pure Clusters

**Type:** Latent cluster discovery (measurement model)  
**Output:** Disjoint clusters of observed variables interpreted as indicators of latent factors  
**Reference:** Silva, Scheines, Glymour, Spirtes (JMLR, 2006)

BPC (Build Pure Clusters) is a **tetrad-based latent clustering** procedure inspired by the JMLR paper by Silva et al. It searches for groups of observed variables that:

1. Are mutually dependent (correlated enough to plausibly share a latent cause), and
2. Satisfy **tetrad constraints** consistent with a single latent factor,
3. Have **at least three indicators** (the paper’s rule to avoid fragile latents).

The implementation in Tetrad follows the *spirit* of the original paper but uses a practical, deterministic set of rules for growing, merging, and purifying clusters.

---

## Basic Assumptions

BPC is intended for:

- **Continuous, approximately Gaussian indicators**
- **Linear latent factor models**
- **Pure clusters**: each observed variable ideally loads on a single latent (no cross-loadings)

It operates on a **covariance or correlation matrix**, not on raw data, and uses Wilks-based rank tests for tetrads plus Fisher-Z tests for pairwise dependence.

---

## High-Level Algorithm

BPC proceeds in several stages:

### 1. Build a dependence pattern

- Compute a **pairwise dependence mask** `canLink[i][j]` using Fisher-Z tests on correlations with a relatively **loose alpha**:
    - `canLink[i][j] = true` if variables i and j are significantly dependent.
- This acts as a **screen**: tetrad checks only consider variables that are pairwise dependent.

Result: a graph-like pattern of variables that can plausibly belong to the same latent cluster.

---

### 2. Enumerate tetrad seeds and grow local groups

- For every 4-tuple of variables that passes the dependence screen:
    - Test whether all 3 tetrads in that quartet correspond to **rank 1** (latent-factor compatible).
    - If so, that 4-tuple is a **pure seed**.
- For each pure seed, BPC **grows a locally maximal pure group**:
    - Repeatedly try adding any other variable x:
        - If the enlarged set still passes **all tetrad tests**, keep x.
    - Stop when no more variables can be added without breaking purity.

Important: at this stage **variables are not marked as “used”**. Different seeds can grow into overlapping candidate groups.

---

### 3. Global merging of compatible groups

From all locally grown candidate groups:

1. **Merge rule:**  
   Repeatedly consider pairs of groups A and B:
    - Let U = A ∪ B.
    - If U is still tetrad-pure, and its **average absolute correlation** does not drop too much compared to A and B (by more than a small threshold), then merge:
        - Replace A and B by U.
2. This merging step is applied iteratively until **no more compatible merges** are found.

This stage encourages **larger, more coherent clusters** when a single latent can plausibly explain them.

---

### 4. Resolve overlaps (variable assignment)

After merging, some variables may still appear in **multiple groups**. BPC resolves these overlaps globally:

- For each variable that belongs to more than one group:
    - Compute its **average absolute correlation** with the other variables in each group.
    - Optionally also look at how many tetrads involving that variable pass within the group.
    - Assign the variable to the **best-fitting group**:
        - Highest compatibility score (average absolute correlation plus tie-breaking by tetrads).
    - Remove it from all other groups.

At the end of this step, the groups become **disjoint**: each variable is assigned to at most one cluster.

---

### 5. Filter and finalize clusters

Finally, BPC enforces two conditions:

1. **Cluster size:**
    - Drop any cluster with **fewer than 3 indicators** (as in the JMLR paper: latents with fewer than 3 children are discarded).
2. **Purity check:**
    - Make sure each remaining group:
        - Has at least 3 variables, and
        - Either is too small to form tetrads (size 3), or
        - Still passes the tetrad purity checks if size ≥ 4.

The output is a list of **disjoint pure clusters**, each intended to represent one latent variable.

---

## Output and Interpretation

- The algorithm returns a list of clusters: each cluster is a set of observed variables believed to share a **single latent parent**.
- Clusters are **disjoint**: no observed variable belongs to more than one latent factor in the BPC solution.
- These clusters can be used to:
    - Build a **measurement model** (one latent per cluster), then
    - Run a structural discovery algorithm (e.g., PC, GFCI, BOSS-FCI) on the latents.

---

## Parameters in Tetrad

| Parameter | Description |
|-----------|-------------|
| `alpha` | Significance level for tetrad-based rank tests (purity checks). Smaller values demand stronger evidence for a latent structure. |
| `ess` | Effective sample size used in statistical tests; `-1` means use the actual sample size. |
| `verbose` | If true, logs details: seeds found, merges, overlap resolutions, and tetrad statistics. |

Some internal “knobs” are fixed to practical defaults in the current implementation (not exposed as GUI parameters), such as:

- `alphaPairs`: looser alpha for pairwise dependence screening,
- `deltaMerge`: small allowable drop in average absolute correlation when merging two groups.

---

## Strengths

- Follows the **Silva et al. (2006)** spirit:
    - Find **pure measurement clusters** using tetrads.
    - Apply **global purification** instead of greedy, once-through clustering.
- Allows flexible resolution of overlaps and merges, which can help in noisy or borderline cases.
- Works entirely from the covariance structure, without fitting full SEMs.

---

## Limitations

- Requires reasonably large sample sizes for reliable tetrad tests.
- Assumes approximately **Gaussian, linear, and pure** measurement relations.
- Will discard latent candidates with fewer than 3 indicators, even if they exist in the data.
- Not designed to model **cross-loadings**; those variables will be forced into a single cluster or may cause that cluster to be rejected.

---

## Reference

- Silva, R., Scheines, R., Glymour, C., & Spirtes, P. (2006). **Learning the structure of linear latent variable models.** *Journal of Machine Learning Research*, 7, 191–246.

## Summary

BPC builds latent clusters by:

1. Screening for pairwise dependence,
2. Finding tetrad-pure seeds,
3. Growing them to locally maximal groups,
4. Globally merging compatible groups,
5. Resolving overlaps by assigning each variable to its best-fitting group, and
6. Dropping any small or impure groups.

The result is a set of **pure, disjoint indicator clusters** that can be interpreted as latent factors and used as input to further causal structure learning.