# CStaR (Causal Stability Ranking)

**CStaR** (Stekhoven et al., 2012) is a *ranking* method rather than a pure structure-learning algorithm.  
Given a set of **possible causes** and **possible effects**, it repeatedly:

1. Subsamples the data.
2. Learns a **CPDAG** on that subsample.
3. Uses **IDA** on that CPDAG to compute a *minimum total effect* for each candidate cause–effect pair.
4. Records which pairs are among the “top” strongest effects in that subsample.

Over many subsamples, CStaR estimates for each edge \(X \to Y\):

- how **often** \(X\) behaves like a cause of \(Y\) (`π`), and
- how **large** the effect tends to be (`minBeta` / “Effect” column),

and then uses **stability selection** ideas (Meinshausen & Bühlmann, 2010) to bound the expected number of false positives.

It is especially useful when you care about *prioritizing* a small set of robust, high-confidence effects (e.g., candidate causal predictors of a biological or clinical outcome) rather than recovering the entire causal graph.

---

## High-level idea

CStaR scores every candidate cause \(X\) against every effect variable \(Y\) using the following steps:

1. **Subsample the data**
    - Draw a half-sample (with or without replacement, depending on the chosen sampling style).

2. **Learn a CPDAG on the subsample**
    - Use one of several CPDAG-producing algorithms:
        - PC-Stable
        - FGES
        - BOSS
        - Restricted BOSS

3. **Run IDA on the CPDAG**
    - For each candidate effect \(Y\), CStaR runs IDA to compute the **minimum total effect** of each possible cause \(X\) on \(Y\) across all DAGs in the CPDAG equivalence class.
    - This produces an effects matrix for that subsample: one effect size per (cause, effect) pair.

4. **Select the strongest effects in that subsample**
    - For each subsample, CStaR sorts all cause–effect pairs by effect size and identifies a **“top bracket”** of the strongest ones (size = `topBracket` × the number of possible effects).
    - Any pair whose effect lies in that top bracket is regarded as “selected” in that subsample.

5. **Aggregate across subsamples**
    - Over all subsamples:
        - `π` = proportion of subsamples in which \(X \to Y\) falls into the top bracket.
        - `Effect` = average of the minimal total effects from IDA across subsamples.

6. **Rank and filter**
    - Pairs are ranked primarily by `π` (more stable first), then by effect size.
    - Pairs whose effect size falls below the selection threshold (`selectionAlpha`, listed as `selectionMinEffect` in the parameter table) are discarded.
    - A **PCER** (Per-Comparison Error Rate) is reported using the stability-selection bound.
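
Step 3 can be illustrated with a small, self-contained sketch. It assumes linear data that is already centered, and it takes the candidate parent sets of \(X\) (one per orientation the CPDAG allows) as given instead of deriving them from a graph. The helper names `ols_coefficients` and `min_total_effect` are hypothetical and do not belong to any CStaR implementation.

```python
def ols_coefficients(columns, y):
    """Least-squares coefficients of y on the given columns (no intercept).

    Solves the normal equations by Gauss-Jordan elimination; assumes the
    columns are centered and not collinear.
    """
    k = len(columns)
    a = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
         for i in range(k)]
    c = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
                c[r] -= f * c[col]
    return [c[i] / a[i][i] for i in range(k)]


def min_total_effect(data, x, y, candidate_parent_sets):
    """Smallest absolute effect of x on y over the candidate parent sets of x."""
    effects = []
    for parents in candidate_parent_sets:
        if y in parents:
            # If y is a parent of x, x cannot be a cause of y in that DAG.
            effects.append(0.0)
            continue
        columns = [data[x]] + [data[p] for p in parents]
        # Adjusting for the parents of x gives the total effect of x on y.
        effects.append(abs(ols_coefficients(columns, data[y])[0]))
    return min(effects)


# Toy data (hypothetical): y = 2*x + 3*z exactly, and z also drives x.
data = {
    "z": [1.0, -1.0, 1.0, -1.0],
    "x": [2.0, 0.0, 0.0, -2.0],
    "y": [7.0, -3.0, 3.0, -7.0],
}
# The CPDAG is assumed to leave open whether z is a parent of x.
effect = min_total_effect(data, "x", "y", [(), ("z",)])
print(effect)
```

In this toy case the unadjusted coefficient is 3.5 and the coefficient adjusted for `z` is 2, so the minimum over the two candidate parent sets is 2. Taking the minimum is what makes the estimate conservative: a pair ranks highly only if its effect is large under every orientation the CPDAG permits.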

The final output is a ranked **table of candidate causal edges**, with stability and effect-size information, and a simple graph view that keeps the most stable edges.
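
Steps 4 to 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual implementation: the function name `aggregate`, the dictionary layout, and the toy numbers are all hypothetical, every pair is assumed to have an effect estimate in every subsample, and ranking within a subsample is assumed to use absolute effect size.

```python
from collections import defaultdict


def aggregate(effects_per_subsample, top_bracket, num_effects, min_effect=0.0):
    """Turn per-subsample effect tables into ranked (pair, pi, mean effect) rows.

    effects_per_subsample: list of dicts mapping (cause, effect) -> effect size.
    """
    bracket = top_bracket * num_effects  # size of the top bracket
    hits = defaultdict(int)
    totals = defaultdict(float)
    for effects in effects_per_subsample:
        ranked = sorted(effects, key=lambda pair: abs(effects[pair]), reverse=True)
        for pair in ranked[:bracket]:
            hits[pair] += 1  # pair counts as "selected" in this subsample
        for pair, value in effects.items():
            totals[pair] += value
    n = len(effects_per_subsample)
    rows = [(pair, hits[pair] / n, totals[pair] / n) for pair in totals]
    # Drop pairs whose average effect is below the selection threshold.
    rows = [row for row in rows if abs(row[2]) >= min_effect]
    # Most stable first, then larger average effect.
    rows.sort(key=lambda row: (-row[1], -abs(row[2])))
    return rows


# Four hypothetical subsamples, three candidate causes, one target Y.
subsamples = [
    {("A", "Y"): 0.9, ("B", "Y"): 0.5, ("C", "Y"): 0.1},
    {("A", "Y"): 0.8, ("B", "Y"): 0.6, ("C", "Y"): 0.0},
    {("A", "Y"): 0.4, ("B", "Y"): 0.7, ("C", "Y"): 0.1},
    {("A", "Y"): 0.9, ("B", "Y"): 0.2, ("C", "Y"): 0.0},
]
rows = aggregate(subsamples, top_bracket=1, num_effects=1, min_effect=0.1)
for pair, pi, mean_effect in rows:
    print(pair, pi, round(mean_effect, 3))
```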

---

## Inputs

CStaR requires:

- A **continuous data set** (or at least, data for which the chosen score and test are appropriate).
- A set of **possible causes** (predictor variables).
- A set of **possible effects** (outcome variables) — often one or a small number of “targets” of interest.
- Choices for:
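    - **CPDAG algorithm** (`cstarCpdagAlgorithm`: PC-Stable, FGES, BOSS, or Restricted BOSS)
    - **Sampling style** (`sampleStyle`: bootstrap or subsample)
    - **Number of subsamples** (`numSubsamples`)
    - **Top bracket size** (`topBracket`, the `q` of stability selection)
    - **Selection threshold** (`selectionAlpha`, listed as `selectionMinEffect` in the parameter table)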

Background knowledge about forbidden/required edges is **not** currently used; CStaR relies purely on the chosen CPDAG algorithm.
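
As an illustration of the sampling-style choice listed above, the sketch below draws row indices for one half-sized resample either with replacement (bootstrap) or without (subsample). The function name `draw_half_sample` and the style strings `"bootstrap"` and `"subsample"` are assumptions made for this example.

```python
import random


def draw_half_sample(n_rows, style, rng):
    """Return row indices for one half-sized resample of an n_rows data set."""
    half = n_rows // 2
    if style == "bootstrap":
        # With replacement: the same row may be drawn more than once.
        return [rng.randrange(n_rows) for _ in range(half)]
    if style == "subsample":
        # Without replacement: every drawn row is distinct.
        return rng.sample(range(n_rows), half)
    raise ValueError(f"unknown sampling style: {style!r}")


rng = random.Random(0)  # fixed seed so the draws are reproducible
boot = draw_half_sample(10, "bootstrap", rng)
sub = draw_half_sample(10, "subsample", rng)
print(boot, sub)
```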

---

## Outputs

CStaR produces:

1. **A ranked table of records**

   Each row corresponds to a candidate edge \(X \to Y\) and includes:

    - `Cause` – the candidate predictor \(X\).
    - `Effect` – the target \(Y\).
    - `PI` – the stability frequency \( \hat{\pi} \) (fraction of subsamples where \(X \to Y\) lies in the top bracket).
    - `Effect` (numeric) – the average minimal IDA effect for \(X \to Y\) across subsamples; not to be confused with the target column of the same name above.
    - `PCER` – an estimated *per-comparison error rate* bound based on Meinshausen–Bühlmann stability selection; for edges with low stability (π ≤ 0.5), `PCER` is replaced by `*` to flag them as below the reliable range.
    - `#Potential causes` and `#Potential effects` – the sizes of the candidate sets used to compute the table.

2. **A graph view (optional)**

   CStaR can be used to construct a graph where:

    - Nodes are the variables that appear in the records.
    - A directed edge \(X \to Y\) is drawn when `π > 0.5`.

   This graph highlights **highly stable candidate causal relations** but is *not* meant as a full causal discovery result; it is a visualization of the top-ranked edges.

3. **Optional intermediate files**

   For reproducibility and resumability, CStaR can write:

    - The subsampled data sets,
    - The CPDAGs fitted on each subsample, and
    - The matrices of IDA effects per subsample.

   If rerun with the same output directory, CStaR will reload existing intermediate results instead of recomputing them.
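
The reload-instead-of-recompute behavior can be pictured with a generic caching helper. The sketch below is not CStaR's file layout: the JSON format, the `subsample_0/effects.json` path, and the names `load_or_compute` and `fit_effects` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the cached result at `path`, computing and saving it if absent."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result


calls = []


def fit_effects():
    """Stand-in for the expensive CPDAG + IDA step on one subsample."""
    calls.append(1)
    return [[0.9, 0.1], [0.0, 0.4]]


with tempfile.TemporaryDirectory() as out_dir:
    target = Path(out_dir) / "subsample_0" / "effects.json"
    first = load_or_compute(target, fit_effects)   # computes and writes
    second = load_or_compute(target, fit_effects)  # reloads from disk

print(first == second, len(calls))
```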

---

## Parameters

| Parameter (camelCase)        | Description |
|------------------------------|-------------|
| `selectionMinEffect`         | Non-negative double. Minimum absolute effect size a cause–effect pair must reach to be kept in the final table; pairs below it are discarded. Smaller values make selection more permissive; larger values make it more conservative. |
| `numSubsamples`              | Integer ≥ 1. Number of subsamples (bootstrap or subsample splits) to use for stability scoring. Higher values give more stable results but increase computation. Typical range: 20–200. |
| `targets`                    | List of variable names. Restricts the possible effects to the specified target variables. If empty, CStaR analyzes all variables. |
| `topBracket`                 | Integer ≥ 1. Sets the size of the top bracket (`q`): in each subsample, the `topBracket × #effects` strongest cause–effect pairs count as selected. Larger values make selection more permissive. |
| `parallelized`               | Boolean. If `true`, processes subsamples in parallel across multiple threads. Strongly recommended for large datasets. |
| `cstarCpdagAlgorithm`        | String. The CPDAG algorithm run on each subsample before IDA is applied (PC-Stable, FGES, BOSS, or Restricted BOSS). |
| `fileOutPath`                | String path. If non-empty, intermediate results (subsampled data sets, per-subsample CPDAGs, and IDA effect matrices) are written to this location and reloaded on a rerun. Useful for large studies or reproducibility. |
| `removeEffectNodes`          | Boolean. If `true`, nodes that never meet the minimum effect threshold across subsamples are excluded before final aggregation. |
| `sampleStyle`                | String. Controls how each half-sample is drawn: bootstrap (with replacement) or subsample (without replacement). Affects stability and runtime. |
| `verbose`                    | Boolean. If `true`, prints detailed progress information during subsampling, scoring, and aggregation. |

---

## Interpreting the table

For a given row \(X \to Y\):

- **PI (stability)**
    - Close to 1.0: \(X \to Y\) consistently appears among the strongest effects across subsamples.
    - Around 0.5: borderline; may be interesting but less robust.
    - Close to 0: rarely selected; often noise.

- **Effect (average minimal effect)**
    - Positive values indicate a consistent positive causal effect estimate.
    - Larger magnitude suggests a stronger effect, but interpretation depends on scale and model assumptions.

- **PCER**
    - Gives an upper bound on the expected per-comparison error rate for that edge, given the overall selection procedure.
    - Edges with `*` (π ≤ 0.5) are not in the reliable regime of the bound and should be treated cautiously.
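
For reference, the stability-selection bound of Meinshausen & Bühlmann (2010) limits the expected number of false positives by \(q^2 / ((2\pi - 1)\,p)\), where \(q\) is the number of items selected per subsample and \(p\) is the total number of candidates; dividing by \(p\) once more gives a per-comparison rate. The sketch below encodes that general form. The function name `pcer_bound` is hypothetical, and exactly which counts CStaR plugs in for `q` and `p` is an assumption here.

```python
def pcer_bound(pi, q, p):
    """Stability-selection bound on the per-comparison error rate.

    pi: selection frequency of the pair,
    q:  number of pairs selected per subsample (top bracket size),
    p:  total number of candidate comparisons.
    Returns None when pi <= 0.5, where the bound does not apply.
    """
    if pi <= 0.5:
        return None  # corresponds to the `*` shown in the table
    expected_false_positives = q * q / ((2.0 * pi - 1.0) * p)
    return expected_false_positives / p


print(pcer_bound(0.75, q=2, p=20))
print(pcer_bound(0.5, q=2, p=20))
```

The denominator \(2\pi - 1\) is zero or negative at \(\pi \le 0.5\), which is why no finite bound is reported for those rows.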

A typical use is to pick a **PI threshold** (e.g. `π ≥ 0.8`) and an **effect threshold** (e.g. `Effect ≥ 0.1`), and then focus on that shortlist as **candidate causal predictors** for follow-up analysis or experiments.
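
A minimal sketch of that shortlist step is shown below. The `Record` type, its field names, and the example rows are hypothetical, and the effect threshold is applied to the absolute value so that strong negative effects are kept, which is an assumption of this example.

```python
from typing import NamedTuple


class Record(NamedTuple):
    cause: str
    effect: str
    pi: float          # stability frequency
    avg_effect: float  # average minimal IDA effect


def shortlist(records, min_pi=0.8, min_effect=0.1):
    """Keep records that are both stable and large enough in magnitude."""
    return [r for r in records
            if r.pi >= min_pi and abs(r.avg_effect) >= min_effect]


records = [
    Record("A", "Y", 0.95, 0.42),
    Record("B", "Y", 0.85, -0.30),
    Record("C", "Y", 0.90, 0.02),  # stable but negligible effect
    Record("D", "Y", 0.55, 0.60),  # large but unstable
]
candidates = shortlist(records)
print([r.cause for r in candidates])
```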

---

## When to use CStaR

CStaR is most useful when:

- You have **many potential predictors** and a smaller number of key outcomes, and you want a **prioritized list** of robust causal candidates.
- You are worried about **model-selection instability**: different subsamples might suggest different graphs, and you want edges that “survive” this variability.
- You care about **controlling false positives** in a stability-selection sense, rather than recovering a single “best” graph.

It pairs naturally with workflows where:

- The **full causal graph** is complex or high-dimensional, but
- You mainly need a **short, interpretable list** of predictors that are repeatedly supported by the data across resamples and CPDAG variations.

## References

Stekhoven, D. J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M. H., & Bühlmann, P. (2012).  
**Causal stability ranking.** *Bioinformatics*, 28(21), 2819–2823.

Meinshausen, N., & Bühlmann, P. (2010).  
**Stability selection.** *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 72(4), 417–473.

Colombo, D., & Maathuis, M. H. (2014).  
**Order-independent constraint-based causal structure learning.** *Journal of Machine Learning Research*, 15(1), 3741–3782.

## Summary

- CStaR is a stability-based causal ranking method that repeatedly subsamples the data, fits a CPDAG, and applies IDA to estimate minimal total effects for each candidate cause–effect pair. It aggregates these results using stability selection, producing a ranked list of robust causal candidates with interpretable stability frequencies and effect sizes.

- CStaR is ideal when the goal is prioritizing reliable causal predictors rather than recovering a full graph, especially in high-dimensional settings where model-selection variability is high. It supports multiple CPDAG learners (PC-Stable, FGES, BOSS, Restricted BOSS), parallelization, and reproducible output, but does not currently incorporate background-knowledge constraints.
