Algorithm Selection and Assumptions
After exploring your data (see Data Exploration), the next step in the causal analysis workflow is to choose appropriate discovery methods and clarify the assumptions you are willing to make.
There is no single “best” algorithm for all problems. Different methods rely on different assumptions, emphasize different aspects of the data, and produce different kinds of output. The goal at this stage is to make reasonable starting choices, not final commitments.
What This Page Covers
This page is intended to help you:
Identify the assumptions that matter most for causal discovery
Understand how those assumptions map to families of algorithms
Choose sensible starting methods
Recognize when and how to revise choices using diagnostics and Grid Search
1. Which Assumptions Matter?
Causal discovery uses statistical patterns to infer aspects of causal structure. Because different algorithms depend on different assumptions, it is useful to be explicit—at least provisionally—about what you are assuming.
1.1. Causal Sufficiency
Question: Do you believe that all relevant common causes of your measured variables have been observed?
Yes: You may assume causal sufficiency and search for a DAG or CPDAG.
No or Unsure: You should allow for latent confounders and search for a PAG.
Causal sufficiency is a strong assumption.
If it does not hold, methods that rely on it can yield misleading orientations.
When in doubt, allowing for latent variables is often the more cautious starting point.
1.2. Functional Form and Distribution
Question: Are relationships roughly linear and Gaussian?
Yes: Linear methods (e.g., PC with Fisher-Z, FGES/BIC) are often effective.
No or Unclear: Nonparametric or rank-based methods may be more appropriate, though they can be slower or require larger samples.
Visual inspection during the Data Exploration step is often more informative than formal tests at this stage.
1.3. Modeling Goal
Question: What do you want to learn from the analysis?
Adjacency structure: Which variables are connected.
Partial orientation: Some causal directions where identifiable.
Full orientation or effect estimation: Requires stronger assumptions and more careful interpretation.
Different goals justify different levels of modeling complexity and diagnostic scrutiny.
1.4. Sample Size and Dimensionality
Small sample sizes: Favor simpler models and more conservative assumptions.
Large samples: Support more flexible tests and scores.
Many variables: Regularization and sparsity penalties become increasingly important.
These considerations affect both the stability and interpretability of results.
2. Major Algorithm Families in Tetrad
Tetrad includes several families of causal discovery methods. The purpose here is not to catalog them exhaustively, but to suggest reasonable entry points.
2.1. Constraint-Based Methods
Examples: PC, FCI
Core idea: Use conditional independence tests to remove and orient edges.
These methods are often useful when:
An appropriate independence test can be chosen for the data.
You want models that closely reflect observed conditional independencies.
Typical choices include:
PC when assuming causal sufficiency.
FCI when allowing for latent confounders and selection bias.
Notes:
The significance level (α) controls sparsity.
Smaller α values generally produce sparser graphs.
2.2. Score-Based Methods
Examples: FGES, BOSS, GRaSP
Core idea: Optimize a global score that balances goodness-of-fit and model complexity.
These methods are often useful when:
A global objective is preferred over local independence decisions.
Suitable scores exist for the data type.
Notes:
Scores naturally penalize overly complex graphs.
Results can be sensitive to the choice of penalty strength.
2.3. Hybrid Methods
Examples: BOSS-FCI, GRaSP-FCI, FCIT
Hybrid methods combine ideas from both constraint-based and score-based approaches.
They typically use a score-based procedure to propose or refine adjacencies, followed by
constraint-based reasoning to orient edges and account for conditional independences.
Hybrid approaches can be useful when:
Purely score-based methods are unavailable or perform poorly for the data type
Pure constraint-based methods become unstable due to large conditioning sets
Latent variables or selection effects must be accommodated alongside global scoring
These methods often provide a practical middle ground, leveraging the strengths of both paradigms while mitigating some of their individual weaknesses.
2.4 Time Series Data (Lagged Variables)
If your data consist of repeated measurements over time, Tetrad can handle this by converting the data into a time-lagged representation. Each variable is expanded into multiple lagged copies (for example, (X_t, X_{t-1}, X_{t-2})), and tiered background knowledge is automatically created to prevent future variables from causing past variables.
Once the data are represented in this form, the same considerations discussed on this page apply:
causal sufficiency vs. latent variables,
choice of tests or scores,
constraint-based, score-based, or hybrid methods.
Tetrad provides a data manipulation tool to create time-lagged datasets, and this tool automatically generates the corresponding tiered background knowledge needed for time-lagged causal searches. Algorithm selection then proceeds exactly as for cross-sectional data.
3. Mapping Assumptions to Starting Choices
The table below summarizes common starting points:
Assumptions / Goals |
Suggested Methods |
Notes |
|---|---|---|
Causal sufficiency, linear-Gaussian |
PC, FGES, BOSS, GRaSP |
Standard baseline |
Latents possible, linear-Gaussian |
FCI, BOSS-FCI, GRaSP-FCI |
Produces PAGs |
Nonlinear or non-Gaussian |
PC/FCI with nonparametric tests |
More flexible, often slower |
Mixed data types |
Mixed tests or score-based methods |
Check compatibility carefully |
High-dimensional data |
Penalized scores, hybrid methods |
Regularization is helpful |
This table is meant as guidance rather than prescription.
Grid Search is designed to let you evaluate alternatives directly.
4. Choosing Tests and Scores
For detailed reference information, see the Tests and Scores section of the manual.
4.1. Independence Tests
Fisher-Z: Continuous, approximately Gaussian data.
Rank-based tests: More robust to non-Gaussianity and outliers.
Discrete tests: Fully discrete data.
When unsure, conservative choices combined with diagnostics are often a reasonable starting point.
4.2. Scores
BIC / SEM-BIC: Common choices for continuous data.
Discrete BIC / AIC: Used in discrete or mixed settings.
Score choice affects both sparsity and sensitivity to noise.
5. What If You’re Unsure?
Uncertainty is normal—and expected.
A common and productive strategy is to:
Start with conservative assumptions (e.g., allow latent variables).
Run a simple search.
Use diagnostics (especially the Markov Checker).
Revise assumptions, parameters, or methods.
Repeat.
This search → diagnose → revise cycle is central to the causal analysis workflow.
6. Using Grid Search Effectively
Grid Search allows you to:
Compare multiple algorithms
Sweep over key parameters (e.g., α levels, penalty discounts)
Evaluate models using diagnostics
Identify minimal models that pass Markov checks
Rather than committing to a single method, Grid Search helps clarify which assumptions and settings are supported by the data.
7. Summary
Algorithm selection in Tetrad is less about finding the “right” method upfront and more about:
Making assumptions explicit
Choosing sensible starting points
Evaluating outputs systematically
Refining choices based on evidence
This approach tends to produce results that are more interpretable and defensible.
🧭 Next Step
Proceed to Workflow: Running Searches and Grid Search Tips to learn how to run searches systematically and evaluate them using Grid Search.