# Data subset / resample

The **Data subset / resample** node creates a new dataset by selecting a subset of
variables and rows from an existing dataset, with optional random resampling.
It is a safer, reproducible alternative to using copy/paste and delete
operations in the data table.

```{figure} /_static/images/tetrad-interface/box-by-box/data-subset-editor.png
:name: data-subset-editor
:alt: Data subset / resample editor

Data subset / resample editor.
```

## Inputs and outputs

- **Input:** One rectangular `Data` node (continuous, discrete, or mixed).
- **Output:** A new `Data` node whose:
  - **Columns** are the selected variables, in the order shown in the *Selected variables* list.
  - **Rows** are the rows matching the row-specification and sampling options.

The original dataset is not modified.

---

## Variable selection

The top half of the editor controls which variables (columns) are included in the
output dataset and in what order.

### Available vs. Selected variables

- **Available variables (left):**  
  All variables from the input dataset that are *not* currently selected.
- **Selected variables (right):**  
  Variables that will appear in the output dataset, in the order shown.

Use the buttons between the lists to move variables:

- **`>`** – Move the highlighted variables from *Available* to *Selected*.
- **`<`** – Move the highlighted variables from *Selected* back to *Available*.
- **`>>`** – Move *all* variables from *Available* to *Selected*.
- **`<<`** – Move *all* variables from *Selected* back to *Available*.

If no variables are in the *Selected* list when you click **OK**, the node
defaults to “all variables in original order.”

### Ordering selected variables

The **Move Up** and **Move Down** buttons change the order of variables in the
*Selected variables* list:

- **Move Up** – Move the highlighted variable(s) one position up.
- **Move Down** – Move the highlighted variable(s) one position down.

The final column order in the output dataset exactly matches the order of the
*Selected variables* list.

### Sorting available variables

The **Sort** button underneath *Available variables* alphabetizes the left-hand
list (A–Z) by variable name. This only affects the display order of the
*Available* list; it does **not** change the order of the *Selected* variables
or the columns in the output dataset.

You can freely sort, select, and move variables without affecting the original
dataset.

### Paste… (select variables by name)

The **Paste…** button lets you select variables by pasting their names from an
external source (for example, a text file or script):

1. Click **Paste…**.
2. In the popup text area, paste variable names separated by commas, tabs,
   spaces, or newlines (for example:  
   `X1, X2, X3` or `X1 X2 X3` or one name per line).
3. Click **OK**.

Behavior:

- Any pasted names that exist in the dataset are moved into the
  *Selected variables* list, in the pasted order.
- Variables that were already selected are repositioned to match the pasted
  order.
- Variables not mentioned in the pasted list are left unchanged.
- If some pasted names are not present in the dataset, a small popup shows the
  list of missing names, which you can dismiss.

This is especially useful when you already have a curated list of variables in a
paper, script, or external file.

---

## Rows and sampling

The bottom half of the editor controls which rows are included, and how they are
sampled.

### Row specification

The **Rows** field accepts a comma-separated list of 1-based ranges:

- A single row: `10`
- A range of rows: `20-30`
- A combination: `1-100, 150, 200-250`

Semantics:

- Indices are **1-based** (row `1` is the first row in the dataset).
- Ranges are inclusive (e.g., `20-30` means rows 20 through 30).
- Whitespace around commas and dashes is ignored.

If the field is left **blank**, all rows of the dataset are used as the base
row set.

Invalid specifications (for example, `0-10`, `30-20`, or non-numeric text) will
produce an error dialog and fall back to using all rows.

### Sampling mode

The **Sampling mode** selector determines how rows are used:

- **Use rows as-is**  
  - Uses exactly the rows specified by the *Rows* field, in their original
    order.
  - Ignores the **Sample size** field.
- **Shuffle rows**  
  - Uses the same set of rows, but in random order.
  - The underlying row set is still determined by the *Rows* field.
- **Subsample (without replacement)**  
  - Randomly selects a subset of the specified rows, without replacement.
  - The **Sample size** must be between `1` and the number of available rows.
- **Bootstrap (with replacement)**  
  - Draws rows with replacement from the specified rows.
  - The **Sample size** controls the number of rows in the output dataset.

When a mode requires a sample size, the **Sample size** spinner becomes
editable; otherwise it is greyed out and defaults to the number of selected
rows.

### Seed (reproducibility)

The **Seed** field controls the random number generator used for shuffling,
subsampling, and bootstrapping:

- If the field is left **blank**, a fresh random seed is used each time.
- If you enter an integer (for example `40`), the sampling becomes reproducible:
  running the same node again with the same seed, row spec, and sampling mode
  will produce the same subset.

---

## Typical use cases

- **Create a clean variable subset**  
  Select a subset of variables (possibly in a new order), leave *Rows* blank,
  choose **Use rows as-is**, and click **OK** to get a new dataset with only
  those columns.

- **Extract a contiguous block of rows**  
  Enter `101-200` in *Rows*, leave sampling as **Use rows as-is**, and select
  variables as needed.

- **Draw a bootstrap sample of a subset**  
  Enter a row range (or leave blank for all rows), choose **Bootstrap**, set the
  **Sample size** and **Seed**, and click **OK** to create a reproducible
  bootstrap dataset over the selected variables.

The resulting node can be used anywhere a normal `Data` node can be used
(e.g., as input to search, estimation, or plotting procedures).