
# Reference conditioning

> Generate expression data conditioned on a real reference sample.

## Overview

Reference conditioning models take a real expression sample as input and generate new expression data anchored to it. The output stays close to the existing profile while reflecting any perturbations or modifications you specify.

This is useful when you want to:

* Simulate the effect of a perturbation on a specific sample
* Generate expression profiles that preserve the biological and technical characteristics of a reference
* Create synthetic treated versus control pairs

## Available models

* **`gem-1-bulk_reference-conditioning`**: Bulk RNA-seq reference conditioning model
* **`gem-1-sc_reference-conditioning`**: Single-cell RNA-seq reference conditioning model

<Note>
  These endpoints may require 1 to 2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.
</Note>

```python theme={null}
import pysynthbio
```

## How it works

Reference conditioning encodes the biological and technical characteristics from a real expression sample, then generates new expression data that:

1. Preserves the biological and technical latent space of the reference
2. Applies any perturbation metadata you specify
3. Returns synthetic expression that reflects the perturbation effect on that specific sample

## Creating a query

Reference conditioning queries take different inputs than baseline model queries. Retrieve an example query to see the expected structure:

```python theme={null}
example_query = pysynthbio.get_example_query(model_id="gem-1-bulk_reference-conditioning")["example_query"]
print(example_query)
```

The query structure includes:

1. **`inputs`**: A list where each input contains:
   * **`counts`**: The reference expression counts
   * **`metadata`**: Perturbation-only metadata
   * **`num_samples`**: How many samples to generate
2. **`conditioning`**: Which latent spaces to condition on, typically `["biological", "technical"]`
3. **`sampling_strategy`**: `"mean estimation"` or `"sample generation"`
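
For orientation, the fields listed above can also be assembled by hand. This is only a sketch: `build_query` is a hypothetical helper, not part of `pysynthbio`, and the dictionary returned by `get_example_query()` remains the authoritative template.

```python
def build_query(counts, metadata, num_samples=1,
                conditioning=("biological", "technical"),
                sampling_strategy="mean estimation"):
    """Assemble a query dict with the fields listed above (hypothetical helper)."""
    return {
        "inputs": [
            {
                "counts": list(counts),        # reference expression counts
                "metadata": dict(metadata),    # perturbation-only metadata
                "num_samples": num_samples,    # how many samples to generate
            }
        ],
        "conditioning": list(conditioning),
        "sampling_strategy": sampling_strategy,
    }

# Toy counts for illustration only; a real vector must match the model's gene list.
manual_query = build_query([0, 12, 7], {"perturbation_type": "control"}, num_samples=2)
print(manual_query["conditioning"])
```

Starting from the example query and overwriting individual fields, as the examples on this page do, is usually safer than building the dictionary from scratch.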

### Perturbation-only metadata

Unlike baseline models, reference conditioning queries only accept perturbation metadata fields:

* `perturbation_ontology_id`
* `perturbation_type`
* `perturbation_time`
* `perturbation_dose`

All other biological and technical metadata is inferred from the reference expression.
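
A quick local check can catch stray fields before a query is sent. The sketch below assumes the four fields listed above are the complete set; `unexpected_metadata_fields` is a hypothetical helper, not part of `pysynthbio`, and `tissue_ontology_id` is used only as an example of a non-perturbation key.

```python
ALLOWED_PERTURBATION_FIELDS = {
    "perturbation_ontology_id",
    "perturbation_type",
    "perturbation_time",
    "perturbation_dose",
}

def unexpected_metadata_fields(metadata):
    """Return the sorted list of keys that are not perturbation fields."""
    return sorted(set(metadata) - ALLOWED_PERTURBATION_FIELDS)

extra = unexpected_metadata_fields({
    "perturbation_type": "compound",
    "tissue_ontology_id": "UBERON:0002107",  # example non-perturbation key
})
print(extra)  # ['tissue_ontology_id']
```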

## Example: simulating a drug treatment

The following example simulates the effect of a drug treatment on a reference sample:

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk_reference-conditioning")["example_query"]

# Replace with your actual reference counts.
# The counts list must match the model's expected gene order and length.
query["inputs"][0]["counts"] = your_reference_counts

# Specify the perturbation
query["inputs"][0]["metadata"] = {
    "perturbation_ontology_id": "CHEMBL25",  # Aspirin (ChEMBL ID)
    "perturbation_type": "compound",
    "perturbation_time": "24 hours",
    "perturbation_dose": "10 um",
}

query["inputs"][0]["num_samples"] = 3
query["sampling_strategy"] = "mean estimation"

result = pysynthbio.predict_query(query, model_id="gem-1-bulk_reference-conditioning")
```

## Example: CRISPR knockout simulation

Simulate the effect of knocking out a specific gene:

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk_reference-conditioning")["example_query"]

# Your reference sample counts
query["inputs"][0]["counts"] = control_sample_counts

# CRISPR knockout of TP53
query["inputs"][0]["metadata"] = {
    "perturbation_ontology_id": "ENSG00000141510",  # TP53 Ensembl ID
    "perturbation_type": "crispr",
}

query["inputs"][0]["num_samples"] = 5

result = pysynthbio.predict_query(query, model_id="gem-1-bulk_reference-conditioning")
```

## Query parameters

### `conditioning` (list, optional)

Controls which latent spaces are conditioned on the reference. Default is `["biological", "technical"]`.

When both are conditioned, the model preserves both biological identity and technical characteristics from the reference sample.

### `sampling_strategy` (str, required)

Controls the type of prediction:

* **`"sample generation"`**: Generates realistic-looking synthetic data with measurement error. **Bulk only**
* **`"mean estimation"`**: Provides stable mean estimates. **Bulk and single-cell**

```python theme={null}
query["sampling_strategy"] = "mean estimation"
```

### `fixed_total_count` (bool, optional)

Controls whether to preserve the reference's library size:

* **`False`** (default): The output's total count is taken from the sum of the reference expression counts
* **`True`**: Ignores the reference's library size and uses the `total_count` value (or its default) instead

```python theme={null}
# Preserve reference library size (default)
query["fixed_total_count"] = False

# Or force a specific library size
query["fixed_total_count"] = True
query["total_count"] = 10_000_000
```

### `total_count` (int, optional)

Library size used when converting predicted log CPM back to raw counts. Only takes effect when `fixed_total_count` is `True`.

* Default: 10,000,000 for bulk; 10,000 for single-cell
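
As a worked illustration of this conversion, the sketch below inverts a log transform and rescales CPM to a chosen library size. It assumes the log CPM values are `log(CPM + 1)` with a natural logarithm, which this page does not state. `log_cpm_to_counts` is a hypothetical helper, not part of `pysynthbio`; the API performs this step for you.

```python
import numpy as np

def log_cpm_to_counts(log_cpm, total_count):
    """Invert an assumed log(CPM + 1) transform and scale to a library size."""
    cpm = np.expm1(np.asarray(log_cpm, dtype=float))  # back to counts per million
    counts = cpm * total_count / 1e6                  # rescale to the library size
    return np.rint(counts).astype(int)

# Three toy genes whose CPM values sum to one million.
toy_log_cpm = np.log1p([500_000, 300_000, 200_000])
toy_counts = log_cpm_to_counts(toy_log_cpm, total_count=10_000)
print(toy_counts, toy_counts.sum())
```

A larger `total_count` spreads the same relative expression over more reads, so low-abundance genes are less likely to round to zero.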

### `deterministic_latents` (bool, optional)

If `True`, the model uses the mean of each latent distribution instead of sampling. This produces deterministic, reproducible outputs.

* Default: `False`

```python theme={null}
query["deterministic_latents"] = True
```

### `seed` (int, optional)

Random seed. Set it to make sampled outputs reproducible across runs.

```python theme={null}
query["seed"] = 42
```

## Valid perturbation metadata

| Field                      | Description / format                                                                                                                                                  |
| -------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `perturbation_ontology_id` | Ensembl gene ID, [ChEBI ID](https://www.ebi.ac.uk/chebi/), [ChEMBL ID](https://www.ebi.ac.uk/chembl/), or [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy)   |
| `perturbation_type`        | One of `"coculture"`, `"compound"`, `"control"`, `"crispr"`, `"genetic"`, `"infection"`, `"other"`, `"overexpression"`, `"peptide or biologic"`, `"shrna"`, `"sirna"` |
| `perturbation_time`        | Time since perturbation as a number and unit separated by a space, such as `"24 hours"`                                                                               |
| `perturbation_dose`        | Dose as a number and unit separated by a space, such as `"10 um"` or `"1 mg/kg"`                                                                                      |
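
The formats in the table can be checked locally before submitting a query. This is a minimal sketch: the regular expression is one reading of "a number and unit separated by a space", so the API may accept or reject values differently, and `metadata_problems` is a hypothetical helper, not part of `pysynthbio`.

```python
import re

PERTURBATION_TYPES = {
    "coculture", "compound", "control", "crispr", "genetic", "infection",
    "other", "overexpression", "peptide or biologic", "shrna", "sirna",
}
# A number, one space, then a unit, such as "24 hours" or "1 mg/kg".
NUMBER_UNIT = re.compile(r"^\d+(\.\d+)? \S+$")

def metadata_problems(metadata):
    """Return human-readable problems found in a perturbation metadata dict."""
    problems = []
    ptype = metadata.get("perturbation_type")
    if ptype is not None and ptype not in PERTURBATION_TYPES:
        problems.append(f"unknown perturbation_type: {ptype!r}")
    for field in ("perturbation_time", "perturbation_dose"):
        value = metadata.get(field)
        if value is not None and not NUMBER_UNIT.match(value):
            problems.append(f"{field} should look like '<number> <unit>': {value!r}")
    return problems

print(metadata_problems({"perturbation_type": "compound", "perturbation_time": "24h"}))
```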

## Working with results

The result has a structure similar to that returned by baseline models:

```python theme={null}
metadata = result["metadata"]
expression = result["expression"]

print(expression.shape)
print(metadata.head())
```

### Differential expression

When conditioning on both biological and technical latents, you can directly compare the generated expression to your reference to identify perturbation effects:

```python theme={null}
import numpy as np

# Your reference (input) counts, converted to an array so division is element-wise
reference_counts = np.asarray(your_reference_counts, dtype=float)
reference_cpm = reference_counts / reference_counts.sum() * 1e6

# Generated (perturbed) counts for the first synthetic sample
generated_counts = expression.iloc[0].to_numpy(dtype=float)
generated_cpm = generated_counts / generated_counts.sum() * 1e6

# Log2 fold change, with a pseudocount of 1 to avoid log(0)
log2fc = np.log2(generated_cpm + 1) - np.log2(reference_cpm + 1)

# Top 20 upregulated genes, largest fold change first
gene_names = expression.columns
top_indices = np.argsort(log2fc)[-20:][::-1]
print("Top upregulated genes:", gene_names[top_indices].tolist())
```
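
When `num_samples` is greater than 1, averaging the fold change across all generated samples gives a steadier estimate than using a single row. The sketch below assumes `expression` is a pandas DataFrame with one row per generated sample and one column per gene, as the snippet above implies; `mean_log2fc` is a hypothetical helper and the toy data is for illustration only.

```python
import numpy as np
import pandas as pd

def mean_log2fc(reference_counts, expression, pseudocount=1.0):
    """Average per-gene log2 fold change of generated samples versus the reference."""
    reference = np.asarray(reference_counts, dtype=float)
    reference_cpm = reference / reference.sum() * 1e6
    generated = expression.to_numpy(dtype=float)
    # Normalise each generated sample (row) to counts per million.
    generated_cpm = generated / generated.sum(axis=1, keepdims=True) * 1e6
    log2fc = np.log2(generated_cpm + pseudocount) - np.log2(reference_cpm + pseudocount)
    return pd.Series(log2fc.mean(axis=0), index=expression.columns)

toy_reference = [100, 100, 800]
toy_expression = pd.DataFrame(
    [[200, 100, 700], [300, 100, 600]],
    columns=["GENE_A", "GENE_B", "GENE_C"],
)
toy_lfc = mean_log2fc(toy_reference, toy_expression)
print(toy_lfc.sort_values(ascending=False))
```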

## Important notes

### Counts vector length

The reference counts vector must match the model's expected number of genes. If the length does not match, the API returns a validation error.

Use `get_example_query()` to see the expected structure and ensure your counts vector has the correct length.
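
A length check along these lines can run before the query is sent. It assumes the `counts` list in the example query has the full expected length; `check_counts_length` is a hypothetical helper, and the small dictionary below stands in for the real output of `get_example_query()`.

```python
def check_counts_length(counts, example_query):
    """Raise ValueError if counts does not match the example query's length."""
    expected = len(example_query["inputs"][0]["counts"])
    if len(counts) != expected:
        raise ValueError(f"counts has {len(counts)} values, expected {expected}")
    return expected

# Stand-in for the real example query, which holds one value per gene.
toy_example_query = {"inputs": [{"counts": [0, 0, 0]}]}
expected_length = check_counts_length([5, 1, 9], toy_example_query)
print(expected_length)  # 3
```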

### Gene order

Ensure your reference counts follow the gene order the model expects. The response includes a `gene_order` field that specifies this order.
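
If your counts are keyed by gene identifier, they can be rearranged into the expected order before building the query. The sketch below assumes `gene_order` is a list of gene identifiers taken from that field; `align_counts` is a hypothetical helper, not part of `pysynthbio`, and the identifiers are toy values.

```python
def align_counts(counts_by_gene, gene_order):
    """Return counts as a list arranged in gene_order, failing on missing genes."""
    missing = [gene for gene in gene_order if gene not in counts_by_gene]
    if missing:
        raise KeyError(f"{len(missing)} genes missing from counts, e.g. {missing[0]}")
    return [counts_by_gene[gene] for gene in gene_order]

toy_counts_by_gene = {"GENE_B": 2, "GENE_A": 1, "GENE_C": 3}
aligned = align_counts(toy_counts_by_gene, ["GENE_A", "GENE_B", "GENE_C"])
print(aligned)  # [1, 2, 3]
```

Raising on missing genes, rather than silently filling zeros, keeps a mismatched annotation from producing a plausible-looking but wrong reference.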