Reference conditioning - Synthesize Bio

Overview

Reference conditioning models generate expression data conditioned on a real reference sample, so you can anchor to an existing expression profile while applying perturbations or modifications. This is useful when you want to:

Simulate the effect of a perturbation on a specific sample
Generate expression profiles that preserve the biological and technical characteristics of a reference
Create synthetic treated versus control pairs

Available models

gem-1-bulk_reference-conditioning: Bulk RNA-seq reference conditioning model
gem-1-sc_reference-conditioning: Single-cell RNA-seq reference conditioning model

These endpoints may require 1 to 2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.

import pysynthbio

How it works

Reference conditioning encodes the biological and technical characteristics from a real expression sample, then generates new expression data that:

Preserves the biological and technical latent space of the reference
Applies any perturbation metadata you specify
Returns synthetic expression that reflects the perturbation effect on that specific sample

Creating a query

Reference conditioning queries require different inputs than baseline models:

example_query = pysynthbio.get_example_query(model_id="gem-1-bulk_reference-conditioning")["example_query"]
print(example_query)

The query structure includes:

inputs: A list where each input contains:
- counts: The reference expression counts
- metadata: Perturbation-only metadata
- num_samples: How many samples to generate
conditioning: Which latent spaces to condition on, typically ["biological", "technical"]
sampling_strategy: "mean estimation" or "sample generation"
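As a rough sketch of that shape, the dictionary below spells out the fields listed above. The values are placeholders rather than real defaults: the counts list is truncated to three entries for readability, whereas a real query needs one count per gene in the model's expected order.

```python
# Illustrative shape only; every value here is a placeholder, not a real default.
query_sketch = {
    "inputs": [
        {
            "counts": [0, 12, 5],  # truncated; a real vector has one entry per gene
            "metadata": {"perturbation_type": "control"},  # perturbation fields only
            "num_samples": 1,
        }
    ],
    "conditioning": ["biological", "technical"],
    "sampling_strategy": "mean estimation",
}

# Print the top-level fields and the Python type each one holds.
for key, value in query_sketch.items():
    print(key, "->", type(value).__name__)
```

In practice it is safer to start from get_example_query() and overwrite fields than to build the dictionary by hand.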

Perturbation-only metadata

Unlike baseline models, reference conditioning queries only accept perturbation metadata fields:

perturbation_ontology_id
perturbation_type
perturbation_time
perturbation_dose

All other biological and technical metadata is inferred from the reference expression.
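A small local check can catch stray fields before a query is sent. The sketch below is not part of pysynthbio; the helper name `unexpected_metadata_fields` and the `tissue` key are hypothetical, used only to illustrate a non-perturbation field being flagged.

```python
# The four perturbation fields listed above.
ALLOWED_FIELDS = {
    "perturbation_ontology_id",
    "perturbation_type",
    "perturbation_time",
    "perturbation_dose",
}


def unexpected_metadata_fields(metadata):
    """Return, sorted, the metadata keys that are not perturbation fields."""
    return sorted(set(metadata) - ALLOWED_FIELDS)


# "tissue" is a hypothetical baseline-style field that does not belong here.
metadata = {"perturbation_type": "compound", "tissue": "liver"}
extra = unexpected_metadata_fields(metadata)
if extra:
    print("Remove before sending:", extra)
```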

Example: simulating a drug treatment

A complete example simulating a drug treatment effect on a reference sample:

query = pysynthbio.get_example_query(model_id="gem-1-bulk_reference-conditioning")["example_query"]

# Replace with your actual reference counts.
# The counts list must match the model's expected gene order and length.
query["inputs"][0]["counts"] = your_reference_counts

# Specify the perturbation
query["inputs"][0]["metadata"] = {
    "perturbation_ontology_id": "CHEMBL25",  # Aspirin (ChEMBL ID)
    "perturbation_type": "compound",
    "perturbation_time": "24 hours",
    "perturbation_dose": "10 um",
}

query["inputs"][0]["num_samples"] = 3
query["sampling_strategy"] = "mean estimation"

result = pysynthbio.predict_query(query, model_id="gem-1-bulk_reference-conditioning")

Example: CRISPR knockout simulation

Simulate the effect of knocking out a specific gene:

query = pysynthbio.get_example_query(model_id="gem-1-bulk_reference-conditioning")["example_query"]

# Your reference sample counts
query["inputs"][0]["counts"] = control_sample_counts

# CRISPR knockout of TP53
query["inputs"][0]["metadata"] = {
    "perturbation_ontology_id": "ENSG00000141510",  # TP53 Ensembl ID
    "perturbation_type": "crispr",
}

query["inputs"][0]["num_samples"] = 5

result = pysynthbio.predict_query(query, model_id="gem-1-bulk_reference-conditioning")

Query parameters

`conditioning` (list, optional)

Controls which latent spaces are conditioned on the reference. Default is ["biological", "technical"]. When both are conditioned, the model preserves both biological identity and technical characteristics from the reference sample.

`sampling_strategy` (str, required)

Controls the type of prediction:

"sample generation": Generates realistic-looking synthetic data that includes measurement error. Available for bulk models only.
"mean estimation": Provides stable mean estimates. Available for bulk and single-cell models.

query["sampling_strategy"] = "mean estimation"
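Because "sample generation" is limited to bulk models, a client-side guard can fail fast before a request is made. This is a minimal sketch, not library behavior: the helper `check_sampling_strategy` is hypothetical, and it assumes single-cell model IDs can be recognized by the "-sc_" fragment seen in gem-1-sc_reference-conditioning.

```python
ALL_STRATEGIES = {"sample generation", "mean estimation"}
BULK_ONLY_STRATEGIES = {"sample generation"}


def check_sampling_strategy(model_id, strategy):
    """Return the strategy if it is valid for the model, else raise ValueError."""
    if strategy not in ALL_STRATEGIES:
        raise ValueError(f"Unknown sampling_strategy: {strategy!r}")
    # Assumption: single-cell model IDs contain "-sc_".
    is_single_cell = "-sc_" in model_id
    if is_single_cell and strategy in BULK_ONLY_STRATEGIES:
        raise ValueError(f"{strategy!r} is only available for bulk models")
    return strategy


print(check_sampling_strategy("gem-1-bulk_reference-conditioning", "sample generation"))
```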

`fixed_total_count` (bool, optional)

Controls whether the output uses a fixed library size instead of the reference’s:

False (default): The output’s total count is taken from the reference expression sum
True: Uses the total_count value (or its default, if unset) instead of the reference’s library size

# Preserve reference library size (default)
query["fixed_total_count"] = False

# Or force a specific library size
query["fixed_total_count"] = True
query["total_count"] = 10_000_000

`total_count` (int, optional)

Library size used when converting predicted log CPM back to raw counts. Only effective when fixed_total_count = True.

Default: 10,000,000 for bulk; 10,000 for single-cell
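The interaction between fixed_total_count and total_count can be summarized in a few lines. The sketch below only restates the rules described above; the helper names are hypothetical, and the service's actual conversion (including how it undoes the log transform) is not shown in this page, so only the CPM-to-counts scaling is illustrated.

```python
# Defaults as stated above.
DEFAULT_TOTAL_COUNT = {"bulk": 10_000_000, "single-cell": 10_000}


def effective_library_size(reference_counts, fixed_total_count=False,
                           total_count=None, modality="bulk"):
    """Library size the output is expected to use, per the documented rules."""
    if not fixed_total_count:
        # Default behavior: follow the reference sample's own library size.
        return sum(reference_counts)
    if total_count is not None:
        return total_count
    return DEFAULT_TOTAL_COUNT[modality]


def cpm_to_counts(cpm, library_size):
    """Scale counts-per-million values to a given library size."""
    return [value * library_size / 1_000_000 for value in cpm]


reference = [100, 300, 600]
print(effective_library_size(reference))
print(effective_library_size(reference, fixed_total_count=True))
print(cpm_to_counts([250_000.0, 750_000.0], library_size=1_000))
```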

`deterministic_latents` (bool, optional)

If True, the model uses the mean of each latent distribution instead of sampling. This produces deterministic, reproducible outputs.

Default: False

query["deterministic_latents"] = True

`seed` (int, optional)

Random seed for reproducibility.

query["seed"] = 42

Valid perturbation metadata

| Field | Description / format |
| --- | --- |
| `perturbation_ontology_id` | Ensembl gene ID, ChEBI ID, ChEMBL ID, or NCBI Taxonomy ID |
| `perturbation_type` | One of `"coculture"`, `"compound"`, `"control"`, `"crispr"`, `"genetic"`, `"infection"`, `"other"`, `"overexpression"`, `"peptide or biologic"`, `"shrna"`, `"sirna"` |
| `perturbation_time` | Time since perturbation as a number and unit separated by a space, such as `"24 hours"` |
| `perturbation_dose` | Dose as a number and unit separated by a space, such as `"10 um"` or `"1 mg/kg"` |
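These formats can be pre-checked locally before a query is sent. The sketch below is an assumption-laden illustration, not the API's own validation: the helper `metadata_problems` is hypothetical, and the regular expression encodes one reading of "a number and unit separated by a space" (digits with an optional decimal part, a single space, then a unit without spaces). The server may accept or reject values differently.

```python
import re

VALID_TYPES = {
    "coculture", "compound", "control", "crispr", "genetic", "infection",
    "other", "overexpression", "peptide or biologic", "shrna", "sirna",
}

# Assumption: number, one space, then a unit containing no whitespace.
NUMBER_UNIT = re.compile(r"\d+(?:\.\d+)? \S+")


def metadata_problems(metadata):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    ptype = metadata.get("perturbation_type")
    if ptype is not None and ptype not in VALID_TYPES:
        problems.append(f"perturbation_type {ptype!r} is not a listed value")
    for field in ("perturbation_time", "perturbation_dose"):
        value = metadata.get(field)
        if value is None:
            continue
        if not isinstance(value, str) or not NUMBER_UNIT.fullmatch(value):
            problems.append(f"{field} {value!r} is not '<number> <unit>'")
    return problems


print(metadata_problems({"perturbation_type": "compound", "perturbation_dose": "10um"}))
```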

Working with results

The result structure is similar to that of baseline models:

metadata = result["metadata"]
expression = result["expression"]

print(expression.shape)
print(metadata.head())

Differential expression

When conditioning on both biological and technical latents, you can directly compare the generated expression to your reference to identify perturbation effects:

import numpy as np

# Your reference (input) counts, converted to an array so the division is element-wise
reference_counts = np.asarray(your_reference_counts, dtype=float)
reference_cpm = reference_counts / reference_counts.sum() * 1e6

# Generated (perturbed) counts for the first synthetic sample
generated_counts = expression.iloc[0].to_numpy(dtype=float)
generated_cpm = generated_counts / generated_counts.sum() * 1e6

# Log2 fold change, with a pseudocount of 1 to avoid log(0)
log2fc = np.log2(generated_cpm + 1) - np.log2(reference_cpm + 1)

# Identify the 20 most upregulated genes, largest change first
gene_names = expression.columns
top_indices = np.argsort(log2fc)[-20:][::-1]
print("Top upregulated genes:", gene_names[top_indices].tolist())

Important notes

Counts vector length

The reference counts vector must match the model’s expected number of genes. If the length does not match, the API returns a validation error. Use get_example_query() to see the expected structure and ensure your counts vector has the correct length.
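A length check against the example query can surface a mismatch before the API does. This is a minimal sketch under one assumption: that the counts vector in the example query has the length the model expects. The helper `check_counts_length` is hypothetical, and the stand-in dictionary below replaces a real call to get_example_query() so the snippet runs offline.

```python
def check_counts_length(reference_counts, example_query):
    """Raise ValueError if the counts length differs from the example query's."""
    # Assumption: the example query's counts vector has the expected length.
    expected = len(example_query["inputs"][0]["counts"])
    actual = len(reference_counts)
    if actual != expected:
        raise ValueError(f"counts has {actual} entries, expected {expected}")
    return actual


# Stand-in for get_example_query(...)["example_query"], truncated to 4 genes.
example_query = {"inputs": [{"counts": [0, 0, 0, 0]}]}

print(check_counts_length([5, 0, 12, 3], example_query))
```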

Gene order

Ensure your reference counts are in the same gene order expected by the model. The response includes a gene_order field that specifies the expected order.
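If your counts are keyed by gene, reordering them to a target order is a short step. The sketch below assumes gene_order is a list of gene identifiers that match the keys of your own data; the helper `align_counts` and the identifiers "GENE_A" and "GENE_B" are hypothetical. Filling missing genes with zero is a choice made here for illustration, not something the source prescribes, so the missing genes are returned for review.

```python
def align_counts(counts_by_gene, gene_order, fill_value=0):
    """Return counts arranged in gene_order, plus the genes that were missing."""
    missing = [gene for gene in gene_order if gene not in counts_by_gene]
    aligned = [counts_by_gene.get(gene, fill_value) for gene in gene_order]
    return aligned, missing


# Hypothetical inputs; a real gene_order would list every gene the model expects.
gene_order = ["GENE_A", "ENSG00000141510", "GENE_B"]
counts_by_gene = {"ENSG00000141510": 42, "GENE_A": 7}

aligned, missing = align_counts(counts_by_gene, gene_order)
print(aligned)
print("Missing genes:", missing)
```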
