Overview
Baseline models generate synthetic gene expression data from metadata alone. You describe the biological conditions, such as tissue type, disease state, perturbations, and cell type, and the model generates realistic expression profiles matching those conditions. This is the most common use case: generating synthetic data for conditions where real data may be scarce or unavailable.
Available models
- gem-1-bulk: Bulk RNA-seq baseline model
- gem-1-sc: Single-cell RNA-seq baseline model
Creating a query
The structure of the query required by the API is specific to each model. Use get_example_query() to get a correctly structured example for your chosen model.
- sampling_strategy: The prediction mode that controls how expression data is generated
  - "sample generation": Generates realistic-looking synthetic data with measurement error (bulk only)
  - "mean estimation": Provides stable mean estimates of expression levels (bulk and single-cell)
- inputs: A list of biological conditions to generate data for. Each input contains metadata describing the biological sample and num_samples for how many samples to generate.
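The query layout described above can be sketched as a plain Python dictionary. This is a minimal illustration, not the client library's own code: `build_query` and `SUPPORTED_STRATEGIES` are hypothetical names, the metadata values are made up, and the ontology ID format shown is an assumption. For a structure guaranteed to match your model, start from get_example_query() instead.

```python
import json

# Strategy support per model, as listed on this page.
SUPPORTED_STRATEGIES = {
    "gem-1-bulk": {"sample generation", "mean estimation"},
    "gem-1-sc": {"mean estimation"},
}


def build_query(model_id, sampling_strategy, conditions):
    """Assemble a query dict from (metadata, num_samples) pairs.

    Hypothetical helper for illustration; not part of any client library.
    """
    allowed = SUPPORTED_STRATEGIES.get(model_id)
    if allowed is None:
        raise ValueError(f"unknown model: {model_id!r}")
    if sampling_strategy not in allowed:
        raise ValueError(
            f"{model_id} does not support {sampling_strategy!r}; "
            f"choose one of {sorted(allowed)}"
        )
    return {
        "sampling_strategy": sampling_strategy,
        "inputs": [
            {"metadata": dict(metadata), "num_samples": num_samples}
            for metadata, num_samples in conditions
        ],
    }


# Illustrative metadata; the ontology ID format is an assumption.
liver = {
    "tissue_ontology_id": "UBERON:0002107",
    "sex": "female",
    "sample_type": "primary tissue",
}
query = build_query("gem-1-bulk", "sample generation", [(liver, 5)])
print(json.dumps(query, indent=2))
```

Checking the strategy against the model before sending anything catches the most likely mistake early: requesting "sample generation" from a single-cell model.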
Making a prediction
Once your query is ready, send it to the API to generate gene expression data. The result contains metadata and expression.
Single-cell example
Single-cell models only support "mean estimation" mode.
Query parameters
sampling_strategy (str, required)
Controls the type of prediction the model generates. Required in all queries.
Available modes:
- "sample generation": The model generates realistic-looking synthetic data that captures measurement error. Useful when you want data that mimics real experimental measurements. Bulk only.
- "mean estimation": The model creates a distribution capturing biological heterogeneity consistent with the supplied metadata, then returns the mean of that distribution. Useful when you want a stable estimate of expected expression levels. Bulk and single-cell.
total_count (int, optional)
Library size used when converting predicted log CPM back to raw counts. Higher values scale counts up proportionally.
- Default: 10,000,000 for bulk; 10,000 for single-cell
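To make the proportional scaling concrete, here is a rough sketch of how a library size turns CPM values into counts. It assumes the log transform is ln(CPM + 1); the page does not state the base or pseudocount, so treat that as a placeholder. `logcpm_to_counts` is a hypothetical name, not an API function.

```python
import math


def logcpm_to_counts(log_cpm, total_count=10_000_000):
    """Convert log CPM values to expected counts for a given library size.

    ASSUMPTION: log_cpm = ln(CPM + 1). The real transform may differ.
    """
    cpm = [math.expm1(value) for value in log_cpm]
    # CPM is counts per million reads, so scale by total_count / 1e6.
    return [c * total_count / 1_000_000 for c in cpm]


log_cpm = [math.log1p(5.0), math.log1p(120.0)]
bulk_counts = logcpm_to_counts(log_cpm)                    # bulk default
sc_counts = logcpm_to_counts(log_cpm, total_count=10_000)  # single-cell default
print(bulk_counts, sc_counts)
```

Under this assumption, a gene at 5 CPM corresponds to about 50 counts at the bulk default and about 0.05 at the single-cell default, which is why the two model types use different defaults.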
deterministic_latents (bool, optional)
If True, the model uses the mean of each latent distribution (p(z|metadata)) instead of sampling. This removes randomness from latent sampling and produces deterministic outputs for the same inputs.
- Default: False
seed (int, optional)
Random seed for reproducibility when using stochastic sampling.
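The difference between deterministic_latents and seed can be illustrated with a toy stand-in for the latent distribution. This sketch assumes an independent Gaussian per latent dimension, which is only a placeholder for p(z|metadata); `draw_latent` is a hypothetical function and says nothing about how the model is implemented.

```python
import random


def draw_latent(mean, std, deterministic_latents=False, seed=None):
    """Return a latent vector from a toy Gaussian (illustrative only)."""
    if deterministic_latents:
        # Use the mean of each latent distribution; no randomness involved.
        return list(mean)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


mean, std = [0.0, 1.5, -2.0], [1.0, 0.5, 0.25]

fixed = draw_latent(mean, std, deterministic_latents=True)
seeded_a = draw_latent(mean, std, seed=42)
seeded_b = draw_latent(mean, std, seed=42)
other = draw_latent(mean, std, seed=7)
print(fixed, seeded_a, other)
```

In this toy, deterministic_latents removes sampling altogether, while seed keeps the sampling but makes it repeatable: the two seeded draws match each other and differ from the mean.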
Combining parameters
You can combine multiple parameters in a single query, for example a sampling_strategy together with total_count, deterministic_latents, and seed.
Valid metadata keys
The input metadata is a dictionary. Here are all valid keys.
Biological
- age_years
- cell_line_ontology_id
- cell_type_ontology_id
- developmental_stage
- disease_ontology_id
- ethnicity
- genotype
- race
- sample_type: "cell line", "organoid", "other", "primary cells", "primary tissue", "xenograft"
- sex: "male", "female"
- tissue_ontology_id
Perturbational
- perturbation_dose: number and unit separated by a space, for example "10 um"
- perturbation_ontology_id
- perturbation_time: number and unit separated by a space, for example "24 hours"
- perturbation_type: one of "coculture", "compound", "control", "crispr", "genetic", "infection", "other", "overexpression", "peptide or biologic", "shrna", "sirna"
Technical
- study: BioProject ID
- library_selection: for example "cDNA", "polyA", "Oligo-dT" (see the ENA documentation)
- library_layout: "PAIRED" or "SINGLE"
- platform: "illumina"
Valid metadata values
The following are the valid values or expected formats for selected metadata keys:
| Metadata field | Requirement / example |
|---|---|
| cell_line_ontology_id | Requires a Cellosaurus ID |
| cell_type_ontology_id | Requires a CL ID |
| disease_ontology_id | Requires a MONDO ID |
| perturbation_ontology_id | Must be a valid Ensembl gene ID, ChEBI ID, ChEMBL ID, or NCBI Taxonomy ID |
| tissue_ontology_id | Requires a UBERON ID |
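A small local check can catch misspelled keys or badly formatted values before a query is sent. The sketch below encodes only the keys, enumerated values, and "number unit" format listed on this page. `validate_metadata` is a hypothetical helper, it does not verify ontology IDs, and the API may apply further rules.

```python
import re

VALID_KEYS = {
    # Biological
    "age_years", "cell_line_ontology_id", "cell_type_ontology_id",
    "developmental_stage", "disease_ontology_id", "ethnicity", "genotype",
    "race", "sample_type", "sex", "tissue_ontology_id",
    # Perturbational
    "perturbation_dose", "perturbation_ontology_id", "perturbation_time",
    "perturbation_type",
    # Technical
    "study", "library_selection", "library_layout", "platform",
}

ALLOWED_VALUES = {
    "sample_type": {"cell line", "organoid", "other", "primary cells",
                    "primary tissue", "xenograft"},
    "sex": {"male", "female"},
    "perturbation_type": {"coculture", "compound", "control", "crispr",
                          "genetic", "infection", "other", "overexpression",
                          "peptide or biologic", "shrna", "sirna"},
    "library_layout": {"PAIRED", "SINGLE"},
}

# A number and a unit separated by one space, e.g. "10 um" or "24 hours".
NUMBER_AND_UNIT = re.compile(r"^\d+(?:\.\d+)? \S+$")
NUMBER_UNIT_KEYS = {"perturbation_dose", "perturbation_time"}


def validate_metadata(metadata):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for key, value in metadata.items():
        if key not in VALID_KEYS:
            problems.append(f"unknown key: {key}")
            continue
        allowed = ALLOWED_VALUES.get(key)
        if allowed is not None and value not in allowed:
            problems.append(f"{key}: {value!r} is not one of {sorted(allowed)}")
        if key in NUMBER_UNIT_KEYS and not NUMBER_AND_UNIT.match(str(value)):
            problems.append(f"{key}: {value!r} is not '<number> <unit>'")
    return problems


good = {"sex": "female", "perturbation_type": "compound",
        "perturbation_dose": "10 um", "perturbation_time": "24 hours"}
bad = {"sex": "unknown", "perturbation_dose": "10um", "tissue": "liver"}
print(validate_metadata(good))
print(validate_metadata(bad))
```

Returning a list of problems rather than raising on the first one lets you fix every issue in a metadata dictionary in a single pass.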