# Baseline models

> Generate synthetic gene expression data from metadata alone.

## Overview

Baseline models take a description of the biological conditions, such as tissue type, disease state, perturbations, and cell type, and generate expression profiles consistent with those conditions. No measured expression data is required as input.

This is the most common use case: generating synthetic data for conditions where real data may be scarce or unavailable.

## Available models

* **`gem-1-bulk`**: Bulk RNA-seq baseline model
* **`gem-1-sc`**: Single-cell RNA-seq baseline model

All examples on this page use the `pysynthbio` package:

```python theme={null}
import pysynthbio
```

## Creating a query

Each model expects its own query structure. Use `get_example_query()` to get a correctly structured example for your chosen model.

```python theme={null}
example_query = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]
print(example_query)
```

The query consists of:

1. **`sampling_strategy`**: The prediction mode that controls how expression data is generated
   * `"sample generation"`: Generates realistic-looking synthetic data with measurement error (bulk only)
   * `"mean estimation"`: Provides stable mean estimates of expression levels (bulk and single-cell)
2. **`inputs`**: A list of biological conditions to generate data for

Each input contains `metadata` describing the biological sample and `num_samples` for how many samples to generate.
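As a rough illustration of the layout described above, the sketch below writes a query out by hand. The metadata values are placeholders, `total_samples` is a hypothetical helper, and the example query returned by `get_example_query()` may contain fields not shown here, so treat that function as the source of truth.

```python
# Hand-written illustration of the query layout described above.
# Values are placeholders; real queries should start from
# pysynthbio.get_example_query() for the chosen model.
query = {
    "sampling_strategy": "mean estimation",
    "inputs": [
        {
            "metadata": {
                "sex": "female",
                "sample_type": "primary tissue",
                "tissue_ontology_id": "UBERON:0002371",
            },
            "num_samples": 3,
        },
    ],
}


def total_samples(query: dict) -> int:
    """Hypothetical helper: sum num_samples across all inputs."""
    return sum(item["num_samples"] for item in query["inputs"])


print(total_samples(query))
```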

## Making a prediction

Once your query is ready, send it to the API to generate gene expression data:

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]
result = pysynthbio.predict_query(query, model_id="gem-1-bulk")
```

The result is a dictionary containing two DataFrames: `metadata` and `expression`.

### Single-cell example

```python theme={null}
sc_query = pysynthbio.get_example_query(model_id="gem-1-sc")["example_query"]
sc_result = pysynthbio.predict_query(sc_query, model_id="gem-1-sc")
```

<Note>
  Single-cell models only support `"mean estimation"` mode.
</Note>

## Query parameters

### `sampling_strategy` (str, required)

Controls the type of prediction the model generates. Required in all queries.

Available modes:

* **`"sample generation"`**: The model generates realistic-looking synthetic data that captures measurement error. Useful when you want data that mimics real experimental measurements. **Bulk only**
* **`"mean estimation"`**: The model creates a distribution capturing biological heterogeneity consistent with the supplied metadata, then returns the mean of that distribution. Useful when you want a stable estimate of expected expression levels. **Bulk and single-cell**

```python theme={null}
# Bulk query with sample generation
bulk_query = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]
bulk_query["sampling_strategy"] = "sample generation"

# Bulk query with mean estimation
bulk_query_mean = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]
bulk_query_mean["sampling_strategy"] = "mean estimation"

# Single-cell query (must use mean estimation)
sc_query = pysynthbio.get_example_query(model_id="gem-1-sc")["example_query"]
sc_query["sampling_strategy"] = "mean estimation"
```
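Because the supported modes differ by model, it can help to check the combination locally before sending a query. The following is a minimal sketch, not part of `pysynthbio`: `SUPPORTED_STRATEGIES` and `check_sampling_strategy` are hypothetical names, and the mapping simply restates the modes listed above.

```python
# Supported strategies per model, as listed in this section.
SUPPORTED_STRATEGIES = {
    "gem-1-bulk": {"sample generation", "mean estimation"},
    "gem-1-sc": {"mean estimation"},
}


def check_sampling_strategy(query: dict, model_id: str) -> None:
    """Hypothetical local check; raises ValueError on an unsupported combination."""
    strategy = query.get("sampling_strategy")
    allowed = SUPPORTED_STRATEGIES[model_id]
    if strategy not in allowed:
        raise ValueError(
            f"{strategy!r} is not supported by {model_id}; "
            f"choose one of {sorted(allowed)}"
        )


# Passes silently: single-cell with mean estimation is a supported pair.
check_sampling_strategy({"sampling_strategy": "mean estimation"}, "gem-1-sc")
```

Calling the helper with `"sample generation"` and `"gem-1-sc"` raises a `ValueError` before any request is made.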

### `total_count` (int, optional)

Library size used when converting predicted log CPM back to raw counts. Higher values scale counts up proportionally.

* Default: 10,000,000 for bulk; 10,000 for single-cell

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]
query["total_count"] = 5_000_000
```
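To see why counts scale proportionally with `total_count`, consider the toy conversion below. It assumes the predicted values are `ln(CPM + 1)`; the transform the service actually applies (log base, pseudocount, rounding) is not documented on this page, and `log_cpm_to_counts` is a hypothetical helper.

```python
import math


def log_cpm_to_counts(log_cpm: list[float], total_count: int) -> list[float]:
    """Illustrative conversion, ASSUMING log_cpm = ln(CPM + 1)."""
    cpm = [math.expm1(value) for value in log_cpm]  # undo the assumed log transform
    return [c * total_count / 1_000_000 for c in cpm]  # rescale CPM to the library size


# Three toy genes with CPM of 0, 50, and 2000.
log_cpm = [0.0, math.log1p(50.0), math.log1p(2000.0)]

at_default = log_cpm_to_counts(log_cpm, 10_000_000)
at_half = log_cpm_to_counts(log_cpm, 5_000_000)

print(at_default)
print(at_half)
```

Halving `total_count` halves every count, whatever log transform is assumed, because the library size enters only as a multiplicative factor.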

### `deterministic_latents` (bool, optional)

If `True`, the model uses the mean of each latent distribution (`p(z|metadata)`) instead of sampling. This removes randomness from latent sampling and produces deterministic outputs for the same inputs.

* Default: `False`

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]
query["deterministic_latents"] = True
```
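The difference between sampling a latent and using its mean can be pictured with a toy example. This sketch models a single latent dimension as a Gaussian, which is an assumption made for illustration only; `draw_latent` is a hypothetical function and says nothing about the model's internals.

```python
import random


def draw_latent(mean: float, std: float, deterministic: bool,
                rng: random.Random) -> float:
    """Toy stand-in for one latent dimension of p(z|metadata)."""
    if deterministic:
        return mean  # use the distribution mean, no randomness
    return rng.gauss(mean, std)  # draw a random sample around the mean


rng = random.Random(0)
sampled = [draw_latent(1.5, 0.3, False, rng) for _ in range(3)]
fixed = [draw_latent(1.5, 0.3, True, rng) for _ in range(3)]

print(sampled)  # three different values scattered around 1.5
print(fixed)    # the mean, repeated three times
```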

### `seed` (int, optional)

Random seed for reproducibility when using stochastic sampling.

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]
query["seed"] = 42
```
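The effect of a fixed seed can be illustrated with Python's own random number generator. This is a toy analogy rather than the model's sampler: `noisy_profile` is a hypothetical function that stands in for any stochastic sampling step.

```python
import random


def noisy_profile(seed: int, n_genes: int = 4) -> list[float]:
    """Toy stand-in for stochastic sampling; not the model's actual sampler."""
    rng = random.Random(seed)  # a private generator, seeded explicitly
    return [rng.gauss(0.0, 1.0) for _ in range(n_genes)]


same_seed_matches = noisy_profile(42) == noisy_profile(42)
different_seed_matches = noisy_profile(42) == noisy_profile(7)

print(same_seed_matches, different_seed_matches)
```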

### Combining parameters

You can combine multiple parameters in a single query:

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]
query["total_count"] = 8_000_000
query["deterministic_latents"] = True
query["sampling_strategy"] = "mean estimation"

result = pysynthbio.predict_query(query, model_id="gem-1-bulk")
```

## Valid metadata keys

Each input's `metadata` is a dictionary. The valid keys are listed below.

### Biological

* `age_years`
* `cell_line_ontology_id`
* `cell_type_ontology_id`
* `developmental_stage`
* `disease_ontology_id`
* `ethnicity`
* `genotype`
* `race`
* `sample_type`: `"cell line"`, `"organoid"`, `"other"`, `"primary cells"`, `"primary tissue"`, `"xenograft"`
* `sex`: `"male"`, `"female"`
* `tissue_ontology_id`

### Perturbational

* `perturbation_dose`: number and unit separated by a space, for example `"10 um"`
* `perturbation_ontology_id`
* `perturbation_time`: number and unit separated by a space, for example `"24 hours"`
* `perturbation_type`: one of `"coculture"`, `"compound"`, `"control"`, `"crispr"`, `"genetic"`, `"infection"`, `"other"`, `"overexpression"`, `"peptide or biologic"`, `"shrna"`, `"sirna"`
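The "number and unit separated by a space" format for `perturbation_dose` and `perturbation_time` can be checked locally before sending a query. The helper below is a hypothetical sketch; it verifies only the shape of the string, not whether the API accepts a particular unit.

```python
def parse_quantity(value: str) -> tuple[float, str]:
    """Hypothetical helper: split '<number> <unit>' into (number, unit)."""
    parts = value.split(" ")
    # Exactly two non-empty pieces: the amount and the unit.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<number> <unit>', got {value!r}")
    number, unit = parts
    return float(number), unit  # float() raises ValueError on a non-numeric amount


dose = parse_quantity("10 um")
duration = parse_quantity("24 hours")
print(dose, duration)
```

A string without the separating space, such as `"10um"`, raises a `ValueError`.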

### Technical

* `study`: BioProject ID
* `library_selection`: for example `"cDNA"`, `"polyA"`, `"Oligo-dT"` (see the [ENA documentation](https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection))
* `library_layout`: `"PAIRED"` or `"SINGLE"`
* `platform`: `"illumina"`

## Valid metadata values

The following are the valid values or expected formats for selected metadata keys:

| Metadata field             | Requirement / example                                                                                                                                                               |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `cell_line_ontology_id`    | Requires a [Cellosaurus ID](https://www.cellosaurus.org/)                                                                                                                           |
| `cell_type_ontology_id`    | Requires a [CL ID](https://www.ebi.ac.uk/ols4/ontologies/cl)                                                                                                                        |
| `disease_ontology_id`      | Requires a [MONDO ID](https://www.ebi.ac.uk/ols4/ontologies/mondo)                                                                                                                  |
| `perturbation_ontology_id` | Must be a valid Ensembl gene ID, [ChEBI ID](https://www.ebi.ac.uk/chebi/), [ChEMBL ID](https://www.ebi.ac.uk/chembl/), or [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) |
| `tissue_ontology_id`       | Requires a [UBERON ID](https://www.ebi.ac.uk/ols4/ontologies/uberon)                                                                                                                |

We highly recommend the [EMBL-EBI Ontology Lookup Service](https://www.ebi.ac.uk/ols4/) for finding valid IDs.

Each model accepts only a limited range of metadata values. If you provide a value outside that range, the API returns an error.
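A local pre-flight check can catch some mistakes before the API does. The sketch below is an assumption-laden illustration: the enumerated values are copied from the key lists on this page, the ID patterns assume the usual `PREFIX:7-digit` accession shape for CL, MONDO, and UBERON, and `find_metadata_problems` is a hypothetical helper. Cellosaurus and perturbation IDs are left unchecked, and passing this check does not guarantee the model accepts the value.

```python
import re

# Enumerated values copied from the key lists in this document.
ALLOWED_VALUES = {
    "sample_type": {"cell line", "organoid", "other", "primary cells",
                    "primary tissue", "xenograft"},
    "sex": {"male", "female"},
    "library_layout": {"PAIRED", "SINGLE"},
}

# ASSUMED accession shapes for the OBO-style ontologies.
ID_PATTERNS = {
    "cell_type_ontology_id": re.compile(r"CL:\d{7}"),
    "disease_ontology_id": re.compile(r"MONDO:\d{7}"),
    "tissue_ontology_id": re.compile(r"UBERON:\d{7}"),
}


def find_metadata_problems(metadata: dict) -> list[str]:
    """Hypothetical helper: return human-readable problems, empty if none found."""
    problems = []
    for key, value in metadata.items():
        if key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append(f"{key}: {value!r} is not one of {sorted(ALLOWED_VALUES[key])}")
        elif key in ID_PATTERNS and not ID_PATTERNS[key].fullmatch(str(value)):
            problems.append(f"{key}: {value!r} does not look like an ontology ID")
    return problems


good = {"sex": "male", "sample_type": "primary tissue",
        "tissue_ontology_id": "UBERON:0002371"}
bad = {"sex": "unknown", "tissue_ontology_id": "bone marrow"}

print(find_metadata_problems(good))
print(find_metadata_problems(bad))
```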

## Modifying query inputs

Customize the query inputs to fit your specific research needs:

```python theme={null}
# Get a base query
query = pysynthbio.get_example_query(model_id="gem-1-bulk")["example_query"]

# Adjust number of samples for the first input
query["inputs"][0]["num_samples"] = 10

# Add a new condition
query["inputs"].append({
    "metadata": {
        "sex": "male",
        "sample_type": "primary tissue",
        "tissue_ontology_id": "UBERON:0002371",
    },
    "num_samples": 5,
})
```
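When many conditions are needed, the inputs can be generated programmatically rather than appended one at a time. The sketch below builds one input per tissue and sex combination. `build_inputs` is a hypothetical helper, and the second tissue ID (`UBERON:0002107`, liver) is included only for illustration and may fall outside the range a given model accepts.

```python
from itertools import product


def build_inputs(tissue_ids: list[str], sexes: list[str],
                 num_samples: int) -> list[dict]:
    """Hypothetical helper: one input per (tissue, sex) combination."""
    return [
        {
            "metadata": {
                "sex": sex,
                "sample_type": "primary tissue",
                "tissue_ontology_id": tissue_id,
            },
            "num_samples": num_samples,
        }
        for tissue_id, sex in product(tissue_ids, sexes)
    ]


inputs = build_inputs(
    ["UBERON:0002371", "UBERON:0002107"],  # second ID is illustrative only
    ["male", "female"],
    num_samples=5,
)
print(len(inputs))
```

The resulting list can then be added to an existing query with `query["inputs"].extend(inputs)`.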

## Working with results

```python theme={null}
# `result` is the dictionary returned by predict_query()
# Access the metadata and expression DataFrames
metadata = result["metadata"]
expression = result["expression"]

# Check dimensions
print(expression.shape)

# View metadata sample
print(metadata.head())
```

You can save the results for later use:

```python theme={null}
import pickle

# Save results to CSV files
expression.to_csv("expression_matrix.csv")
metadata.to_csv("sample_metadata.csv")

# Or pickle the whole result dictionary for later use
with open("synthesize_results.pkl", "wb") as f:
    pickle.dump(result, f)
```
