# Metadata prediction

> Infer biological metadata from observed expression data.

## Overview

Metadata prediction models take a gene expression profile and predict the biological characteristics most consistent with it, such as cell type, tissue, and disease state.

This is useful when you want to:

* Annotate samples of unknown origin
* Validate sample labels against expression patterns
* Discover potentially mislabeled or contaminated samples
* Understand the biological characteristics captured in expression data

## Available models

* **`gem-1-bulk_predict-metadata`**: Bulk RNA-seq metadata prediction model
* **`gem-1-sc_predict-metadata`**: Single-cell RNA-seq metadata prediction model

<Note>
  These endpoints may require 1 to 2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.
</Note>

```python theme={null}
import pysynthbio
```

## How it works

Metadata prediction encodes your expression data into the model's latent space and then uses classifiers to predict the most likely metadata values for each sample. The model returns:

1. **Classifier probabilities**: For each categorical metadata field, the probability distribution over possible values
2. **Predicted labels**: The most likely value for each metadata field
3. **Latent representations**: The biological, technical, and perturbation latent vectors
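
The relationship between the first two outputs can be sketched in plain Python. This is an illustration only: it assumes a predicted label is the highest-probability value for its field, the probabilities are made up, and `pick_label` is a hypothetical helper rather than part of `pysynthbio`.

```python
def pick_label(probs):
    """Return the value with the highest probability in a {value: probability} dict."""
    return max(probs, key=probs.get)


# Made-up probabilities for a single categorical field (assumption, not model output)
tissue_probs = {"lung": 0.72, "liver": 0.20, "brain": 0.08}

predicted_tissue = pick_label(tissue_probs)
print(predicted_tissue)
```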

## Creating a query

Metadata prediction queries are simpler than those for other model types. You only need to provide expression counts:

```python theme={null}
example_query = pysynthbio.get_example_query(model_id="gem-1-bulk_predict-metadata")["example_query"]
print(example_query)
```

The query structure includes:

1. **`inputs`**: A list of count vectors, where each element is a dictionary with a `counts` field
2. **`seed`** (optional): Random seed for reproducibility

## Example: predicting sample metadata

A complete example predicting metadata for expression samples:

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk_predict-metadata")["example_query"]

# sample1_counts, sample2_counts, and sample3_counts are placeholders:
# define them with your actual expression counts before running.
# Each input should be a dictionary with a counts list.
query["inputs"] = [
    {"counts": sample1_counts},
    {"counts": sample2_counts},
    {"counts": sample3_counts},
]

# Optional: set seed for reproducibility
query["seed"] = 42

result = pysynthbio.predict_query(query, model_id="gem-1-bulk_predict-metadata")
```

## Example: single sample prediction

To predict metadata for a single sample, pass a one-element `inputs` list:

```python theme={null}
query = pysynthbio.get_example_query(model_id="gem-1-bulk_predict-metadata")["example_query"]

query["inputs"] = [
    {"counts": my_sample_counts},
]

result = pysynthbio.predict_query(query, model_id="gem-1-bulk_predict-metadata")

# Access predictions for the first (and only) sample
print(result[0]["metadata"])
```

## Query parameters

### `inputs` (list, required)

A list of expression count vectors. Each element should be a dictionary containing:

* **`counts`**: A list of non-negative integers representing gene expression counts

```python theme={null}
query["inputs"] = [
    {"counts": [0, 12, 5, 0, 33, 7]},  # Sample 1
    {"counts": [3, 0, 0, 7, 1, 0]},    # Sample 2
]
```

### `seed` (int, optional)

Random seed for reproducibility.

```python theme={null}
query["seed"] = 123
```

## Understanding the results

Metadata prediction returns a list of output dictionaries, one per input sample. Each output dictionary contains:

* `metadata`: Predicted metadata values for the sample
* `classifier_probs`: Probability distributions over possible values for each metadata field
* `latents`: Latent representations capturing biological, technical, and perturbation information

```python theme={null}
print(f"Number of outputs: {len(result)}")

# Access the first sample's output
first_output = result[0]
print(first_output.keys())
```

### Predicted metadata

Each output's `metadata` field contains the predicted values for that sample:

```python theme={null}
for i, output in enumerate(result):
    print(f"Sample {i}: {output['metadata']}")

# Access specific predictions for first sample
first_sample = result[0]["metadata"]
print(first_sample.get("cell_type_ontology_id"))
print(first_sample.get("tissue_ontology_id"))
print(first_sample.get("disease_ontology_id"))
```

### Classifier probabilities

For categorical metadata fields, the model returns probability distributions over all possible values. These are useful for understanding prediction confidence:

```python theme={null}
first_output = result[0]
cell_type_probs = first_output["classifier_probs"]["cell_type"]
sorted_probs = sorted(cell_type_probs.items(), key=lambda x: x[1], reverse=True)
print("Top predicted cell types:", sorted_probs[:5])
```
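
To turn a probability distribution into a rough confidence signal, one option is to look at the top probability and its margin over the runner-up. The sketch below assumes each field's entry is a dictionary mapping values to probabilities, as the `.items()` call above suggests; `confidence_summary` is a hypothetical helper and the numbers are made up.

```python
def confidence_summary(probs):
    """Summarize one field's {value: probability} dictionary."""
    if not probs:
        raise ValueError("probs must contain at least one value")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    top_label, top_prob = ranked[0]
    # With a single candidate there is no runner-up, so treat its probability as 0
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    return {
        "label": top_label,
        "probability": top_prob,
        "margin": top_prob - runner_up_prob,
    }


# Made-up probabilities for illustration (assumption, not model output)
example_probs = {"T cell": 0.55, "B cell": 0.40, "monocyte": 0.05}
summary = confidence_summary(example_probs)
print(summary)
```

A small margin means the two leading values are nearly tied, so the predicted label for that sample may deserve a manual check.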

### Latent representations

The model also returns latent vectors that capture biological, technical, and perturbation characteristics:

```python theme={null}
first_output = result[0]
biological_latents = first_output["latents"]["biological"]
technical_latents = first_output["latents"]["technical"]
```
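
One possible use of these vectors is comparing samples to each other. The sketch below computes cosine similarity with the standard library only. It assumes each latent is a flat sequence of numbers, which this page does not confirm; `cosine_similarity` is a hypothetical helper and the vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up latent vectors for two samples (assumption, not model output)
latent_a = [1.0, 0.0, 2.0]
latent_b = [2.0, 0.0, 4.0]
similarity = cosine_similarity(latent_a, latent_b)
print(similarity)
```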

## Use cases

### Sample annotation

Annotate unlabeled samples with predicted metadata:

```python theme={null}
import pandas as pd

# Expects genes as rows and samples as columns, with gene IDs in the first CSV column
unlabeled_counts = pd.read_csv("unlabeled_samples.csv", index_col=0)

query = pysynthbio.get_example_query(model_id="gem-1-bulk_predict-metadata")["example_query"]
query["inputs"] = [
    {"counts": unlabeled_counts.iloc[:, i].tolist()}
    for i in range(unlabeled_counts.shape[1])
]

result = pysynthbio.predict_query(query, model_id="gem-1-bulk_predict-metadata")

annotations = pd.DataFrame([output["metadata"] for output in result])
annotations["sample_id"] = unlabeled_counts.columns.tolist()
```

### Quality control

Validate existing sample labels against predicted metadata:

```python theme={null}
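# One provided label per output in `result`, in the same order as the query inputs
provided_labels = ["UBERON:0002107", "UBERON:0002107", "UBERON:0000955", "UBERON:0000955"]
predicted_labels = [output["metadata"].get("tissue_ontology_id") for output in result]

# zip() silently truncates to the shorter list, so check the lengths first
if len(provided_labels) != len(predicted_labels):
    raise ValueError("Expected one provided label per prediction output")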

mismatches = [
    i for i, (provided, predicted) in enumerate(zip(provided_labels, predicted_labels))
    if provided != predicted
]
if mismatches:
    print(f"Potential mislabeled samples: {mismatches}")
```

### Batch characterization

Understand batch-specific technical characteristics:

```python theme={null}
import numpy as np

# One batch label per output in `result`, in the same order as the query inputs
batch_labels = ["batch1", "batch1", "batch2", "batch2"]

technical_latents = [output["latents"]["technical"] for output in result]
for batch in sorted(set(batch_labels)):
    batch_indices = [i for i, batch_name in enumerate(batch_labels) if batch_name == batch]
    # Summarize only the first technical latent dimension for each batch
    batch_mean = np.mean([technical_latents[i][0] for i in batch_indices])
    print(f"{batch} mean of technical latent dimension 0: {batch_mean}")
```

## Important notes

### Counts vector length

The counts vector for each sample must match the model's expected number of genes. If the length does not match, the API returns a validation error.

Use `get_example_query()` to see the expected structure.
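
A local length check can catch this before a request is sent. The sketch below assumes the counts vector in the example query has the expected length, which this page does not state explicitly. A small stand-in dictionary replaces the real example query so the snippet runs offline, and `find_length_mismatches` is a hypothetical helper.

```python
def find_length_mismatches(inputs, expected_length):
    """Return the indices of samples whose counts vector has the wrong length."""
    return [
        index
        for index, sample in enumerate(inputs)
        if len(sample["counts"]) != expected_length
    ]


# Stand-in for an example query (assumption); a real one would hold far more genes
example_query = {"inputs": [{"counts": [0, 0, 0, 0]}]}
expected_length = len(example_query["inputs"][0]["counts"])

my_inputs = [
    {"counts": [5, 0, 2, 9]},
    {"counts": [1, 4, 3]},  # one gene short
]
bad_samples = find_length_mismatches(my_inputs, expected_length)
print(bad_samples)
```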

### Gene order

Your counts must follow the gene order the model expects, which is the same order the baseline model uses. You can retrieve this order from the `gene_order` field of any prediction result.
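
If your counts are keyed by gene identifier, reordering them is a simple lookup. The sketch below is a minimal illustration: the gene identifiers are made up, `align_counts` is a hypothetical helper, and raising an error for missing genes (rather than filling in a value) is a choice made here, not documented API behavior.

```python
def align_counts(counts_by_gene, gene_order):
    """Return counts as a list ordered to match gene_order."""
    missing = [gene for gene in gene_order if gene not in counts_by_gene]
    if missing:
        raise KeyError(f"{len(missing)} expected genes are missing, e.g. {missing[0]}")
    # Genes present in counts_by_gene but absent from gene_order are dropped
    return [counts_by_gene[gene] for gene in gene_order]


# Made-up gene identifiers and counts (assumption)
gene_order = ["GENE_A", "GENE_B", "GENE_C"]
my_counts = {"GENE_B": 12, "GENE_C": 0, "GENE_A": 5, "GENE_X": 99}

aligned = align_counts(my_counts, gene_order)
print(aligned)
```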

### Non-negative counts

All count values must be non-negative integers. Floats that are whole numbers, such as `10.0`, are accepted, but negative values cause validation errors.
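
These rules can be checked locally before a request is sent. The sketch below works on plain Python numbers, such as those produced by `.tolist()`, and converts whole-number floats to integers. `normalize_counts` is a hypothetical helper, and its error messages are not those returned by the API.

```python
def normalize_counts(counts):
    """Return counts as ints, raising if any value is not a non-negative whole number."""
    cleaned = []
    for index, value in enumerate(counts):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"counts[{index}] is not a number: {value!r}")
        if value < 0:
            raise ValueError(f"counts[{index}] is negative: {value!r}")
        # is_integer() is False for fractional values, NaN, and infinity
        if isinstance(value, float) and not value.is_integer():
            raise ValueError(f"counts[{index}] is not a whole number: {value!r}")
        cleaned.append(int(value))
    return cleaned


normalized = normalize_counts([0, 12, 10.0, 7])
print(normalized)
```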
