Overview
Metadata prediction models infer biological metadata from observed expression data. Given a gene expression profile, the model predicts likely biological characteristics such as cell type, tissue, disease state, and more. This is useful when you want to:
- Annotate samples of unknown origin
- Validate sample labels against expression patterns
- Discover potentially mislabeled or contaminated samples
- Understand the biological characteristics captured in expression data
Available models
- gem-1-bulk_predict-metadata: Bulk RNA-seq metadata prediction model
- gem-1-sc_predict-metadata: Single-cell RNA-seq metadata prediction model
These endpoints may require 1 to 2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.
How it works
Metadata prediction encodes your expression data into the model’s latent space and then uses classifiers to predict the most likely metadata values for each sample. The model returns:
- Classifier probabilities: For each categorical metadata field, the probability distribution over possible values
- Predicted labels: The most likely value for each metadata field
- Latent representations: The biological, technical, and perturbation latent vectors
Creating a query
Metadata prediction queries are simpler than queries for other model types. You only need to provide expression counts:
- inputs: A list of count vectors, where each element is a dictionary with a counts field
- seed (optional): Random seed for reproducibility
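The query layout described above can be sketched as a plain dictionary. This is a minimal illustration, not the client library's own helper: build_metadata_query is a hypothetical name, and the four-element count vectors are toy data that are far shorter than a real gene list.

```python
from typing import Any, Optional, Sequence


def build_metadata_query(
    samples: Sequence[Sequence[int]], seed: Optional[int] = None
) -> dict[str, Any]:
    """Wrap raw count vectors in the inputs/seed layout described above."""
    query: dict[str, Any] = {
        # One dictionary per sample, each holding its counts vector.
        "inputs": [{"counts": list(counts)} for counts in samples]
    }
    if seed is not None:
        # seed is optional, so it is only included when the caller sets one.
        query["seed"] = seed
    return query


# Toy data: real vectors must be as long as the model's gene list.
query = build_metadata_query([[0, 12, 5, 230], [3, 0, 44, 17]], seed=42)
print(query)
```

A single-sample query has the same shape, with one element in the inputs list.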
Example: predicting sample metadata
A complete example predicting metadata for expression samples:
Example: single sample prediction
For predicting metadata of a single sample:
Query parameters
inputs (list, required)
A list of expression count vectors. Each element should be a dictionary containing:
counts: A list of non-negative integers representing gene expression counts
seed (int, optional)
Random seed for reproducibility.
Understanding the results
The results from metadata prediction are returned as a list of output dictionaries, one per input sample. Each output dictionary contains:
- metadata: Predicted metadata values for the sample
- classifier_probs: Probability distributions over possible values for each metadata field
- latents: Latent representations capturing biological, technical, and perturbation information
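To make the three fields concrete, the sketch below condenses one output dictionary into its predicted labels, the top probability per field, and the length of each latent vector. The nesting is an assumption: classifier_probs is taken to map each field to a label-to-probability dictionary, and latents to map a name to a list of floats. The example_output values are hand-written stand-ins, not real model output, and summarize_output is a hypothetical helper.

```python
from typing import Any


def summarize_output(output: dict[str, Any]) -> dict[str, Any]:
    """Condense one output dictionary into labels, confidences, and latent sizes."""
    confidence = {
        # Highest probability assigned to any value of this field.
        field: max(probs.values())
        for field, probs in output["classifier_probs"].items()
    }
    latent_sizes = {name: len(vector) for name, vector in output["latents"].items()}
    return {
        "metadata": output["metadata"],
        "confidence": confidence,
        "latent_sizes": latent_sizes,
    }


# Hand-written stand-in for one element of the returned list (assumed layout).
example_output = {
    "metadata": {"tissue": "liver", "disease": "healthy"},
    "classifier_probs": {
        "tissue": {"liver": 0.91, "kidney": 0.06, "lung": 0.03},
        "disease": {"healthy": 0.55, "hepatitis": 0.45},
    },
    "latents": {
        "biological": [0.1, -0.4, 0.7],
        "technical": [0.0, 0.2],
        "perturbation": [0.3],
    },
}

summary = summarize_output(example_output)
print(summary["confidence"])
```

A top probability close to that of the runner-up, as in the disease field here, is a sign that the predicted label should be treated with caution.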
Predicted metadata
Each output’s metadata field contains the predicted values for that sample:
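As a rough sketch of how these predictions might be collected, the code below pairs each output with a caller-supplied sample ID and tallies one field. The field names tissue and cell_type, the sample IDs, and the metadata_rows helper are illustrative assumptions; the outputs list is hand-written rather than returned by the API.

```python
from collections import Counter
from typing import Any


def metadata_rows(
    outputs: list[dict[str, Any]], sample_ids: list[str]
) -> list[dict[str, Any]]:
    """Flatten each output's metadata into one row per sample."""
    if len(outputs) != len(sample_ids):
        raise ValueError("expected one output per sample id")
    return [
        {"sample_id": sample_id, **output["metadata"]}
        for sample_id, output in zip(sample_ids, outputs)
    ]


# Hand-written outputs with hypothetical field names.
outputs = [
    {"metadata": {"tissue": "liver", "cell_type": "hepatocyte"}},
    {"metadata": {"tissue": "liver", "cell_type": "Kupffer cell"}},
    {"metadata": {"tissue": "lung", "cell_type": "fibroblast"}},
]

rows = metadata_rows(outputs, ["s1", "s2", "s3"])
tissue_counts = Counter(row["tissue"] for row in rows)
print(rows[0])
print(tissue_counts.most_common())
```

Rows in this shape can be written to a CSV file or compared against existing sample labels.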
Classifier probabilities
For categorical metadata fields, the model returns probability distributions over all possible values. These are useful for understanding prediction confidence:
Latent representations
The model also returns latent vectors that capture biological, technical, and perturbation characteristics:
Use cases
Sample annotation
Annotate unlabeled samples with predicted metadata:
Quality control
Validate existing sample labels against predicted metadata:
Batch characterization
Understand batch-specific technical characteristics:
Important notes
Counts vector length
The counts vector for each sample must match the model’s expected number of genes. If the length does not match, the API returns a validation error. Use get_example_query() to see the expected structure.
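A local length check before submitting can catch this early. The sketch below is a hypothetical helper, not part of the client library. It takes the expected gene count as a parameter; one plausible way to obtain that number, stated here as an assumption, is the length of the counts vector in an example query.

```python
from typing import Any


def find_length_mismatches(
    inputs: list[dict[str, Any]], expected_genes: int
) -> list[int]:
    """Return the positions of samples whose counts vector has the wrong length."""
    return [
        index
        for index, sample in enumerate(inputs)
        if len(sample["counts"]) != expected_genes
    ]


# Toy inputs: the second sample is one gene short.
inputs = [{"counts": [1, 0, 7, 2]}, {"counts": [4, 4, 9]}]
bad_samples = find_length_mismatches(inputs, expected_genes=4)
print(bad_samples)
```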
Gene order
Your counts must follow the gene order the model expects, which is the same order the baseline model uses. You can retrieve this order from the gene_order field of any prediction result.
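If your counts are keyed by gene identifier, they can be rearranged into the expected order before building the query. This is a minimal sketch: align_counts and the GENE_* identifiers are hypothetical, and filling absent genes with zero is an assumption of the sketch rather than documented behavior, so the missing genes are reported for review.

```python
def align_counts(
    counts_by_gene: dict[str, int], gene_order: list[str], fill: int = 0
) -> tuple[list[int], list[str]]:
    """Order counts to match gene_order and report genes that had no value."""
    missing = [gene for gene in gene_order if gene not in counts_by_gene]
    # Genes outside gene_order are dropped; absent genes get the fill value.
    counts = [counts_by_gene.get(gene, fill) for gene in gene_order]
    return counts, missing


# Placeholder gene identifiers; a real gene_order comes from a prediction result.
gene_order = ["GENE_A", "GENE_B", "GENE_C"]
my_counts = {"GENE_B": 5, "GENE_A": 2, "GENE_X": 9}

counts, missing = align_counts(my_counts, gene_order)
print(counts, missing)
```

A long missing list usually means the identifiers use a different naming scheme than the model's gene order, which is worth resolving before relying on the predictions.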
Non-negative counts
All count values must be non-negative integers. Floats that are whole numbers, such as 10.0, are accepted, but negative values cause validation errors.
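The same rule can be applied locally before sending a query. The sketch below mirrors the rule as stated here and is not the API's own validator; normalize_counts is a hypothetical helper that converts whole-number floats to integers and rejects everything else.

```python
from typing import Sequence, Union


def normalize_counts(counts: Sequence[Union[int, float]]) -> list[int]:
    """Convert whole-number values to int and reject invalid counts."""
    cleaned: list[int] = []
    for position, value in enumerate(counts):
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"count at position {position} is not a number: {value!r}")
        # is_integer() is False for fractional values, NaN, and infinity.
        if isinstance(value, float) and not value.is_integer():
            raise ValueError(f"count at position {position} is not a whole number: {value}")
        if value < 0:
            raise ValueError(f"count at position {position} is negative: {value}")
        cleaned.append(int(value))
    return cleaned


print(normalize_counts([0, 10.0, 3]))
```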