Synthesize Bio uses public RNA-seq datasets to support model training and platform workflows. This page summarizes how those datasets are selected, processed, and quality-checked before they appear in the platform.
Data sources
All bulk RNA-seq data is sourced from the NCBI Sequence Read Archive (SRA). We select samples that meet these criteria:
- Illumina sequencing platform
- Transcriptomic library source
- Predicted bulk RNA-seq, not single-cell
- No fractionation library selection
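As a rough illustration, the criteria above could be applied to SRA run metadata as follows. This is a minimal sketch, not the selection code Synthesize Bio runs. The `Platform`, `LibrarySource`, and `LibrarySelection` keys assume the column names of an SRA RunInfo export, `predicted_bulk` is a hypothetical field standing in for the bulk versus single-cell prediction, and treating any selection value that contains "fractionation" as excluded is an assumption.

```python
def meets_selection_criteria(record: dict) -> bool:
    """Return True if one SRA metadata record passes all four criteria."""
    platform = record.get("Platform", "").upper()
    source = record.get("LibrarySource", "").upper()
    selection = record.get("LibrarySelection", "").lower()
    return (
        platform == "ILLUMINA"                     # Illumina sequencing platform
        and source == "TRANSCRIPTOMIC"             # transcriptomic library source
        and record.get("predicted_bulk") is True   # hypothetical bulk prediction
        and "fractionation" not in selection       # no fractionation selection
    )


# Made-up records for illustration only; the run names are placeholders.
records = [
    {"Run": "run_a", "Platform": "ILLUMINA", "LibrarySource": "TRANSCRIPTOMIC",
     "LibrarySelection": "cDNA", "predicted_bulk": True},
    {"Run": "run_b", "Platform": "OXFORD_NANOPORE", "LibrarySource": "TRANSCRIPTOMIC",
     "LibrarySelection": "cDNA", "predicted_bulk": True},
    {"Run": "run_c", "Platform": "ILLUMINA", "LibrarySource": "TRANSCRIPTOMIC",
     "LibrarySelection": "size fractionation", "predicted_bulk": True},
    {"Run": "run_d", "Platform": "ILLUMINA", "LibrarySource": "TRANSCRIPTOMIC",
     "LibrarySelection": "cDNA", "predicted_bulk": False},
]

selected = [r["Run"] for r in records if meets_selection_criteria(r)]
print(selected)
```

Only the first record passes: the others fail on platform, library selection, and the bulk prediction respectively.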
Expression data processing
Synthesize Bio uses pseudoalignment to quantify transcript-level and gene-level abundance from FASTQ files. When multiple runs correspond to the same sample, we combine the FASTQ files before processing. Reads are trimmed for adapters and sequence quality using fastp. Trimmed reads are quantified to genes and transcripts with kallisto v0.50.1 against the GRCh38 human or GRCm39 mouse reference genome and the Ensembl r111 transcriptome. Estimated transcript counts are then summed to the gene level based on the transcriptome definition.
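The steps above can be sketched in Python. This is an illustrative outline under stated assumptions, not the production pipeline: it assumes paired-end reads, a prebuilt kallisto index, and kallisto's `abundance.tsv` output with `target_id` and `est_counts` columns. The file names and the `tx2gene` mapping are hypothetical, and the command builders only assemble argument lists without running anything.

```python
import csv
import io
from collections import defaultdict


def fastp_command(r1: str, r2: str, out1: str, out2: str, report: str) -> list[str]:
    """Build a paired-end fastp call that trims reads and writes a JSON report."""
    return ["fastp", "-i", r1, "-I", r2, "-o", out1, "-O", out2, "--json", report]


def kallisto_command(index: str, out_dir: str, r1: str, r2: str) -> list[str]:
    """Build a paired-end kallisto quant call against a prebuilt index."""
    return ["kallisto", "quant", "-i", index, "-o", out_dir, r1, r2]


def sum_to_genes(abundance_text: str, tx2gene: dict[str, str]) -> dict[str, float]:
    """Sum estimated transcript counts to the gene level.

    abundance_text is the tab-separated content of an abundance.tsv file.
    tx2gene maps each transcript ID to its gene ID (hypothetical mapping;
    a transcript missing from it raises KeyError).
    """
    gene_counts: dict[str, float] = defaultdict(float)
    reader = csv.DictReader(io.StringIO(abundance_text), delimiter="\t")
    for row in reader:
        gene_counts[tx2gene[row["target_id"]]] += float(row["est_counts"])
    return dict(gene_counts)


# Toy input with placeholder transcript and gene IDs.
abundance = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1000\t800\t10.5\t1.0\n"
    "tx2\t500\t300\t4.5\t2.0\n"
    "tx3\t700\t500\t7\t3.0\n"
)
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

trim_cmd = fastp_command("s_1.fastq.gz", "s_2.fastq.gz",
                         "trim_1.fastq.gz", "trim_2.fastq.gz", "fastp.json")
quant_cmd = kallisto_command("transcriptome.idx", "quant_out",
                             "trim_1.fastq.gz", "trim_2.fastq.gz")
gene_counts = sum_to_genes(abundance, tx2gene)
print(gene_counts)
```

The argument lists could be passed to `subprocess.run` on a machine where fastp and kallisto are installed. Single-end data would need different arguments for both tools.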
Quality control
We use a simple quality-control process across RNA-seq processing stages:
- Use fastp to evaluate read quality, trim reads, and capture the duplication rate.
- Calculate the percent pseudoaligned from kallisto.
- Calculate the total number of genes with counts greater than 0.
Samples are flagged on these thresholds:
- Percent aligned reads below 50%
- Duplication rate above 80%
- Fewer than 10,000 genes with counts greater than 0
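The three thresholds above can be written as a small check. This is a minimal sketch: the function names and flag labels are hypothetical, and it assumes the alignment and duplication values have already been extracted from the kallisto and fastp outputs and are expressed as percentages from 0 to 100.

```python
def count_detected_genes(gene_counts: dict[str, float]) -> int:
    """Count genes whose summed count is greater than 0."""
    return sum(1 for count in gene_counts.values() if count > 0)


def quality_flags(percent_aligned: float,
                  duplication_percent: float,
                  genes_detected: int) -> list[str]:
    """Return the quality flags raised for one sample (labels are illustrative)."""
    flags = []
    if percent_aligned < 50.0:        # percent aligned reads below 50%
        flags.append("low_alignment")
    if duplication_percent > 80.0:    # duplication rate above 80%
        flags.append("high_duplication")
    if genes_detected < 10_000:       # fewer than 10,000 genes with counts > 0
        flags.append("low_gene_count")
    return flags


clean_sample = quality_flags(92.3, 35.0, 18_500)
poor_sample = quality_flags(41.0, 85.5, 7_200)
print(clean_sample)
print(poor_sample)
```

Each comparison is strict, so a sample sitting exactly on a threshold (for example, 50% aligned) is not flagged.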
How to interpret quality flags
The best cutoffs depend on the RNA-seq protocol and organism. Total RNA, poly(A)-selected RNA, stranded protocols, and non-stranded protocols can have different expected quality profiles. High duplication rates may indicate over-amplification during PCR steps or poor initial RNA input quality. Higher duplication rates may still be acceptable for some protocols, such as low-input workflows. The number of detected non-zero genes depends on sample complexity, tissue or cell type, and sequencing depth.
Metadata curation
Sample-level annotation is curated through a semi-automated Synthesize Bio process. We curate 15 fields, such as tissue, sex, and disease, and map values to ontologies when available. We do not manually review every harmonized metadata result. Review metadata for accuracy when you need high confidence in a specific sample or study.
References
- fastp
- Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107
- Ensembl
- kallisto
- NL Bray, H Pimentel, P Melsted, and L Pachter. 2016. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34, 525-527. https://www.nature.com/articles/nbt.3519