This is a work-in-progress article/notes
Week 1 and 2 Notes
RNA-Seq Dataset features
- Matrix of raw counts
  - Rows: genes
  - Columns: samples
  - Values: number of reads assigned to the gene in a given sample
- Metadata
  - Rows: samples
  - Columns: descriptive variables
- Alignment between count matrix and metadata
  - Filter and process the metadata to find the subset of interest
  - Subset and reorder the count-matrix columns so that their sample IDs match the rows of the filtered metadata, in the same order
  - Do not filter the count matrix and the metadata separately, or samples and their annotations can fall out of sync
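The alignment steps above can be sketched in plain Python. This is only an illustration: the gene names, sample IDs, tissue labels, and the helper `align` are all made up for the example, and a real analysis would more likely use data frames. The point is that the metadata is filtered first and the count matrix is then subset and reordered from that filtered metadata, never independently.

```python
# Hypothetical toy data; all names and values are invented for illustration.
counts = {
    "sample_ids": ["S1", "S2", "S3", "S4"],
    "genes": {
        "GENE_A": [10, 0, 25, 7],
        "GENE_B": [3, 12, 0, 40],
    },
}
metadata = [
    {"sample_id": "S3", "tissue": "lung"},
    {"sample_id": "S1", "tissue": "breast"},
    {"sample_id": "S4", "tissue": "lung"},
    {"sample_id": "S2", "tissue": "skin"},
]


def align(counts, metadata, keep):
    """Filter metadata with `keep`, then subset/reorder count columns to match."""
    # 1. Filter the metadata to the subset of interest.
    kept_rows = [row for row in metadata if keep(row)]
    wanted_ids = [row["sample_id"] for row in kept_rows]

    # 2. Refuse to continue if a metadata sample has no column in the matrix.
    column_of = {sid: i for i, sid in enumerate(counts["sample_ids"])}
    missing = [sid for sid in wanted_ids if sid not in column_of]
    if missing:
        raise KeyError(f"samples missing from count matrix: {missing}")

    # 3. Subset and reorder the columns to follow the filtered metadata order.
    aligned = {
        "sample_ids": wanted_ids,
        "genes": {
            gene: [values[column_of[sid]] for sid in wanted_ids]
            for gene, values in counts["genes"].items()
        },
    }
    return aligned, kept_rows


lung_counts, lung_meta = align(counts, metadata, lambda row: row["tissue"] == "lung")

# Column i of the count matrix now describes the same sample as metadata row i.
assert lung_counts["sample_ids"] == [row["sample_id"] for row in lung_meta]
```

The final assertion is the check worth keeping in any real pipeline: sample IDs in the matrix columns and in the metadata rows should be identical and in the same order before any modeling starts.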
CCLE database
- Cell lines model cancer biology in vitro for target validation, biomarker discovery, and screening of novel therapeutics
- Drug response prediction by relating omics data to drug sensitivity, i.e., identifying or developing
  - Predictive models for drug efficacy
  - Mechanisms of resistance
  - Patient subgroups
- Provides tissue and lineage context
  - Given cancer’s high heterogeneity, CCLE’s detailed annotations allow researchers to perform more granular analyses by tissue type or molecular subtype
Sequencing Depth
- Even within the same dataset, different samples (columns) may have different sequencing depths. Higher counts in one sample do not by themselves mean higher expression; that sample may simply have had more reads sequenced
- We need to normalize for depth; the DESeq2 package estimates a size factor for each sample to do this
Source: https://www.biostars.org/p/480419/
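To make the idea of a per-sample size factor concrete, here is a minimal sketch of the median-of-ratios approach, which is the method DESeq2 is generally described as using by default. It is not DESeq2's code: the function name `size_factors` and the toy counts are assumptions for illustration, and the real package handles more edge cases.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of gene rows, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])

    # Keep only genes with a nonzero count in every sample, so that the
    # log of the geometric mean is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("every gene has at least one zero count")

    # Log geometric mean of each usable gene across samples
    # (acts as a pseudo-reference sample).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the pseudo-reference, per gene.
        log_ratios = [
            math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)
        ]
        # The median is robust to a minority of strongly changing genes.
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical toy matrix: sample 2 was sequenced about twice as deeply.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [0, 5],  # contains a zero, so it is skipped when estimating the factors
]
sf = size_factors(counts)

# Dividing each column by its size factor puts samples on a common scale.
normalized = [[c / s for c, s in zip(row, sf)] for row in counts]
```

In this toy case the second sample's size factor comes out twice the first, and after division the first three genes have equal normalized values in both samples, which matches the intuition that the only difference between the columns was depth.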
Heteroskedasticity
- For differential expression, DESeq2 models the mean-variance relationship of count data using a Negative Binomial distribution and estimates a gene-specific dispersion.
- For visualization or clustering, DESeq2 provides variance-stabilized counts (via vst() or rlog()), which reduce the dependence between mean and variance.
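The mean-variance relationship can be written out directly. Under the Negative Binomial parameterization used by DESeq2, a count with mean mu and dispersion alpha has variance mu + alpha * mu^2. The short sketch below just evaluates that formula; the dispersion value of 0.1 and the function name `nb_variance` are arbitrary choices for illustration.

```python
def nb_variance(mean, dispersion):
    """Variance of a Negative Binomial count: mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


dispersion = 0.1  # hypothetical value, chosen only for the example

for mean in (10, 100, 1000):
    variance = nb_variance(mean, dispersion)
    # For Poisson data this ratio would stay at 1; here it grows with the mean.
    print(f"mean={mean:>5}  variance={variance:>10.1f}  variance/mean={variance / mean:.1f}")
```

The variance-to-mean ratio grows as the mean grows, which is exactly the heteroskedasticity these notes refer to: highly expressed genes have much larger raw variance than lowly expressed ones.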
VST-transformed data vs. Raw Counts
- Raw counts are integers whose variance depends on the mean, which violates the assumptions of many downstream statistical methods we might want to apply to the data; for instance, linear regression assumes homoscedasticity
- VST transforms the counts onto a scale where the variance is approximately independent of the mean expression value, which makes it easier to perform PCA and clustering.
- This also improves our ability to compare across samples, since different samples might be biologically similar but simply have different sequencing depths; note, however, that library size normalization occurs before VST
- Note that VST-transformed counts should not be used for DE analyses, because VST distorts the count distribution and is not designed to preserve the statistical properties necessary for accurate p-values or fold changes. More specifically, VST compresses the range of expression values, especially for highly expressed genes, and this renders true fold changes less clear.
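What "variance independent of the mean" means can be checked numerically on a simplified case. The sketch below assumes a Negative Binomial with one fixed dispersion shared by all genes, for which a textbook closed-form variance-stabilizing transform exists (based on asinh). This is not what vst() computes internally, since DESeq2 works from a fitted dispersion trend; the function names and the dispersion value are assumptions for illustration only. The delta method approximates the variance after transformation as f'(mean)^2 * Var(X).

```python
import math


def nb_variance(mean, dispersion):
    """Raw-scale Negative Binomial variance: mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


def vst_fixed_dispersion(x, dispersion):
    """Closed-form variance-stabilizing transform for a single fixed dispersion."""
    return 2.0 / math.sqrt(dispersion) * math.asinh(math.sqrt(dispersion * x))


def approx_transformed_variance(mean, dispersion, h=1e-3):
    """Delta-method variance after the transform: f'(mean)**2 * Var(X)."""
    # Central finite difference for the slope of the transform at `mean`.
    slope = (
        vst_fixed_dispersion(mean + h, dispersion)
        - vst_fixed_dispersion(mean - h, dispersion)
    ) / (2 * h)
    return slope ** 2 * nb_variance(mean, dispersion)


dispersion = 0.1  # hypothetical value
results = {
    mean: (nb_variance(mean, dispersion), approx_transformed_variance(mean, dispersion))
    for mean in (10, 100, 1000)
}

for mean, (raw_var, vst_var) in results.items():
    print(f"mean={mean:>5}  raw variance={raw_var:>10.1f}  transformed variance={vst_var:.4f}")
```

On the raw scale the variance spans several orders of magnitude across these means, while on the transformed scale it stays close to a constant. That flatness is what makes distance-based methods such as PCA and clustering behave better, and the same nonlinear squeezing of the scale is why the transformed values are unsuitable for computing fold changes.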