Week 1 and 2 Notes

Cancer cell lines simulate cancer biology in vitro, supporting target validation, biomarker discovery, and screening of novel therapeutics

Drug response prediction through relating omics data and drug sensitivity, i.e., identify/develop

Given cancer’s high heterogeneity, CCLE’s detailed annotations allow researchers to perform more granular analyses by tissue type or molecular subtype

Even within the same dataset, different samples (columns) may have different sequencing depths. A higher total count does not mean a sample expressed more genes; it just had more reads sequenced

We need to normalize for depth. The DESeq2 package estimates a size factor for each sample and uses it to normalize for sequencing depth
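To make the size-factor idea concrete, here is a minimal sketch of median-of-ratios normalization, the approach behind DESeq2's default size-factor estimation. This is not the package's code: the function name size_factors and the toy matrix are made up for illustration, and it assumes at least one gene has nonzero counts in every sample.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (genes), each a list of per-sample integer counts.
    Returns one size factor per sample (column).
    """
    n_samples = len(counts[0])
    # Keep only genes with nonzero counts in every sample so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    # Log of each usable gene's geometric mean across samples (a pseudo-reference sample).
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Ratio of this sample to the reference, gene by gene, on the log scale.
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical toy matrix: sample 2 is sample 1 sequenced at twice the depth.
toy_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # contains a zero, so it is excluded from the size-factor estimate
]
factors = size_factors(toy_counts)
# Dividing each column by its size factor puts the samples on a common scale.
normalized = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
```

In this toy case the second sample's size factor is twice the first, so the first three genes have identical normalized counts in both samples, which is the behavior we want when the only difference is depth.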

For differential expression, DESeq2 models the mean-variance relationship of the counts using a Negative Binomial distribution and estimates gene-specific dispersion.
For visualization or clustering, DESeq2 provides variance-stabilized counts (via vst() or rlog()), which reduce the dependence between mean and variance.
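As a small illustration of the mean-variance relationship under the Negative Binomial model, the sketch below uses the parameterization variance = mean + dispersion * mean^2. The function names and the dispersion value 0.1 are illustrative assumptions, not values from any dataset.

```python
def nb_variance(mean, dispersion):
    """Variance of a Negative Binomial count with the given mean and dispersion."""
    return mean + dispersion * mean ** 2

def nb_cv_squared(mean, dispersion):
    """Squared coefficient of variation: 1/mean (shot noise) + dispersion."""
    return nb_variance(mean, dispersion) / mean ** 2

# Hypothetical dispersion shared by three genes with different expression levels.
alpha = 0.1
example_means = [10, 100, 1000]
example_variances = [nb_variance(m, alpha) for m in example_means]
example_cv2 = [nb_cv_squared(m, alpha) for m in example_means]
```

With dispersion 0 the variance equals the mean (the Poisson case). With a positive dispersion the variance grows roughly quadratically for highly expressed genes, while the squared coefficient of variation levels off near the dispersion.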

Raw counts are integers whose variance depends on the mean, which violates the assumptions of many downstream statistical methods we might want to apply to the data; for instance, linear regression assumes homoscedasticity
VST transforms the counts onto a scale where the variance is approximately independent of the mean gene expression value, which makes it easier to perform PCA and clustering.
This also helps enhance our ability to compare across samples, since different samples might be biologically the same but simply have different sequencing depths. Note, however, that library size normalization occurs before VST
Note that VST-transformed counts should not be used for DE analyses: VST distorts the count distribution and is not designed to preserve the statistical properties needed for accurate p-values or fold changes. More specifically, VST compresses the range of expression values, especially high gene expression values, which makes true fold changes less clear.
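To see what "variance independent of the mean" looks like, here is a toy sketch using a stand-in: Poisson counts and a square-root transform, a textbook variance-stabilizing pair. This is not what DESeq2's vst() computes (that transform is built for Negative Binomial counts with estimated dispersions); the helper poisson_moments and the chosen means are illustrative assumptions.

```python
import math

def poisson_moments(mu, transform):
    """Exact mean and variance of transform(X) for X ~ Poisson(mu), by summing the pmf."""
    upper = int(mu + 12 * math.sqrt(mu) + 50)  # probability mass beyond this is negligible
    m1 = 0.0
    m2 = 0.0
    for k in range(upper + 1):
        # Poisson pmf computed on the log scale to avoid overflow in k!.
        p = math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
        y = transform(k)
        m1 += p * y
        m2 += p * y * y
    return m1, m2 - m1 * m1

means = [20, 100, 500]
# On the raw scale the variance tracks the mean.
raw_var = [poisson_moments(mu, float)[1] for mu in means]
# After the square root the variance is roughly constant (about 0.25) regardless of the mean.
sqrt_var = [poisson_moments(mu, math.sqrt)[1] for mu in means]
```

The same computation also hints at why transformed values are a poor basis for fold changes: differences on the transformed scale no longer correspond to ratios of the original counts.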

WIP: RNA-Seq Data Analysis On-Ramp Notes