WIP: RNA-Seq Data Analysis On-Ramp Notes

💡 This is a work-in-progress article/notes

Week 1 and 2 Notes

RNA-Seq Dataset features

  • Matrix of raw counts
    • Rows: genes
    • Columns: samples
    • Values: number of reads that correspond to the gene in a given sample
  • Metadata
    • Rows: samples
    • Columns: descriptive variables
  • Alignment between count matrix and metadata
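    • Filter and process the metadata to find the subset of interest
    • Subset and reorder the count-matrix columns so that their sample IDs match the rows of the filtered metadata, in the same order
    • Do not filter the two tables separately; derive the count-matrix columns from the filtered metadata so the two cannot drift out of sync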
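
The alignment steps above can be sketched with pandas. This is a minimal illustration, not a fixed recipe: the toy tables, the sample IDs (S1 to S4), and the "tissue" and "batch" columns are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: a genes x samples count matrix ...
counts = pd.DataFrame(
    {"S1": [10, 0, 5], "S2": [20, 3, 8], "S3": [15, 1, 2], "S4": [7, 9, 4]},
    index=["GeneA", "GeneB", "GeneC"],
)
# ... and per-sample metadata, deliberately stored in a different sample order.
metadata = pd.DataFrame(
    {"tissue": ["lung", "breast", "lung", "lung"], "batch": [2, 1, 1, 1]},
    index=["S4", "S2", "S1", "S3"],
)

# 1. Filter the metadata to the subset of interest.
meta_sub = metadata[metadata["tissue"] == "lung"]

# 2. Use the filtered metadata's index to subset AND reorder the count columns,
#    instead of filtering the count matrix on its own.
counts_sub = counts.loc[:, meta_sub.index]

# 3. Check that columns and metadata rows line up one-to-one before any analysis.
assert list(counts_sub.columns) == list(meta_sub.index)
print(counts_sub)
```

Because the count columns are selected from the filtered metadata's index, a sample ID that is missing from the count matrix raises a KeyError instead of silently producing misaligned tables.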

CCLE (Cancer Cell Line Encyclopedia) database

  • Cell lines model cancer biology in vitro for target validation, biomarker discovery, and screening of novel therapeutics
    • Drug response prediction by relating omics data to drug sensitivity, e.g., to identify or develop:
      • Predictive models for drug efficacy
      • Mechanisms of resistance
      • Patient subgroups
  • Provides tissue and lineage context
    • Given cancer’s high heterogeneity, CCLE’s detailed annotations allow researchers to perform more granular analyses by tissue type or molecular subtype

Sequencing Depth

  • Even in the same dataset, different samples (columns) may have different sequencing depths; a higher total count does not mean that sample expressed its genes more strongly, it just had more reads sequenced
    • We need to normalize for depth; the DESeq2 package estimates a size factor for each sample and divides that sample's counts by it

Source: https://www.biostars.org/p/480419/

image
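
To make the size-factor idea concrete, here is a simplified NumPy sketch of the median-of-ratios approach that DESeq2 uses by default. It is a re-implementation for intuition only, not DESeq2's code; the function name size_factors and the toy counts are assumptions made for this example.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples array of raw counts."""
    counts = np.asarray(counts, dtype=float)
    # Only genes with a nonzero count in every sample have a usable geometric mean.
    keep = (counts > 0).all(axis=1)
    log_counts = np.log(counts[keep])
    # Per-gene log geometric mean across samples acts as a pseudo-reference sample.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median (over genes) of each sample's ratio to the reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data: the second sample was sequenced twice as deeply as the first.
counts = np.array([
    [100, 200],
    [ 50, 100],
    [ 30,  60],
    [  0,  10],  # excluded from the estimate because of the zero
])
sf = size_factors(counts)
normalized = counts / sf  # divide each sample (column) by its size factor
print(sf)
print(normalized)
```

After dividing by the size factors, the first three genes have equal normalized counts in both samples, which is the expected result when the only difference between the samples is depth.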

Heteroskedasticity

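  • In RNA-Seq count data the variance of a gene depends on its mean (heteroskedasticity): genes with higher mean counts also show larger variance.
  • For differential expression, DESeq2 models this mean-variance relationship using a Negative Binomial distribution and estimates gene-specific dispersion.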
  • For visualization or clustering, DESeq2 provides variance-stabilized counts (via vst() or rlog()), which reduce the dependence between mean and variance.
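
The mean-variance relationship can be illustrated with a small simulation. Under the Negative Binomial parameterization by mean mu and dispersion alpha, the variance is mu + alpha * mu^2, so it grows faster than the mean. The sketch below assumes a single hypothetical dispersion value (0.1) for all simulated genes; the helper names are made up for this example.

```python
import numpy as np

def nb_variance(mu, alpha):
    """Negative Binomial variance for mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2

def simulate_nb(mu, alpha, size, rng):
    """Draw NB counts with mean mu and dispersion alpha using NumPy's (n, p) form."""
    n = 1.0 / alpha      # NumPy's "number of successes" parameter
    p = n / (n + mu)     # success probability that yields mean mu
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(0)
alpha = 0.1  # hypothetical dispersion, shared by all simulated genes
for mu in (10, 100, 1000):
    draws = simulate_nb(mu, alpha, size=200_000, rng=rng)
    print(f"mean={mu:5d}  empirical var={draws.var():10.1f}  "
          f"mu + alpha*mu^2={nb_variance(mu, alpha):10.1f}  Poisson var would be {mu}")
```

The empirical variance should track mu + alpha * mu^2 and sit well above the Poisson value (variance equal to the mean) for highly expressed genes.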

VST-transformed data vs. Raw Counts

  • Raw counts are integers whose variance depends on the mean, which violates the assumptions of many downstream statistical methods we might want to apply to the data; for instance, ordinary least-squares linear regression assumes homoscedasticity
  • VST transforms the counts onto a scale where the variance is approximately independent of the mean gene expression value, which makes it easier to perform PCA and clustering.
  • This also helps us compare across samples, since different samples might be biologically the same but simply have different sequencing depths; note, however, that library size normalization occurs before VST
  • Note that VST-transformed counts should not be used for DE analyses, because VST distorts the count distribution and is not designed to preserve the statistical properties necessary for accurate p-values or fold changes. More specifically, VST compresses the range of expression values, especially high gene expression values, and this renders true fold changes less clear.
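
The effect of variance stabilization can be sketched with a textbook stand-in. The code below does not reproduce DESeq2's vst() or rlog(); it uses the closed-form stabilizer for a Negative Binomial with a known, shared dispersion, which is an assumption made purely for illustration (as are the function name nb_vst and the dispersion value 0.1).

```python
import numpy as np

def nb_vst(x, alpha):
    """Closed-form variance stabilizer for variance = mu + alpha * mu**2."""
    x = np.asarray(x, dtype=float)
    return (2.0 / np.sqrt(alpha)) * np.arcsinh(np.sqrt(alpha * x))

rng = np.random.default_rng(0)
alpha = 0.1          # hypothetical dispersion, assumed known and shared
n = 1.0 / alpha      # NumPy's NB "number of successes" parameter

raw_sd, vst_sd = {}, {}
for mu in (100, 10_000):
    draws = rng.negative_binomial(n, n / (n + mu), size=100_000)
    raw_sd[mu] = draws.std()
    vst_sd[mu] = nb_vst(draws, alpha).std()
    print(f"mean={mu:6d}  raw sd={raw_sd[mu]:8.1f}  transformed sd={vst_sd[mu]:.2f}")

# The same 2-fold change gives different-sized steps on the transformed scale,
# so differences of transformed values are not fold changes.
low_step = nb_vst(10, alpha) - nb_vst(5, alpha)
high_step = nb_vst(10_000, alpha) - nb_vst(5_000, alpha)
print(f"2-fold step at low counts:  {low_step:.2f}")
print(f"2-fold step at high counts: {high_step:.2f}")
```

On the raw scale the standard deviation grows strongly with the mean, while on the transformed scale it stays roughly constant, which is what PCA and clustering benefit from. The unequal steps for identical 2-fold changes illustrate why transformed values are unsuitable as input for DE testing.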