Bulk RNA-Seq Analysis for PhD Students

Introduction to RNA-Seq

Introduction to RNA-Seq Analysis in R

Dipartimento di Biomedicina e Prevenzione

Marco Chiapello, Revelo Datalab

November 2024

Pre-processing and Quality Control

Raw reads: The Starting Point

FASTA file

Definition

The FASTA format is a text-based format for representing either nucleotide sequences or amino acid sequences. Both nucleotides and amino acids are represented using single-letter codes. Each sequence is preceded by a header line that begins with the “>” symbol and contains the name of the sequence.

Common file extensions: .fasta, .fna, .ffn, .faa, .fa, .frn

>NM_001404729.1 Oryza sativa ribulose bisphosphate carboxylase small chain A
CTCAACAGCACTGCTACTGGACATACTCTACTACTACTAGCCAGTAAGCTAGCTAACTAACTACGTGGCT
ATGGCCCCCACCGTGATGGCCTCCTCGGCCACCTCCGTGGCTCCATTCCAAGGGCTCAANNNNNNNNNNN

Multi-fasta file:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK

>NM_001404729.1 Oryza sativa ribulose bisphosphate carboxylase small chain A
CTCAACAGCACTGCTACTGGACATACTCTACTACTACTAGCCAGTAAGCTAGCTAACTAACTACGTGGCT
ATGGCCCCCACCGTGATGGCCTCCTCGGCCACCTCCGTGGCTCCATTCCAAGGGCTCAANNNNNNNNNNN
>HF583486.1 Homo sapiens SOD1 gene for alternative protein SOD1, isolate 144496
ATGGATTCCATGTTCATGAGTTTGGAGATAATACAGCAGGCTGTACCAGTGCAGGTCCTCACTTTAATCC
TCTATCCAGAAAACACGGTGGGCCAAAGGATGAAGAGAGGCATGTTGGAGACTTGGGCAATGTGA
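
The header/sequence layout described above can be read with a few lines of Python. This is a minimal sketch, not a replacement for a dedicated library: the function name `parse_fasta` and the demo records are illustrative assumptions, and the identifier is taken to be the first whitespace-delimited token after “>”.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            # Assumption: the identifier is the first token after ">"
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequences may wrap over several lines; concatenate them
            records[current_id] += line
    return records


# Hypothetical two-record multi-FASTA input (one protein, one nucleotide)
example = """>seqA demo protein
MADQLTEEQ
IAEFKEAF

>seqB demo nucleotide
ATGGATTCC
""".splitlines()

records = parse_fasta(example)
print({name: len(seq) for name, seq in records.items()})
```

Keeping only the first token as the key mirrors how most tools refer to sequences, while the rest of the header line is treated as a free-text description.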

FASTQ format

Definition

The FASTQ format is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. Each sequence letter and its quality score are encoded with a single ASCII character for brevity.

Each record in a FASTQ file spans 4 lines:

Line 1: begins with the “@” symbol, followed by the sequence identifier (the counterpart of the “>” header line in FASTA)
Line 2: contains the nucleotide sequence
Line 3: begins with a “+” and may optionally repeat the sequence identifier or description
Line 4: contains the quality values for the individual nucleotides, one character per base

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
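
The four-line record structure can be checked programmatically. The sketch below assumes unwrapped records (exactly four lines each); `parse_fastq` and the demo reads are hypothetical names and data used only for illustration.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality string) for each 4-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = (
            text.rstrip("\n") for text in lines[i:i + 4]
        )
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        yield header[1:], sequence, quality


# Hypothetical input with two short reads
demo_lines = ["@read1", "GATTACA", "+", "IIIIIII",
              "@read2", "ACGT", "+", "FFFF"]
fastq_records = list(parse_fastq(demo_lines))
print(fastq_records)
```

The length check reflects the rule that line 4 carries exactly one quality character per base on line 2.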

What does “quality score” mean?

The quality score (Phred score, Q) expresses the estimated probability P that the corresponding base call is wrong: Q = -10 · log10(P). A score of Q20 corresponds to a 1% error probability, and Q30 to 0.1%.
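
As a worked example, the sketch below converts quality characters into Phred scores and error probabilities. It assumes the Phred+33 encoding (ASCII code minus 33), which is the common convention for current Illumina data; older files may use a different offset. The function names are illustrative.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


scores = phred_scores("!5I")  # '!' -> 0, '5' -> 20, 'I' -> 40
probabilities = [error_probability(q) for q in scores]
print(scores, probabilities)
```

A base with the character “!” therefore carries no information (error probability 1), while “I” corresponds to one expected error in 10,000 calls.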

FASTQ files can be very large, so in most cases you won’t be dealing with myFile.fastq, but with myFile.fastq.gz.
This means that the file is compressed, so it is not readable by a text editor without first being decompressed.
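
Compressed FASTQ files do not need to be unpacked on disk before being read. A minimal sketch using Python's standard `gzip` module follows; the demo file is created in a temporary directory purely for illustration, and `count_fastq_records` is a hypothetical helper that assumes four lines per record.

```python
import gzip
import os
import tempfile


def count_fastq_records(path: str) -> int:
    """Count records in a gzip-compressed FASTQ file, streaming line by line."""
    with gzip.open(path, "rt") as handle:  # "rt" = decompress and read as text
        return sum(1 for _ in handle) // 4


# Write a tiny throwaway myFile.fastq.gz so the example is self-contained
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "myFile.fastq.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\nIIII\n")

n_records = count_fastq_records(demo_path)
print(n_records)
```

Streaming this way keeps memory use flat even for files with hundreds of millions of reads.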

Pre-processing Tools

FastQC

What it is:
- A widely used tool for quality assessment of raw sequencing data.
What it does:
- Generates comprehensive reports with various quality metrics.
- Visualizes data quality with plots and graphs in an HTML report.
Key metrics:
- Per base sequence quality
- Per sequence quality scores
- Sequence length distribution
- GC content
- Adapter content
- Overrepresented sequences

UMI-tools extract

What are UMIs?
- Unique Molecular Identifiers (UMIs) are short random sequences attached to each cDNA molecule during library preparation.
- They act as “molecular barcodes” to identify unique molecules.
Why are UMIs important?
- UMIs help to distinguish PCR duplicates from true biological duplicates.
- This improves the accuracy of gene expression quantification, especially for low-abundance transcripts.

What does UMI-tools extract do?
- Identifies and extracts UMIs from sequencing reads.
- Moves UMIs to the read name for downstream analysis.
- Requires information about the UMI location in the read.
Benefits:
- Improved accuracy in gene expression quantification.
- Reduced bias due to PCR amplification.
  
Objective: Explain why UMIs resolve the PCR duplicate problem more accurately than methods based on genomic position.

- UMIs (Unique Molecular Identifiers) are short random sequences (6-12 nt) added to the cDNA during library preparation, before PCR amplification.
- Each cDNA molecule receives its own UMI: after PCR, all copies of the same original molecule carry the same UMI.
- Without UMIs, we cannot tell two identical reads that come from distinct molecules of the same transcript apart from PCR copies of a single molecule.
- UMI-tools extract moves the UMI sequence from the start of the read to the read name, so it can be used for deduplication downstream.
- This step is relevant only if your protocol includes UMIs (e.g. 10x Chromium, CEL-Seq2, certain Illumina kits for total RNA).
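
The extraction step can be illustrated with a short sketch. It assumes the UMI occupies the first `umi_length` bases of the read and that it is appended to the read identifier with an underscore; the function `extract_umi` and the demo read are hypothetical and do not reproduce the actual UMI-tools implementation or its pattern syntax.

```python
from typing import Tuple


def extract_umi(name: str, sequence: str, quality: str,
                umi_length: int) -> Tuple[str, str, str]:
    """Move the first `umi_length` bases of a read into its name."""
    umi = sequence[:umi_length]
    # Keep any comment after the first space (e.g. Illumina "1:N:0") intact
    read_id, _, comment = name.partition(" ")
    new_name = f"{read_id}_{umi}" + (f" {comment}" if comment else "")
    # The UMI bases and their qualities are removed from the read itself
    return new_name, sequence[umi_length:], quality[umi_length:]


extracted = extract_umi("read1 1:N:0", "ACGTAAGGCC", "IIIIFFFFFF", umi_length=4)
print(extracted)
```

After this step the aligner sees only the cDNA-derived part of the read, while the UMI travels with the read name into the BAM file.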

FastP & Trim Galore!

Purpose:
- To clean up raw sequencing data and improve the accuracy of downstream analysis.
Key functions:
- Adapter trimming: Remove adapter sequences from reads.
- Quality trimming: Trim low-quality bases from the ends of reads.
FastP:
- All-in-one tool with extensive pre-processing features.
- Includes quality filtering, error correction, UMI handling, and more.

Trim Galore!:
- Wrapper script for cutadapt (adapter trimming) and often includes FastQC (quality assessment).
- Focuses on adapter and quality trimming with optimized parameters.
Benefits:
- Improved alignment accuracy.
- Reduced false positives in variant calling and gene expression analysis.
- Increased efficiency of downstream analysis.

Objective: Understand why trimming improves the quality of downstream analysis and when to choose FastP vs Trim Galore!.

- Adapter trimming removes the Illumina sequences that contaminate reads when the cDNA insert is shorter than the read length; left in place, they cause mismapping.
- Quality trimming cuts low-quality bases (Q < 20) from the read ends: it improves the alignment rate and the precision of quantification.
- FastP is the more modern tool: it handles adapters, quality, minimum length, and UMIs in a single pass, runs faster, and includes an integrated HTML report.
- Trim Galore! is a well-established cutadapt wrapper with parameters optimized for Illumina: a more conservative and widely validated choice.
- In nf-core you can choose between the two with the --trimmer parameter; the default is Trim Galore!.
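
To make the idea of quality trimming concrete, the sketch below removes bases from the 3’ end for as long as their Phred score is below a threshold. This is a deliberately simplified rule under a Phred+33 assumption; real trimmers use more robust algorithms, and the function name and defaults are illustrative.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                threshold: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below `threshold`."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' = Q40, '5' = Q20, '#' = Q2 under Phred+33
trimmed = trim_3prime("ACGTACGT", "IIIII5##")
print(trimmed)
```

A minimum-length filter would normally follow, since reads that become very short after trimming tend to map ambiguously.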

BBSplit

What it is:
- A tool for removing contaminant reads from sequencing data.
Types of contamination:
- Microbial DNA or RNA
- Host DNA (in the case of RNA-Seq from host-associated samples)
- Other unwanted sequences
How it works:
- Aligns reads to a database of contaminant sequences.
- Removes reads that map to the contaminant database.

SortMeRNA

What it is:
- A tool for filtering ribosomal RNA (rRNA) reads from sequencing data.
Why is it important?
- rRNA often makes up a large proportion of total RNA.
- Removing rRNA enriches for mRNA and for the non-coding RNAs of interest.
How it works:
- Uses a database of rRNA sequences to identify and remove rRNA reads.
- Can be used with both single-end and paired-end data.

Alignment and Quantification

Read Alignment

Purpose:
- To map sequencing reads to a reference genome or transcriptome.
Why is it important?
- Identify the genomic origin of each read.
- Determine which genes or transcripts are expressed.
- Discover genetic variations (SNPs, indels).

Challenges:
- Reads may contain sequencing errors.
- Reads may map to multiple locations (multi-mapping reads).
- Genomes contain repetitive regions.
Output: A BAM (Binary Alignment/Map) file containing alignment information.

SAM/BAM Format: Storing Alignment Information

What it is:
- The standard file format for storing read alignments.
- SAM (Sequence Alignment/Map): A text-based format.
- BAM (Binary Alignment/Map): A compressed binary version of SAM.
Why is it important?
- Provides a standardized way to store alignment information.
- Allows for efficient storage and retrieval of large alignment datasets.
- Facilitates compatibility between different bioinformatics tools.

Content:
- Header section: Contains metadata about the alignment (reference genome, aligner used, etc.).

@RG     ID:1    SM:C5926_BM_IonCode_0118
@PG     ID:samtools     PN:samtools    VN:1.16.1    CL:samtools view -H C5926_BM_IonCode_0118.reassembled.bam

- Alignment section: Each line represents a single read and its alignment to the reference.
  - Includes information about mapping position, quality scores, alignment flags, and more.
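
Each alignment line has eleven mandatory tab-separated fields, and the FLAG field is a bitmask. The sketch below splits a line into named fields and decodes a few flag bits; the demo line is invented, only a subset of the defined bits is listed, and the helper names are illustrative.

```python
from typing import Dict, List

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# A subset of the FLAG bits defined by the SAM specification
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse strand",
    0x100: "secondary alignment",
    0x400: "PCR or optical duplicate",
}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line into its mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]
    return record


def decode_flag(flag: int) -> List[str]:
    """List the human-readable meaning of each bit set in FLAG."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


# Hypothetical alignment line
demo = "read1\t1040\tchr1\t100\t255\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNH:i:1"
sam_record = parse_sam_line(demo)
print(sam_record["RNAME"], sam_record["POS"], decode_flag(sam_record["FLAG"]))
```

Reading flags as bitmasks explains why duplicate marking can be recorded in the file without deleting any read: a single bit is switched on.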

STAR Aligner

What it is:
- A widely used splice-aware aligner for RNA-Seq data.
Key features:
- Fast and efficient alignment.
- Accurate handling of spliced reads.
- Can detect novel splice junctions.
- Supports various RNA-Seq protocols.
Advantages:
- High accuracy
- Speed
- Versatility

Objective: Present STAR as a splice-aware aligner and explain why it is the de facto standard for RNA-seq.

- STAR (Spliced Transcripts Alignment to a Reference) was designed specifically for RNA-seq, with native handling of splice junctions.
- Being splice-aware is essential: a read can span the junction between two exons that are separated in the genome by an intron thousands of bases long, and an ordinary genomic aligner would not handle it correctly.
- Novel splice junctions: STAR detects splice junctions that are not yet annotated in the GTF/GFF file, which is valuable for transcriptome studies.
- It produces coordinate-sorted BAM files and needs about 30 GB of RAM for the human genome (it loads a large index into memory).
- In nf-core/rnaseq it is the default aligner; HISAT2 is enabled with the --aligner hisat2 parameter.

HISAT2

What it is:
- Another popular splice-aware aligner for RNA-Seq.
Key features:
- Based on a graph FM index, an extension of the Burrows-Wheeler transform.
- Fast and memory-efficient.
- Supports both DNA and RNA alignment.
Comparison to STAR:
- May be slightly faster for some datasets.
- May have slightly lower accuracy for certain types of reads.
- Choice often depends on specific needs and preferences.

UMI-tools dedup

What it does:
- Identifies and removes PCR duplicates from RNA-Seq data using Unique Molecular Identifiers (UMIs).
- Groups reads with the same UMI and genomic mapping location.
- Selects the read with the highest mapping quality as the representative read for each group.
Why is it important?
- PCR amplification during library preparation can introduce duplicate reads.
- These duplicates can bias gene expression estimates, especially for low-abundance transcripts.
- UMI-tools dedup helps to correct for this bias and improve accuracy.
Benefits:
- More accurate gene expression quantification.
- Reduced false positives in differential expression analysis.
- Improved sensitivity for detecting rare transcripts.
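
The grouping logic described above can be sketched as follows. This simplified version treats two reads as duplicates only when chromosome, position, strand, and UMI match exactly, and keeps the one with the highest mapping quality; it ignores sequencing errors in the UMI, which real tools account for. The `Alignment` type and the demo reads are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    name: str
    chrom: str
    pos: int
    strand: str
    umi: str
    mapq: int


def dedup_by_umi(alignments: List[Alignment]) -> List[Alignment]:
    """Keep one read per (chrom, pos, strand, UMI) group: the highest MAPQ."""
    best: Dict[Tuple[str, int, str, str], Alignment] = {}
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand, aln.umi)
        if key not in best or aln.mapq > best[key].mapq:
            best[key] = aln
    return list(best.values())


reads = [
    Alignment("r1", "chr1", 100, "+", "ACGT", 30),  # PCR copy of r2
    Alignment("r2", "chr1", 100, "+", "ACGT", 60),
    Alignment("r3", "chr1", 100, "+", "TTAG", 60),  # same position, other molecule
    Alignment("r4", "chr1", 250, "+", "ACGT", 60),  # same UMI, other position
]
kept = [aln.name for aln in dedup_by_umi(reads)]
print(kept)
```

Note that r3 survives even though it starts at the same coordinate as r2: the different UMI shows that it came from a distinct molecule, which a purely coordinate-based method could not know.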

RSEM

What it is:
- RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data.
Key features:
- Uses a statistical model to estimate transcript abundance from aligned reads.
- Can handle different types of RNA-Seq data (single-end, paired-end).
- Provides various output formats (counts, TPM, FPKM).
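
The relationship between counts, TPM, and FPKM can be shown with a small calculation. The sketch assumes that per-transcript counts and effective lengths are already known; it does not reproduce the statistical model RSEM uses to estimate those counts, and the numbers are invented.

```python
from typing import List


def tpm(counts: List[float], lengths: List[float]) -> List[float]:
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def fpkm(counts: List[float], lengths: List[float]) -> List[float]:
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_fragments = sum(counts)
    return [c * 1e9 / (l * total_fragments) for c, l in zip(counts, lengths)]


# Hypothetical counts and effective lengths (bp) for three transcripts
demo_counts = [100.0, 200.0, 300.0]
demo_lengths = [1000.0, 2000.0, 1000.0]
tpm_values = tpm(demo_counts, demo_lengths)
fpkm_values = fpkm(demo_counts, demo_lengths)
print(tpm_values, fpkm_values)
```

The first two transcripts get the same TPM although one has twice the counts, because it is also twice as long. TPM values always sum to one million within a sample, which FPKM values do not.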

Pseudo-aligners

A New Approach to Quantification

Fast and Efficient Transcript Abundance Estimation

What they are:
- A class of RNA-Seq quantification tools that do not require exact alignment of reads to a reference genome.
How they work:
- Use lightweight algorithms to assign reads to transcripts
- Effectively “pseudo-align” reads to transcripts without determining their precise genomic location.
Advantages:
- Speed: Significantly faster than traditional alignment-based methods.
- Efficiency: Reduced computational resources required.
- Accuracy: Comparable or even superior accuracy for transcript quantification.

Popular tools:
- Salmon
- Kallisto
Applications:
- Gene expression profiling
- Differential expression analysis
- Transcript isoform quantification

Objective: Explain the concept of pseudo-alignment and its advantages over classical alignment.

- Pseudo-alignment: instead of finding the exact position of the read on the genome, it only determines which transcripts the read could have come from (a notion called compatibility), skipping traditional base-by-base alignment.
- The key difference from STAR/HISAT2 is speed: Salmon is roughly 10-100× faster because it works against the transcriptome index rather than the whole genome.
- Comparable accuracy: Salmon’s bias models (GC content, position, sequence) yield expression estimates as reliable as those from classical aligners.
- Note that pseudo-aligners quantify transcripts: the counts are then aggregated to gene level with tximport in R.
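
The compatibility idea can be illustrated with a toy example: index every k-mer of every transcript, then intersect the transcript sets hit by the k-mers of a read. This is only a sketch of the principle; it is not how Salmon or Kallisto are implemented, and the transcript sequences, the value of k, and the function names are invented.

```python
from collections import defaultdict
from typing import Dict, Set


def build_kmer_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of transcripts that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Intersect the transcript sets of all indexed k-mers found in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # unknown k-mer (e.g. sequencing error): skipped in this sketch
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two hypothetical isoforms sharing their first six bases
toy_transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACTTT"}
toy_index = build_kmer_index(toy_transcripts, k=4)
shared = compatible_transcripts("ACGTAC", toy_index, k=4)
unique = compatible_transcripts("TACGGA", toy_index, k=4)
print(shared, unique)
```

A read from the shared region is compatible with both isoforms, and the quantification model then decides how to split such ambiguous reads between them; no genomic coordinate is ever computed.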

Kallisto vs Salmon

Feature	Kallisto	Salmon
Algorithm	Pseudoalignment (k-mers)	Quasi-mapping / selective alignment
Speed	Fast	Fast
Memory Usage	Lower	Higher
Features	Basic (bootstrapping)	More advanced (library type inference, bias correction, bootstrapping)
Accuracy	High	High
Ease of Use	High	High
Output Formats	Abundance estimates (estimated counts, TPM)	Abundance estimates (estimated counts, TPM)
Applications	Gene expression, DE analysis, isoform quantification	Gene expression, DE analysis, isoform quantification

Salmon Output File Formats

Objective: Describe Salmon’s output files and identify the one we will use in R.

- quant.sf is Salmon’s main output file: for each transcript it contains the columns Name, Length, EffectiveLength, TPM, and NumReads. This is the file we import into R with tximport.
- cmd_info.json records the exact command parameters: essential for reproducibility and for documenting the analysis.
- The aux_info/ folder contains meta_info.json, with run statistics (fragments observed, fragments mapped, mapping rate) and, if Salmon inferred the library type automatically, the outcome of the inference.
- lib_format_counts.json reports how many fragments are compatible with each library type: useful for checking that the --libType parameter is correct.
- In the course practicals: we will work directly with quant.sf through tximport, which aggregates the counts to gene level and prepares them for DESeq2.
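
To show what gene-level aggregation means, the sketch below reads a quant.sf-style table and sums NumReads per gene. It is a plain summation for illustration, not a reimplementation of tximport (which also handles length corrections); the transcript names, values, and the `tx2gene` mapping are invented.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def gene_level_counts(quant_sf: TextIO, tx2gene: Dict[str, str]) -> Dict[str, float]:
    """Sum the NumReads column of a quant.sf table per gene."""
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(quant_sf, delimiter="\t"):
        gene = tx2gene.get(row["Name"])
        if gene is not None:  # transcripts without a gene mapping are skipped
            totals[gene] += float(row["NumReads"])
    return dict(totals)


# Hypothetical quant.sf content with the five standard columns
demo_quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1320.0\t12.5\t10.0\n"
    "tx2\t900\t720.0\t8.1\t5.5\n"
    "tx3\t2100\t1920.0\t1.2\t2.0\n"
)
demo_tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_counts = gene_level_counts(demo_quant, demo_tx2gene)
print(gene_counts)
```

NumReads values are estimates and can be fractional, which is why the totals are floats rather than integers.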

Post-alignment Processing

Refining Your Alignments

What it is:
- A series of steps performed after read alignment to improve data quality and prepare for downstream analysis.
Key steps:
- Deduplication: Removing PCR duplicates to avoid bias in gene expression estimates.
- Generating coverage tracks: Visualizing read distribution and assessing sequencing uniformity.
- (Optional) Other steps: Depending on the analysis goals, this may include filtering reads, recalibrating base quality scores, and other processing steps.
Why is it important?
- Improves the accuracy and reliability of downstream analysis.
- Reduces noise and bias in the data.
- Facilitates data interpretation and visualization.

SAMtools: Manipulating and Analyzing Alignments

What it is:
- A suite of command-line utilities for working with SAM/BAM files.
- A Swiss Army Knife for SAM/BAM Files
Key functions:
- samtools sort: Sorts alignments in a BAM file by coordinate.
  - Why is sorting important? Many downstream tools require sorted BAM files for efficient processing.
- samtools index: Creates an index file for a sorted BAM file.
  - Why is indexing important? Allows for fast retrieval of specific regions of the alignment.
- samtools stats: Generates statistics about an alignment file.
  - Provides insights into alignment quality, read depth, and other metrics.

Picard MarkDuplicates: Deduplication without UMIs

How it works:
- Identifies duplicates based on mapping coordinates of reads.
- Marks duplicate reads in the BAM file.
Considerations:
- Less accurate than UMI-based methods.
- Can be used when UMIs are not available.
- Often used in combination with other filtering steps.

Objective: Explain coordinate-based deduplication as an alternative to UMI-based methods.

- Picard MarkDuplicates identifies duplicates by comparing the alignment start coordinates of read pairs: two read pairs that start at exactly the same positions are considered duplicates.
- The tool marks duplicates in the BAM FLAG field without removing them, leaving the choice to the user: a conservative approach that allows later inspection.
- Important limitation: without UMIs (Unique Molecular Identifiers), independent reads that start at the same coordinate are wrongly marked as duplicates, with a risk of over-deduplication in high-coverage regions.
- nf-core applies Picard as a standard step; if the protocol includes UMIs, the UMI-tools module is used instead.

BEDTools genomecov: Quantifying Read Depth

What it does:
- Calculates the number of reads overlapping each position in the genome.
- Outputs a coverage file in various formats (BEDGRAPH, bed, etc.).
Options:
- Calculate coverage for different strand orientations.
- Normalize coverage by library size.
- Generate histograms of coverage depth.

Objective: Explain why computing coverage is useful for visualization and quality control.

- BEDTools genomecov computes, for each position in the genome, how many reads cover it and writes the result in BEDGRAPH format; BEDGRAPH files are typically converted to bigWig for genome browsers such as IGV or UCSC.
- Coverage tracks are the first visual tool for checking that reads are distributed evenly across gene bodies: RNA degradation or 3’ bias in the library shows up clearly in these plots.
- Normalizing by library size makes it possible to compare samples with different sequencing depths in the same genome browser.
- In practice: loading bigWig tracks into IGV is a visual QC step that no automated report can fully replace.
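
Per-base coverage can be computed from read intervals with a difference array, and consecutive positions with equal depth can be collapsed into BEDGRAPH-style lines. The sketch below assumes 0-based, half-open intervals on a single chromosome and omits zero-depth runs; it illustrates the calculation only and is not the BEDTools implementation.

```python
from typing import List, Tuple


def per_base_coverage(intervals: List[Tuple[int, int]], chrom_length: int) -> List[int]:
    """Depth at each position, from 0-based half-open (start, end) intervals."""
    diff = [0] * (chrom_length + 1)
    for start, end in intervals:
        diff[start] += 1   # coverage rises where a read starts
        diff[end] -= 1     # and falls just after it ends
    coverage, depth = [], 0
    for delta in diff[:chrom_length]:
        depth += delta
        coverage.append(depth)
    return coverage


def to_bedgraph(chrom: str, coverage: List[int]) -> List[str]:
    """Collapse runs of equal, non-zero depth into 'chrom start end depth' lines."""
    lines, run_start = [], 0
    for i in range(1, len(coverage) + 1):
        if i == len(coverage) or coverage[i] != coverage[run_start]:
            if coverage[run_start] > 0:
                lines.append(f"{chrom}\t{run_start}\t{i}\t{coverage[run_start]}")
            run_start = i
    return lines


# Two hypothetical overlapping reads on an 8 bp chromosome
demo_coverage = per_base_coverage([(0, 4), (2, 6)], chrom_length=8)
demo_bedgraph = to_bedgraph("chr1", demo_coverage)
print(demo_coverage)
print("\n".join(demo_bedgraph))
```

Dividing each depth by the number of mapped reads (in millions) would give the library-size normalization mentioned above.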

Downstream Analysis

RSeQC: Comprehensive Quality Assessment of Your RNA-Seq Data

What it is:
- A suite of tools for RNA-Seq quality control and analysis.
Key functions:
- Read distribution: Assess the distribution of reads across different genomic features (e.g., coding regions, introns, UTRs).
- Gene body coverage: Evaluate the uniformity of coverage across gene bodies.
- Strand specificity: Check the strand specificity of the library.
- Junction saturation: Determine if sufficient sequencing depth has been achieved for junction detection.
- Other QC metrics: Assess rRNA contamination, insert size distribution, and more.
Benefits:
- Provides comprehensive quality assessment of RNA-Seq data.
- Identifies potential biases and issues in library preparation or sequencing.
- Helps to ensure the reliability and accuracy of downstream analysis.

Objective: Present RSeQC as a tool for identifying systematic biases in RNA-seq alignments.

- RSeQC computes QC metrics that FastQC cannot provide because they require alignment to the genome: the distribution of reads across genomic features (exons, introns, intergenic regions) is a key indicator of library quality.
- Gene body coverage: reads concentrated at the 3’ end of genes, rather than spread evenly, indicate RNA degradation before library preparation.
- Strand specificity: checks that the library preparation protocol was applied correctly; a mismatch here can cause systematic quantification errors.
- Junction saturation: indicates whether we are sequencing deeply enough to detect most known splice junctions.

Preseq: Assessing the Diversity of Your Library

What it is: A tool for estimating the complexity of a sequencing library.
Why is library complexity important?
- Library complexity refers to the number of unique DNA fragments in your library.
- Higher complexity means a more diverse representation of the original RNA population.
- Low complexity can lead to reduced sequencing efficiency and biased results.
What does Preseq do?
- Uses statistical modeling to predict the number of unique reads that would be obtained with deeper sequencing.
- Helps to assess whether additional sequencing is likely to yield new information.
- Can be used to compare the complexity of different libraries.
Benefits:
- Optimize sequencing depth.
- Avoid unnecessary sequencing costs.
- Improve the quality and efficiency of RNA-Seq experiments.

Objective: Explain the concept of library complexity and when it is worth sequencing more.

- Library complexity measures how many distinct cDNA molecules were present before PCR amplification: a complex library better represents the full RNA population of the sample.
- Preseq uses a statistical model to predict how many unique reads you would obtain at greater sequencing depth (for example, double); if the curve flattens, the library is saturated and sequencing more yields no new information.
- A low-complexity library (a curve that is already flat at low read depth) can indicate excessive PCR amplification, insufficient RNA input, or poor quality of the starting RNA.
- The tool helps optimize costs: if the curve is not yet saturated, investing in greater depth makes sense.

Qualimap RNA-Seq: A Deep Dive into Your RNA-Seq Data

What it is:
- A comprehensive tool for quality control and analysis of RNA-Seq data.
Key features:
- Alignment quality: Assesses mapping quality, mismatch rates, and indel rates.
- Coverage analysis: Evaluates coverage uniformity across genes and transcripts.
- Junction analysis: Examines splice junctions and identifies non-canonical junctions.
- Transcript coverage: Assesses coverage of known transcripts and identifies potential novel transcripts.
- rRNA content: Estimates the proportion of rRNA reads in the data.
- 5’ and 3’ bias: Detects biases in read coverage towards the 5’ or 3’ ends of transcripts.
Benefits:
- Provides detailed insights into the quality of RNA-Seq alignments.
- Identifies potential biases and technical artifacts.
- Helps to ensure the reliability and accuracy of downstream analysis.

Objective: Introduce Qualimap as an in-depth QC tool for alignments, complementary to RSeQC.

- Qualimap analyzes BAM files directly and produces detailed reports on mapping quality, mismatch and indel errors, and coverage distribution per gene and per transcript.
- 5’/3’ bias is one of the most useful indicators: coverage strongly skewed toward the 3’ end of transcripts indicates RNA degradation (typical of FFPE samples or samples with low RIN).
- The estimate of residual rRNA is essential: in a poly-A selection experiment you expect < 5% rRNA; higher values point to problems in the selection protocol.
- Note that Qualimap and RSeQC partially overlap: both are included in nf-core because they integrate well with MultiQC and produce complementary metrics.

Kraken2: Identifying Microbial Communities in Your Samples

What it is:
- A fast and accurate tool for assigning taxonomic labels to sequencing reads.
Why is it useful for RNA-Seq?
- RNA-Seq data can sometimes contain reads from microbial organisms (e.g., in microbiome studies, host-associated samples).
- Kraken2 can identify and classify these microbial reads, providing insights into the microbial community present in the sample.
How it works:
- Matches the k-mers of each read against a reference database and assigns the read to the lowest common ancestor of the matching taxa.
Benefits:
- Fast and efficient classification.
- Accurate identification of microbial species.
- Can be used with both DNA and RNA sequencing data.

Objective: Explain when and why to use Kraken2 in an RNA-seq experiment.

- Kraken2 assigns each read to an organism using a k-mer database: it is extremely fast (millions of reads per minute) and accurate for taxonomic classification.
- In RNA-seq it is useful for detecting biological contamination: cell cultures contaminated with mycoplasma, clinical samples containing commensal bacteria, or host-pathogen experiments where human reads must be separated from viral ones.
- If a substantial fraction of the reads (> 5-10%) maps to non-target organisms, that warrants investigation before proceeding to differential analysis.
- For standard experiments on human cell lines it is an optional step, but it is computationally cheap and can reveal unexpected problems.

Differential Expression: Finding the Genes that Matter

What it is:
- Identifying genes that are expressed at significantly different levels between two or more conditions (e.g., treated vs. control, healthy vs. diseased).
Why is it important?
- Uncover genes involved in specific biological processes or disease states.
- Identify potential biomarkers or therapeutic targets.
Tools:
- DESeq2: A popular R package for differential expression analysis.
- edgeR: Another widely used R package.
- limma: A flexible package for linear modeling and differential expression analysis.

Objective: Frame differential analysis as the end goal of the workflow and preview the R module.

- Differential analysis is the main purpose of almost every RNA-seq experiment: we want to know which genes change expression between conditions (treated vs control, diseased vs healthy).
- DESeq2 is the tool we will use in the course: it is considered the current standard for RNA-seq count data and incorporates robust statistical models for normalization and significance testing.
- Keep in mind that differential analysis requires raw (non-normalized) counts: this is why Salmon reports NumReads in quant.sf in addition to TPM.
- DESeq2, edgeR, and limma give similar results on good-quality data; differences emerge with few replicates or high-variance samples.

MultiQC: Your One-Stop Shop for QC Reports

What it is: A tool that aggregates results from multiple bioinformatics tools into a single HTML report.
Why is it useful for RNA-Seq?
- RNA-Seq analysis involves multiple QC steps (FastQC, SortMeRNA, Trim Galore!, etc.).
- MultiQC combines the results from these tools into a user-friendly report.
- This saves time and simplifies the QC assessment process.
Key features:
- Supports a wide range of bioinformatics tools.
- Generates interactive plots and tables for easy visualization.
- Highlights potential issues and inconsistencies in the data.
- Facilitates comparison of QC metrics across multiple samples.
Benefits:
- Saves time and effort in QC assessment.
- Improves clarity and organization of QC results.
- Facilitates data interpretation and decision-making.

Objective: Present MultiQC as the tool that aggregates all the pipeline’s QC reports into a single browsable document.

- MultiQC automatically reads the outputs of FastQC, fastp, STAR, Salmon, RSeQC, Qualimap, Picard, and dozens of other tools: a single HTML report for assessing the entire sequencing run.
- With 10, 20, or 50 samples, opening individual reports is impractical: MultiQC lets you compare all samples at a glance and spot outliers immediately.
- The General Statistics section at the top of the report is the starting point: mapping rate, total reads, duplicate percentage, and residual rRNA for all samples in one table.
- It is good practice to generate and review the MultiQC report before proceeding to differential analysis: it reveals systematic problems, samples to exclude, or imbalances between experimental groups.

Questions?