Introduction to RNA-Seq
Introduction to RNA-Seq analysis in R
Dipartimento di Biomedicina e Prevenzione
Marco Chiapello, Revelo Datalab
November 2024

Pre-processing and Quality Control

Raw reads: The Starting Point
Definition
The FASTA format is a text-based format for representing either nucleotide sequences or amino acid sequences. Both nucleotides and amino acids are represented using single-letter codes. The first line before the nucleotide/amino acid sequence contains the name of the sequence, preceded by the “>” symbol.
Common file extensions: .fasta, .fna, .ffn, .faa, .fa, .frn
>NM_001404729.1 Oryza sativa ribulose bisphosphate carboxylase small chain A
CTCAACAGCACTGCTACTGGACATACTCTACTACTACTAGCCAGTAAGCTAGCTAACTAACTACGTGGCT
ATGGCCCCCACCGTGATGGCCTCCTCGGCCACCTCCGTGGCTCCATTCCAAGGGCTCAANNNNNNNNNNN
Multi-FASTA file (several records in one file):
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK
>NM_001404729.1 Oryza sativa ribulose bisphosphate carboxylase small chain A
CTCAACAGCACTGCTACTGGACATACTCTACTACTACTAGCCAGTAAGCTAGCTAACTAACTACGTGGCT
ATGGCCCCCACCGTGATGGCCTCCTCGGCCACCTCCGTGGCTCCATTCCAAGGGCTCAANNNNNNNNNNN
>HF583486.1 Homo sapiens SOD1 gene for alternative protein SOD1, isolate 144496
ATGGATTCCATGTTCATGAGTTTGGAGATAATACAGCAGGCTGTACCAGTGCAGGTCCTCACTTTAATCC
TCTATCCAGAAAACACGGTGGGCCAAAGGATGAAGAGAGGCATGTTGGAGACTTGGGCAATGTGA
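The multi-FASTA layout above can be read with a few lines of Python. This is a minimal sketch, not a replacement for a dedicated parser; the function name `parse_fasta` is hypothetical, and it assumes every record starts with a ">" header line followed by one or more sequence lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for a (multi-)FASTA given as lines of text."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            # sequence lines are concatenated until the next header
            records[header] += line
    return records


example = """>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK
>HF583486.1 Homo sapiens SOD1 gene for alternative protein SOD1, isolate 144496
ATGGATTCCATGTTCATGAGTTTGGAGATAATACAGCAGGCTGTACCAGTGCAGGTCCTCACTTTAATCC
TCTATCCAGAAAACACGGTGGGCCAAAGGATGAAGAGAGGCATGTTGGAGACTTGGGCAATGTGA
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    # print the first word of the header and the sequence length
    print(name.split()[0], len(seq))
```

Each header maps to a single string, so a sequence wrapped over several lines is rejoined before any downstream use.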
Definition
The FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.
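Each record in a FASTQ file consists of 4 lines: a sequence identifier starting with "@", the raw sequence, a separator line starting with "+", and the quality string (one character per base):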
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The quality score encodes the estimated probability that the corresponding nucleotide was called incorrectly: the higher the score, the more reliable the base call.
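As an illustration, quality characters can be decoded into Phred scores and error probabilities. The sketch below assumes the common Phred+33 encoding (score = ASCII code minus 33) and the standard relation P = 10^(-Q/10); the function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


# Quality string taken from the FASTQ example record
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)

print(scores[:5])
print(error_probability(scores[-1]))
```

With this encoding, a score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000.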

FASTQ files can be very large, so in most cases you won’t be dealing with myFile.fastq, but with myFile.fastq.gz.
This means that the file is compressed, so it is not readable by a text editor without first being decompressed.
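A compressed FASTQ file can still be read directly from Python without decompressing it on disk. The sketch below uses the standard-library `gzip` module; `read_fastq` is a hypothetical helper that assumes well-formed 4-line records, and the demo file is written to a temporary directory only to keep the example self-contained.

```python
import gzip
import os
import tempfile
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from a plain or gzipped FASTQ."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # "+" separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading "@"


# Write a tiny gzipped FASTQ file so the example can run on its own
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "myFile.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@SEQ_ID\nGATTTGGGGTTCAAAG\n+\n!''*((((***+))%%\n")

records = list(read_fastq(demo_path))
print(records)
```

Reading record by record keeps memory use low, which matters because real FASTQ files often contain millions of reads.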

Pre-processing Tools

What does UMI-tools extract do?


Trimming step: cutadapt (adapter trimming), often combined with FastQC (quality assessment).
Alignment and Quantification

@RG ID:1 SM:C5926_BM_IonCode_0118
@PG ID:samtools PN:samtools VN:1.16.1 CL:samtools view -H C5926_BM_IonCode_0118.reassembled.bam
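The two lines above are header lines of an alignment file: @RG describes a read group and @PG a program that produced the file. A minimal sketch of how such lines could be split into tags follows; it assumes the fields are tab-separated, as in the SAM format, and `parse_header_line` is a hypothetical helper.

```python
from typing import Dict, Tuple


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split one SAM header line into its record type and TAG:VALUE pairs."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0].lstrip("@")
    tags: Dict[str, str] = {}
    for field in fields[1:]:
        # split on the first ":" only, so values may themselves contain ":"
        key, _, value = field.partition(":")
        tags[key] = value
    return record_type, tags


header = [
    "@RG\tID:1\tSM:C5926_BM_IonCode_0118",
    "@PG\tID:samtools\tPN:samtools\tVN:1.16.1"
    "\tCL:samtools view -H C5926_BM_IonCode_0118.reassembled.bam",
]

for line in header:
    record_type, tags = parse_header_line(line)
    print(record_type, tags)
```

Here SM holds the sample name, VN the program version, and CL the command line that was run.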

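UMI-tools dedup removes PCR duplicates (reads that share the same mapping position and UMI), which helps to correct for amplification bias and improve quantification accuracy.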
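The idea behind UMI-based deduplication can be sketched in a few lines. This is a deliberately simplified illustration using exact matching only, not the UMI-tools implementation: UMI-tools also offers methods that tolerate sequencing errors in the UMI, which are not reproduced here. The `Read` type and its fields are hypothetical.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int
    umi: str


def dedup_exact(reads: List[Read]) -> List[Read]:
    """Keep the first read for each (chromosome, position, UMI) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.umi)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("r1", "chr1", 100, "AACG"),
    Read("r2", "chr1", 100, "AACG"),  # same position and UMI as r1: duplicate
    Read("r3", "chr1", 100, "TTGA"),  # same position, different UMI: kept
    Read("r4", "chr1", 250, "AACG"),  # different position: kept
]

kept = dedup_exact(reads)
print([read.name for read in kept])
```

Reads at the same position with different UMIs are treated as distinct molecules, which is the information that position-only duplicate removal would lose.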

Pseudo-aligners
A New Approach to Quantification

| Feature | Kallisto | Salmon |
|---|---|---|
| Algorithm | Pseudoalignment (k-mers) | Quasi-mapping (BWT) |
| Speed | Fast | Faster |
| Memory Usage | Lower | Higher |
| Features | Basic | More advanced (library types, bootstrapping) |
| Accuracy | High | High |
| Ease of Use | High | High |
| Output Formats | Abundance estimates | Abundance estimates, counts, TPM, etc. |
| Applications | Gene expression, DE analysis | Gene expression, DE analysis, isoform quantification |
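To give an intuition for k-mer based pseudoalignment, the toy sketch below finds the transcripts compatible with a read by intersecting the sets of transcripts that contain each of its k-mers. It is an illustration of the principle only: the real tools use far more efficient index structures, and the transcript names and sequences here are invented for the example.

```python
from typing import Dict, Set


def build_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer to the set of transcripts that contain it."""
    index: Dict[str, Set[str]] = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def pseudoalign(read: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Return the transcripts compatible with all k-mers of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # no transcript contains every k-mer seen so far
    return compatible or set()


# Toy transcriptome (hypothetical sequences)
transcripts = {"t1": "ACGTACGGTCA", "t2": "ACGTACGTTAG"}
index = build_index(transcripts, k=4)

print(sorted(pseudoalign("ACGTACG", index, k=4)))  # shared region
print(sorted(pseudoalign("TACGGTC", index, k=4)))  # region specific to t1
print(sorted(pseudoalign("GGGGGGG", index, k=4)))  # not in the transcriptome
```

No base-by-base alignment is computed: the read is only assigned to a set of compatible transcripts, which is what makes this family of methods fast.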


Post-alignment Processing

samtools sort: Sorts alignments in a BAM file by coordinate.
samtools index: Creates an index file for a sorted BAM file.
samtools stats: Generates statistics about an alignment file.
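The three commands are typically run in sequence. The sketch below builds the command lines in Python and runs them only when asked to; it assumes samtools is installed and on the PATH, and the file names (sample.bam, sample.sorted.bam, sample.stats.txt) are hypothetical.

```python
import subprocess
from typing import List


def samtools_commands(bam: str, sorted_bam: str) -> List[List[str]]:
    """Build the sort, index and stats command lines for one BAM file."""
    return [
        ["samtools", "sort", "-o", sorted_bam, bam],  # coordinate sort
        ["samtools", "index", sorted_bam],            # index the sorted file
        ["samtools", "stats", sorted_bam],            # statistics on stdout
    ]


def run_pipeline(bam: str, sorted_bam: str, stats_path: str) -> None:
    """Run the three steps; requires samtools to be installed."""
    sort_cmd, index_cmd, stats_cmd = samtools_commands(bam, sorted_bam)
    subprocess.run(sort_cmd, check=True)
    subprocess.run(index_cmd, check=True)
    with open(stats_path, "w") as out:
        # samtools stats prints its report to standard output
        subprocess.run(stats_cmd, check=True, stdout=out)


commands = samtools_commands("sample.bam", "sample.sorted.bam")
for cmd in commands:
    print(" ".join(cmd))

# To execute on real data:
# run_pipeline("sample.bam", "sample.sorted.bam", "sample.stats.txt")
```

The order matters: indexing requires a coordinate-sorted file, so sorting comes first.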
Downstream Analysis

DESeq2: A popular R package for differential expression analysis.
edgeR: Another widely used R package.
limma: A flexible package for linear modeling and differential expression analysis.

Questions?