Module 5: De novo genome assembly: Short reads

Overview:

Module 5 covers de novo genome assembly from short-read sequencing data. De novo assembly reconstructs a genome from sequencing reads without the aid of a reference genome. The module guides students through the theoretical background and practical application of assembling bacterial genomes from short reads, such as those produced by high-throughput platforms like Illumina. Students will learn how to run de novo assembly pipelines, interpret assembly metrics, and assess the quality of the resulting assemblies. By the end of the module, participants will understand the challenges and strategies associated with short-read assembly and will be able to perform and analyze their own assemblies.

Topics:

Introduction to De Novo Assembly

  • Definition and importance of de novo genome assembly in bacterial genomics.
  • Challenges specific to short-read sequencing data.
  • Overview of the assembly process from raw reads to a draft genome.

Assembly Algorithms

  • Types of algorithms used for de novo assembly (greedy algorithms, overlap-layout-consensus, de Bruijn graphs).
  • Theoretical underpinnings of assembly algorithms.
  • Comparative strengths and weaknesses of different assembly approaches.
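
To make the de Bruijn graph idea concrete, here is a minimal Python sketch, assuming error-free toy reads and a very small k. The function name `build_de_bruijn` and the example reads are illustrative only; real assemblers add error correction, graph simplification, and repeat resolution on top of this basic construction.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge: prefix -> suffix.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (hypothetical data, not from a real genome).
reads = ["ATGGC", "TGGCA"]
graph = build_de_bruijn(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Edges seen in both reads appear twice in a successor list, which is a crude stand-in for the k-mer multiplicity that assemblers use to tell real sequence from sequencing errors.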

Pre-Assembly Quality Control

  • Importance of quality control for sequencing data.
  • Tools and techniques for assessing read quality (fastp, FastQC).
  • Methods for read trimming and filtering (Trimmomatic, Cutadapt).
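
As a rough illustration of what quality trimming does, the sketch below removes low-quality bases from the 3' end of a single read, assuming Phred+33 encoded quality strings. It is a deliberately simplified stand-in and not the algorithm used by Trimmomatic or Cutadapt; the function names and the threshold of 20 are assumptions chosen for the example.

```python
def phred33(qual_string):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(c) - 33 for c in qual_string]


def trim_3prime(seq, qual, min_q=20):
    """Drop trailing bases whose quality score is below min_q."""
    scores = phred33(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: six high-quality bases followed by two poor ones.
seq, qual = trim_3prime("ACGTACGT", "IIIIII#!", min_q=20)
print(seq, qual)
```

Dedicated trimmers also handle adapter removal, sliding-window quality checks, and minimum-length filtering, so this sketch is only a way to see the core idea.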

Running De Novo Assembly

  • Step-by-step guide to performing de novo assembly.
  • Selection and use of assembly software (SPAdes and Velvet for short reads; Flye, by contrast, is designed for long reads).
  • Practical considerations for running assembly (computational requirements, parameter selection).
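
One possible way to script an assembly run is sketched below: a small Python helper that builds a SPAdes command line for paired-end reads. The file names, output directory, and thread count are hypothetical placeholders, and the `--isolate` mode assumes a reasonably recent SPAdes release; check `spades.py --help` for the options available in the installed version.

```python
import shlex


def spades_command(reads_1, reads_2, out_dir, threads=4):
    """Build a SPAdes command for one paired-end library (not executed here)."""
    return [
        "spades.py",
        "--isolate",          # mode intended for high-coverage isolate data
        "-1", reads_1,        # forward reads
        "-2", reads_2,        # reverse reads
        "-o", out_dir,        # output directory
        "-t", str(threads),   # number of threads
    ]


# Hypothetical file names; replace with the trimmed reads from quality control.
cmd = spades_command(
    "sample_R1.trimmed.fastq.gz", "sample_R2.trimmed.fastq.gz", "spades_out"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps it easy to log and to pass to `subprocess.run` without shell quoting problems.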

Optimization and Parameter Tuning

  • How to choose and adjust parameters for optimal assembly output.
  • The impact of read length, coverage, and error rates on assembly quality.
  • Troubleshooting common issues in assembly runs.
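
The relationship between read count, read length, and coverage can be estimated as coverage = (number of reads × read length) / genome size. The sketch below applies that estimate; the genome size, read length, and read counts are made-up example values, and the function names are illustrative.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: a 5 Mb bacterial genome sequenced with 150 bp reads.
coverage = expected_coverage(2_000_000, 150, 5_000_000)
needed = reads_needed(100, 150, 5_000_000)
print(f"coverage: {coverage:.1f}x, reads for 100x: {needed}")
```

This is an average only; actual depth varies along the genome, so regions with locally low coverage can still break contigs.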

Assembly Validation and Error Correction

  • Techniques for validating the accuracy of an assembly (read mapping, scaffold validation).
  • Identifying and correcting common errors in assemblies (misassemblies, indels, substitutions).
  • Use of reference genomes for comparative validation when available.

Analysis of Assembly Summary Statistics

  • Understanding key assembly metrics (N50, L50, contig number, and total assembly length).
  • Interpreting the assembly summary statistics to assess completeness and contiguity.
  • Tools for generating assembly statistics (QUAST) and visualizing assembly graphs (Bandage).
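
N50 and L50 are easier to interpret after computing them by hand once. The sketch below assumes the common definitions: N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length, and L50 is the number of contigs needed to get there. The contig lengths are invented for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (total = 300).
n50, l50 = n50_l50([100, 80, 50, 30, 20, 10, 10])
print(f"N50 = {n50}, L50 = {l50}")
```

Tools such as QUAST may apply a minimum contig length before computing these values, so hand-computed numbers can differ slightly from a report.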

Comparative Assembly Analysis

  • Methods for comparing different assemblies of the same dataset.
  • Criteria for selecting the best assembly for downstream analysis.
  • The role of consensus building in resolving assembly differences.

Labs

  • Lab 1: Quality Control of Short-Read Sequencing Data
  • Lab 2: Running De Novo Assembly with Clean, High-Quality Short Reads
  • Lab 3: Analysis of Assembly Summary Statistics and Visualization

Learning Outcomes:

Upon completion of this module, students will be able to:

  • Understand the principles and challenges of de novo genome assembly using short-read sequencing data.
  • Execute de novo assembly pipelines using state-of-the-art software tools.
  • Analyze and interpret assembly summary statistics to evaluate the quality of genome assemblies.
  • Apply best practices to improve the accuracy and completeness of genome assemblies.