Overview:
Module 4 covers data management and cloud computing for bacterial genomics. As Whole Genome Sequencing (WGS) generates ever larger volumes of data, efficient data handling and scalable computational resources become increasingly important. This module introduces cloud computing platforms, specifically Amazon Web Services (AWS), and shows students how to set up and use cloud instances for genomic data analysis. Students will learn how to acquire sequencing data, run quality control checks, and preprocess reads by trimming adapters and low-quality bases. With these skills, students will be able to handle large genomic datasets and run their analyses efficiently in the cloud.
Topics:
Setting up AWS Instances:
- Introduction to cloud computing and its advantages for genomic data analysis.
- Step-by-step guide to setting up an AWS account and launching Elastic Compute Cloud (EC2) instances.
- Configuring the compute environment with the necessary software and tools for WGS analysis.
- Choosing the right EC2 instance type for bioinformatics applications.
- Launching and connecting to an EC2 instance via SSH.
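
As an illustration of the final step in the list above, the following minimal Python sketch assembles the SSH command used to connect to a running EC2 instance. The key file name and host name are hypothetical placeholders, and the default user name is an assumption: it depends on the machine image (for example, "ubuntu" on Ubuntu images and "ec2-user" on Amazon Linux).

```python
import shlex


def build_ssh_command(key_path, host, user="ubuntu", port=22):
    """Return the argument list for an SSH connection to an EC2 instance.

    key_path: private key (.pem) downloaded when the key pair was created.
    host:     public DNS name or IP address of the instance.
    user:     login name; depends on the machine image (assumption: Ubuntu).
    """
    return ["ssh", "-i", key_path, "-p", str(port), f"{user}@{host}"]


# Hypothetical key file and host name, for illustration only.
command = build_ssh_command(
    "genomics-key.pem",
    "ec2-203-0-113-10.compute-1.amazonaws.com",
)

# Print the command so it can be copied into a terminal.
print(shlex.join(command))
```

SSH rejects private keys that other users can read, so the key file's permissions usually need to be restricted (for example with chmod 400) before the printed command will work.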
Acquire Data:
- Methods for obtaining bacterial WGS data, including public databases like NCBI SRA and ENA.
- Techniques for transferring data to the cloud, using Amazon S3 storage, FTP, and command-line utilities.
- Organizing and managing data within the cloud environment to facilitate efficient analysis.
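
The steps above can be sketched as a small Python helper that plans a per-sample directory layout and builds the download and upload commands. This is one possible arrangement, not a prescribed workflow: the accession, bucket name, prefix, and directory names are hypothetical, and the sketch assumes the SRA Toolkit (fasterq-dump) and the AWS CLI are installed on the instance.

```python
from pathlib import PurePosixPath


def project_layout(root, accession):
    """Map each analysis stage to its own directory under root/accession."""
    base = PurePosixPath(root) / accession
    return {stage: str(base / stage) for stage in ("raw", "qc", "trimmed")}


def fasterq_dump_command(accession, outdir, threads=4):
    """Build an SRA Toolkit command that downloads reads as FASTQ files."""
    return [
        "fasterq-dump", accession,
        "--outdir", outdir,
        "--split-files",          # write paired-end reads to separate files
        "--threads", str(threads),
    ]


def s3_upload_command(local_dir, bucket, prefix):
    """Build an AWS CLI command that copies a directory to an S3 bucket."""
    return ["aws", "s3", "cp", local_dir, f"s3://{bucket}/{prefix}/", "--recursive"]


# Hypothetical accession and bucket, for illustration only.
layout = project_layout("/data/module4", "SRR0000000")
download = fasterq_dump_command("SRR0000000", layout["raw"])
upload = s3_upload_command(layout["raw"], "my-genomics-bucket", "SRR0000000/raw")

for cmd in (download, upload):
    print(" ".join(cmd))
```

Keeping raw, QC, and trimmed files in separate directories makes it easier to rerun a single step without touching the original downloads.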
Quality Check and Trimming:
- The importance of quality control in WGS and the impact of sequencing quality on downstream analyses.
- Using FastQC to assess the quality of raw sequencing reads, including per-base quality scores, GC content, and sequence duplication levels.
- Introduction to fastp and other preprocessing tools for trimming adapters, filtering low-quality reads, and correcting sequencing errors.
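
To make the quality scores mentioned above concrete, the following Python sketch decodes a FASTQ quality string into Phred scores and builds example FastQC and fastp commands. It assumes Phred+33 encoding (the usual encoding for current Illumina data) and paired-end reads. The file names and the thresholds (Q20, minimum length 50) are illustrative assumptions, not recommended settings.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Convert a Phred score Q to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_quality(quality_line):
    """Arithmetic mean of the Phred scores; a rough per-read summary only."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores)


def fastqc_command(fastq_files, outdir, threads=2):
    """Build a FastQC command; the output directory must already exist."""
    return ["fastqc", "--outdir", outdir, "--threads", str(threads), *fastq_files]


def fastp_command(r1, r2, out1, out2, min_quality=20, min_length=50):
    """Build a fastp command for paired-end adapter and quality trimming."""
    return [
        "fastp",
        "--in1", r1, "--in2", r2,
        "--out1", out1, "--out2", out2,
        "--detect_adapter_for_pe",                       # infer adapters from read overlap
        "--qualified_quality_phred", str(min_quality),   # per-base quality threshold
        "--length_required", str(min_length),            # drop reads shorter than this
        "--correction",                                  # correct bases in overlapping regions
        "--html", "fastp_report.html",
        "--json", "fastp_report.json",
    ]


# 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0 under Phred+33.
print(phred_scores("I5!"))
print(error_probability(20))

# Hypothetical file names, for illustration only.
print(" ".join(fastqc_command(["sample_1.fastq", "sample_2.fastq"], "qc")))
print(" ".join(fastp_command("sample_1.fastq", "sample_2.fastq",
                             "trimmed_1.fastq", "trimmed_2.fastq")))
```

Running FastQC again on the trimmed files is a simple way to check that trimming had the intended effect.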
Labs:
- Lab 1: Setting Up AWS Instances
- Lab 2: Acquiring and Managing Data on AWS
- Lab 3: Quality Check and Trimming of Sequencing Data
Learning Outcomes:
Upon completion of this module, students will be able to:
- Set up and manage AWS instances for genomic data analysis.
- Acquire and transfer WGS data from public repositories to the cloud.
- Perform quality control checks on sequencing data using FastQC.
- Preprocess sequencing data with fastp, including trimming adapters and filtering reads based on quality metrics.