Welcome to Bioinformatics: The Missing Semester!
Introduction
I am a lover of theory and concepts: pushing the boundaries of scientific discovery and contributing to a broader body of knowledge just for the thrill of it. But I recognize that theory alone, though valuable, won’t adequately equip the next generation of scientists and engineers in biotech.
With that in mind, this is intended to be a practical guide to the bioinformatics methods every modern bioinformatician should know but was never explicitly taught. Perhaps you’re a grad student in a computational biology lab trying to keep up, a biotech practitioner looking to refine and expand your skillset, or a job seeker looking to break into the field. Whichever it is, you’ll find value in this hands-on publication.
What to expect
The curriculum will be as follows:
Module 1: Data Types
Lesson 1: RNA transcriptomics
Lesson 2: Spatial transcriptomics
Lesson 3: Proteomics
Module 2: Multi-omics
Multimodal data integration
Module 3: Reproducibility & Scale
Lesson 1: Docker containerization
Lesson 2: Parallel computing
Lesson 3: Nextflow workflows
Module 4: AI
Language models in multi-omics
My expertise lies specifically in single-cell analysis, so we’ll be working at that resolution throughout this publication. The Missing Semester will be intentionally project-driven, making it easy for folks to reproduce and adapt the code to suit their interests, enhance online portfolios, or have targeted projects to share with hiring managers. The projects themselves will be relatively rudimentary to encourage a focus on methods and interpretation. The intent is to build proficiency not only in one-off independent analyses, but also in shipping reproducible, scalable code as part of a larger codebase, much as you would be expected to do on a real-world bioinformatics team at a startup or an enterprise business.
I’ll include code snippets inline where appropriate, but everything we cover will be pushed to a GitHub repo that I will link at the end of every write-up. Should you run into any snags or if something breaks, please submit GitHub issues as you see fit.
Everything we do will be written in Python, because it has become the industry standard in biotech, particularly for end-to-end bioinformatics workflows that include machine learning and distributed compute.
Conclusion
I invite you to follow along with the write-ups here on Substack, but know that you won’t get the full value from this unless you open up an IDE and try running the code on your own. At first, simply try to reproduce it. Then start thinking critically about it. Critique it. Identify areas where you’d do things differently or add an idea of your own. There are no wrong answers, only idle minds. Don’t be the latter.
I think that’s all I’ve got to say for now. First up: RNA transcriptomics!