Learning through An Example#
We will walk through a quick end-to-end flow for testing and debugging a variant calling workflow to demonstrate how to use latch develop.
Building a Simple Variant Calling Workflow#
In this tutorial, we will be building a variant calling workflow. To follow along, clone the example code here:
git clone https://github.com/hannahle/simple-variant-calling.git cd simple-variant-calling
The repository consists of three folders:
buggy-wf: The workflow with bugs
good-wf: The final, functional workflow
wgs: Test data for the workflow
We will use
latch develop to test and debug the buggy-wf to arrive at good-wf effectively.
Let’s get started!
Preparing test data#
First, we must upload our test data folder,
wgs, to the Latch Platform.
Run the following command to upload the
wgs data from the terminal
$ latch cp wgs latch:///wgs
Once the data has finished uploading, verify whether it exists on Latch by using
latch ls like so:
$ latch ls Size Date Modified Name - - welcome/ - - wgs/
Overview of the variant calling workflow#
The data we are working with is part of a long-term evolution experiment by Richard Lenski, designed to assess the adaptation of E. coli in various environments.
Variant calling is a typical workflow used to observe the change in a population over successive generations. We can use this to analyze how this experiment’s E. coli_ population changed over time relative to the original population, E. coli strain REL606. To do so, we will align each of our samples to the original E.coli strain’s (REL606) reference genome to determine the differences between our reads after 40,000 generations versus the original genome.
Our variant calling pipeline will consist of five steps:
Index the reference genome
Align our reads to the reference genome
Convert our aligned reads from SAM to BAM format
Sort our BAM file by coordinates
Perform variant calling
Hence, our workflow will contain five tasks:
We have provided some (buggy) code for the five tasks above for you in the
buggy-wf, which you will now test and debug!
Testing and Debugging the Workflow#
First, enter the
buggy-wf folder and register it to Latch:
$ ls buggy-wf good-wf wgs $ cd buggy-wf $ latch register --remote .
We must register the workflow before we can debug it with
latch develop. The registration process builds the code’s environment, which is the key to successfully debugging your workflow. Now we can interact with the environment.
Entering the environment#
latch develop . in the workflow directory to enter the workflow environment.
We recommend reading this overview of the latch develop environment before proceeding.
Defining a test script#
Before testing the workflow end-to-end, it is helpful to run and test each task individually. To get started, we have provided a testing directory called
scripts with a
main.py inside containing commented-out task test code.
$ tree . . ├── Dockerfile ├── README.md ├── scripts │ └── main.py ├── version └── wf └── __init__.py 3 directories, 6 files
Uncomment some of the code in
main.py to get this:
# Import task functions from our workflow from wf import build_index, align_reads, convert_to_bam, sort_bam, variant_calling from latch.types import LatchFile, LatchDir # Call task function build_index(ref_genome = LatchFile("latch:///wgs/ref_genome/ecoli_rel606.fasta"))
The first line imports all tasks defined in
wf/__init__.pyso we can reference them in this script.
The second line imports the necessary Latch types.
The third line calls the task function to index the reference genome,
To use a file on Latch as test data for the task, navigate to the Latch Console, click on the specific file and copy the
path shown on the sidebar. After copying the path, prefix it with
latch:// to specify that it is a file on Latch, and
pass the whole string as a parameter to
Calling the test script#
Now that we have modified our local code, we can run it in the development environment. Note that local changes are automatically reflected in the development environment.
>>> python3 scripts/main.py ... FileNotFoundError: [Errno 2] No such file or directory: 'bwa'
The logs tell us that there is no file or directory called
bwa. One potential reason is that we might not have
installed the binary
>>> bwa bash: bwa: command not found
bwa binary is not installed! Checking the Dockerfile, notice that the installation instruction for
is commented out. Let’s uncomment it:
... RUN apt-get install bwa ...
Because we modified our Dockerfile, we must rebuild the environment and enter a new development session to load in the newest changes. First, exit the current development session:
>>> exit Exiting local development session
Re-register the workflow with the new Docker image
$ latch register --remote . ...
Now enter a new development session and re-run the test script:
>>> python3 scripts/main.py
The script should now run successfully!
Where are my outputs?#
To ensure that our tasks are working correctly, let’s look at their output files to ensure they’re correct.
To locate the output files, we can inspect the return statement of the
build_index task inside
@small_task def build_index(ref_genome: LatchFile = LatchFile("latch:///wgs/ref_genome/ecoli_rel606.fasta")) -> LatchDir: _bwa_cmd = [ "bwa", "index", ref_genome.local_path, ] subprocess.run(_bwa_cmd) output = Path(ref_genome.local_path).resolve().parent return LatchDir(output, "latch:///wgs/ref_genome")
We see that the task is returning a
LatchDir with the remote path
latch:///wgs/ref_genome, which indicates that the
output files are inside the
/wgs/ref_genome folder in the Latch Console.
Debugging Subsequent Tasks#
Similarly to how we tested our first task, we can also call the second task from
scripts/main.py like so:
from wf import build_index, align_reads, convert_to_bam, sort_bam, variant_calling from latch.types import LatchFile, LatchDir build_index(ref_genome=LatchFile("latch:///wgs/ref_genome/ecoli_rel606.fasta")) align_reads( ref_genome_dir=LatchDir("latch:///wgs/ref_genome/"), read1=LatchFile("latch:///wgs/data/SRR2584863_1.trim.sub.fastq"), read2=LatchFile("latch:///wgs/data/SRR2584863_2.trim.sub.fastq"), )
Here, we are passing the output of the first task,
latch:///wgs/ref_genome, as the input to the second task.
Run the test script:
>>> python3 scripts/main.py ... Original exception: failed to get presigned url for `latch:///wgs/trimmed_fastqs/SRR2584863_1.trim.sub.fastq`
This error tells us that there is no file called
/wgs/trimmed_fastqs/SRR2584863_1.trim.sub.fastq in the Latch
Console. Referencing the actual file paths of the trimmed FastQs, we can see that their paths are indeed wrong, with
the correct paths being
/wgs/data/SRR2584863_1.trim.sub.fastq, etc (listed under
/wgs/data instead of
We can make this modification to our test script and re-run the task as below:
>>> python3 scripts/main.py
The task now outputs the results to the folder
/results on Latch!
As an optional exercise, continue debugging the workflow’s final three tasks. The debugged workflow code is provided in the
good-wf folder for reference.
How to open a development session with
How local changes are synced to the development environment
How to reflect changes from your
Dockerfilein the latch develop environment