Quick start guide to running Nextflow on HPC

Link to code on github
After learning Nextflow at one of the Seqera sessions, I decided to try running it on my university’s HPC and made some tweaks to the original tutorial as an extra ‘challenge’ to test my understanding.

Setting up the environment

Load the Java Module:

Check available Java versions: module avail.
Load your desired Java version: module load java-19.

Verify Java Installation:

Check the Java version: java -version.

Set JAVA_HOME:

After loading Java, set JAVA_HOME: export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java)))).
Add the commands above to ~/.bashrc or ~/.bash_profile to make it permanent.
Honestly, I wasn’t sure whether this step is strictly necessary for Nextflow, but many Java-based applications and development tools look for JAVA_HOME, so I set it anyway.

Autoload Java each session:

To auto-load Java in every session, add the same module load command used above to ~/.bashrc or ~/.bash_profile: echo "module load java-19" >> ~/.bashrc.
Apply changes: source ~/.bashrc or log out and back in.
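The environment steps above can be collected into one fragment for ~/.bashrc. This is only a sketch: the module name java-19 is the one loaded earlier and will differ between clusters, and the guards are my own addition so the file stays harmless on machines where module or java is missing.

```shell
# Sketch of lines that could live in ~/.bashrc.
# Assumption: the cluster exposes Java as a module called java-19.
if command -v module >/dev/null 2>&1; then
    module load java-19 || echo "could not load java-19" >&2
fi

# Point JAVA_HOME at the installation that owns whichever java is first on PATH.
if command -v java >/dev/null 2>&1; then
    java_bin=$(readlink -f "$(command -v java)")
    JAVA_HOME=$(dirname "$(dirname "$java_bin")")
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
else
    echo "java not found on PATH; JAVA_HOME left unchanged" >&2
fi
```

Quoting each command substitution keeps the JAVA_HOME line working even if the install path contains spaces.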

Install Nextflow by following the documentation

Kumusta Mundo, Selamat Pagi, God Morgen

At the training session, I learned that Nextflow scripts combine Groovy and shell code. I followed the hello-world tutorial but added some code to process two files, taking two inputs (--input_file and --lang_file) instead of the single input in the original tutorial.

In Nextflow, a process represents a single task or step within a larger workflow. Each process is a self-contained unit that performs a specific action, such as running a script, processing a file, or executing a command. A workflow defines how multiple processes are connected and interact with each other: it calls the processes and passes data between them through channels, which determines the order in which they run. In this case, the workflow calls two processes, sayHello and toUpper.

🚨

Addendum: I later realized that I wasn’t supposed to run any tasks on the login node; I should instead start a container/pod using a launch script (either the one the Research Cluster provides out of the box or a customized one). Once a container is running, it can be accessed from the command line or through an interactive environment like Jupyter Notebook. I’m filing this github repo away for future reference on building a Custom Image for UCSD DataHub/DSMLP.

🎯

The goal of this workflow is to take in two input files and write two output files for each pair of lines in them. Each input file has 3 lines, and for each line we want to: i) create one output file containing the greeting on a single line (e.g. tagalog-output.txt with magandang umaga as its content) and ii) create an upper-cased version of that file (e.g. upper-tagalog-output.txt with MAGANDANG UMAGA as its content).
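To make the target concrete, here is a plain-shell emulation of the result we are after. It is not Nextflow and runs everything serially; the temporary directory and the language|greeting pairing are just illustration, using the lines from the input files shown further down.

```shell
# Plain-shell emulation of the intended result (not Nextflow).
outdir=$(mktemp -d)

# Each line pairs a language with its greeting, separated by '|'.
while IFS='|' read -r lang greeting; do
    # Step 1: one file per greeting, named after the language.
    printf '%s\n' "$greeting" > "$outdir/$lang-output.txt"
    # Step 2: an upper-cased copy of that file.
    tr 'a-z' 'A-Z' < "$outdir/$lang-output.txt" > "$outdir/upper-$lang-output.txt"
done <<'EOF'
malay|selamat pagi
tagalog|magandang umaga
norwegian|god morgen
EOF

ls "$outdir"
```

Three input pairs give six files, which matches the six tasks the Nextflow run reports later.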

Nextflow pipeline

// Sets a parameter for the output file name, defaulting to 'output.txt'
params.output_file = 'output.txt'

// Defines a process named 'sayHello'
process sayHello {
    input:
        // Takes two input values: 'greeting' and 'lang'
        val greeting
        val lang

    output:
        // Specifies the output file path using the input 'lang' and the 'params.output_file' parameter
        path "${lang}-${params.output_file}"

    // Script block with Unix-like commands
    // Writes the 'greeting' value to a file named as per 'lang' and 'params.output_file'
    script:
    """
    echo '${greeting}' > '${lang}-${params.output_file}'
    """
}

// Defines another process named 'toUpper'
process toUpper {

    input:
        // Takes a file path as input
        path input_file

    output:
        // Specifies the output file path, prefixing the input file name with 'upper-'
        path "upper-${input_file}"

    // Script block with Unix-like commands
    // Reads the 'input_file', converts its content to upper case, and writes to a new file
    script:
    """
    cat '${input_file}' | tr '[a-z]' '[A-Z]' > 'upper-${input_file}'
    """
}

// Defines the workflow
workflow {
    // Creates a channel from the file specified in 'params.input_file' and splits its content into lines
    greeting_ch = Channel.fromPath(params.input_file).splitText() { it.trim() }

    // Creates another channel from the file specified in 'params.lang_file' and splits its content into lines
    lang_ch = Channel.fromPath(params.lang_file).splitText() {it.trim()}

    // Calls the 'sayHello' process with 'greeting_ch' and 'lang_ch' as inputs
    sayHello(greeting_ch, lang_ch)

    // Calls the 'toUpper' process with the output of 'sayHello' process as its input
    toUpper(sayHello.out) 
}


greetings.txt:

selamat pagi
magandang umaga
god morgen

languages.txt:

malay
tagalog
norwegian
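When a process takes two channels as inputs, Nextflow pairs their values by position, and if one channel is shorter the process only runs as many tasks as the shorter channel allows. A quick sanity check before launching is therefore worthwhile. The snippet below is a sketch that recreates the two files in the current directory and previews the pairs; the line-count check is my own addition.

```shell
# Recreate the two input files shown above.
printf '%s\n' 'selamat pagi' 'magandang umaga' 'god morgen' > greetings.txt
printf '%s\n' malay tagalog norwegian > languages.txt

# Line i of one file is paired with line i of the other, so the counts should match.
n_greet=$(wc -l < greetings.txt | tr -d ' ')
n_lang=$(wc -l < languages.txt | tr -d ' ')
if [ "$n_greet" -ne "$n_lang" ]; then
    echo "line counts differ: $n_greet greetings vs $n_lang languages" >&2
fi

# Preview the pairs side by side.
paste -d '|' languages.txt greetings.txt
```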

Run the Nextflow pipeline

nextflow run test.nf --input_file "greetings.txt" --lang_file "languages.txt" -ansi-log false

Nextflow uses ANSI escape codes in its terminal logging to enhance readability with color and interactivity. However, in non-interactive contexts these features are less useful and clutter the logs with raw escape characters. Nextflow can disable rich ANSI logging (the -ansi-log false option, with a single leading dash because it is a Nextflow option rather than a pipeline parameter) for cleaner, plain-text output, so we’ll do that.
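The same setting can also be made through the NXF_ANSI_LOG environment variable, which is handy in batch scripts where you would rather not repeat the flag. A minimal sketch, assuming test.nf and the two input files sit in the current directory; the guard simply skips the run when Nextflow or the script is not available.

```shell
# Disable rich ANSI logging for every Nextflow run started from this shell.
export NXF_ANSI_LOG=false

# Only launch when Nextflow and the pipeline script are actually present.
if command -v nextflow >/dev/null 2>&1 && [ -f test.nf ]; then
    nextflow run test.nf --input_file greetings.txt --lang_file languages.txt
else
    echo "nextflow or test.nf not found; skipping run" >&2
fi
```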

The output from running the command above indicates that there were 6 tasks in total, which is correct: the sayHello process completed all of its tasks (3 of 3), and so did toUpper.

N E X T F L O W  ~  version 23.10.1
Launching test.nf [prickly_gates] DSL2 - revision: bc2825233c
executor >  local (6)
[39/0941f9] process > sayHello (2) [100%] 3 of 3 ✔
[5f/666175] process > toUpper (1)  [100%] 3 of 3 ✔

One of the more mystifying things for me was that the numbers next to the process names, (2) and (1), do not add up to the total number of tasks. Nextflow runs the tasks of a process in parallel when they work on independent data items, and each task is counted separately in the total (the 6 next to the executor). As far as I can tell, the number in parentheses is not a count at all: it is the index of the single task that the log line happens to be showing for that process, and the hash at the start of the line is the prefix of that task's work directory. Because tasks can be submitted and finish in any order, the index shown at the end need not be the highest one.

We can see that the commands successfully produced the expected outputs.
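Since the pipeline above does not publish its results anywhere, the output files stay inside the per-task directories under work/. A small helper can list them. This is a sketch: list_outputs is a hypothetical name, and the '*-output.txt' pattern assumes the default params.output_file.

```shell
# List the files the tasks wrote under a Nextflow work directory.
# -type f skips the symlinks Nextflow creates when staging a task's inputs.
list_outputs() {
    find "${1:-work}" -type f -name '*-output.txt' | sort
}

# Print each output file together with its content, if a work directory exists.
if [ -d work ]; then
    list_outputs work | while IFS= read -r f; do
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
fi
```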

Other tips:

Each task runs in its own directory under work/, and that directory holds a few hidden files that help with debugging:

.command.sh: This file contains the actual command executed by Nextflow. Review it to verify that the command was interpreted and executed as intended.
.exitcode: This file holds the exit code of the command. An exit code of 0 indicates successful execution. Any other number suggests an error or issue occurred.
.command.out: This file captures the output produced by the command. Check here to see what the command generated or returned during its execution.
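The hash shown in the log (for example 39/0941f9) is the prefix of the task's directory under work/, so the files .command.sh, .exitcode and .command.out can be dumped in one go. A sketch, where inspect_task is a hypothetical helper name and the hash is the one from the run above:

```shell
# Print the command, exit code, and stdout recorded for one task directory.
inspect_task() {
    task_dir=$1
    echo "== command =="
    cat "$task_dir/.command.sh"
    echo "== exit code =="
    cat "$task_dir/.exitcode"
    echo    # newline, in case .exitcode does not end with one
    echo "== stdout =="
    cat "$task_dir/.command.out"
}

# The log prints only the hash prefix, so let the shell glob complete it.
for d in work/39/0941f9*; do
    if [ -d "$d" ]; then
        inspect_task "$d"
    fi
done
```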

hello-world.nf on the research cluster
