Make and JHPCE

What is `make`?

Make

GNU Make is a tool which controls the generation of executables and other non-source files of a program from the program’s source files.

set of instructions
builds “executables”, could be:
- a pdf
- a markdown document
can incorporate dependencies

Karl Broman and Make: http://kbroman.org/minimal_make/

How do you use `make`?

Have a Makefile (spelled makefile or Makefile)

target: dependencies
TAB instruction 1
TAB instruction 2

What is a `target`? Example `Makefile`

A target is something you want to build. Here is an example:

# all and clean do not name files, so declare them phony
.PHONY: all clean

all: index.html index.R

index.html: index.Rmd
	Rscript -e "rmarkdown::render('index.Rmd')"

index.R: index.Rmd
	Rscript -e "knitr::purl('index.Rmd')"

clean:
	rm -f index.html index.R

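all is not special to make; it is a conventional name. Running make with no arguments builds the first target in the Makefile, which here is all.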

How do you run this?

In the terminal:

make # will run all
make all
make clean
make index.html
make index.R

So it’s like a script, so what?

Make decides whether a target needs to be regenerated by comparing file modification times. (Wikipedia)
This solves the problem of avoiding the building of files which are already up to date, but it fails when a file changes but its modification time stays in the past.
This is almost like caching and can save you a lot of time.
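To make the modification-time rule concrete, here is a minimal shell sketch of the comparison. It is not how make is implemented; the helper name needs_rebuild and the scratch files are made up for illustration, and it assumes a shell whose test command supports -nt (bash and dash do).

```shell
# Hypothetical helper: a target is stale when it is missing
# or when its dependency is newer than it.
needs_rebuild() {
    target="$1"
    dep="$2"
    [ ! -e "$target" ] || [ "$dep" -nt "$target" ]
}

workdir=$(mktemp -d)
touch -t 202001010000 "$workdir/index.Rmd"    # dependency, older
touch -t 202001020000 "$workdir/index.html"   # target, newer

if needs_rebuild "$workdir/index.html" "$workdir/index.Rmd"; then
    first="rebuild"
else
    first="up to date"
fi

# "Edit" the source by giving it a later modification time.
touch -t 202001030000 "$workdir/index.Rmd"

if needs_rebuild "$workdir/index.html" "$workdir/index.Rmd"; then
    second="rebuild"
else
    second="up to date"
fi

echo "before edit: $first"
echo "after edit:  $second"
rm -r "$workdir"
```

Only the second check asks for a rebuild, which is why rerunning make on an unchanged project does almost no work.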

Some catches

Make requires tabs, not spaces. Make sure your text editor for your Makefile is configured properly.
Changing directories needs to be done on the same line as the command that relies on it, because make runs each recipe line in its own shell:

index.html: R/somefile.R
    cd R
    Rscript somefile.R

will not work.

Single line executions

index.html: R/somefile.R
    cd R; Rscript somefile.R
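The reason is that each recipe line gets its own shell. The sketch below imitates that outside of make, using one sh -c call per "recipe line"; the scratch directory and variable names are only for illustration.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/R"

# Two recipe lines -> two shells: the cd in the first is gone by the second.
separate=$(cd "$workdir" && sh -c 'cd R' && sh -c 'basename "$(pwd)"')

# One recipe line -> one shell: cd and the command share a directory.
joined=$(cd "$workdir" && sh -c 'cd R; basename "$(pwd)"')

echo "separate lines ran in: $separate"
echo "single line ran in:    $joined"
rm -r "$workdir"
```

The first case reports the scratch directory itself, the second reports R.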

If you end a line with a backslash, the next line is treated as a continuation of that line.
Note that the next line still needs to be indented so make knows it’s still in the same target.

index.html: R/somefile.R 
    cd R; \
    Rscript somefile.R

Single line executions

If statements can be broken up, but again every line needs a continuation backslash:

index.html: R/somefile.R
    if [ -e ${name}.aux ]; \
    then \
    rm ${name}.aux; \
    fi;

If you have spaces after the \, then it doesn’t work!

Splitting up dependencies

Dependencies can be broken up into multiple lines:

target: targetA targetB \
targetC targetD
  recipe

or defined on multiple lines

target: targetA targetB
target: targetC targetD

target:
  recipe

JHPCE computing and R

Setting up a passwordless login for your local machine

How do you submit an R script job to the cluster

Submitting an R script job

qsub submits a job to the queue
By default, it assumes you’re submitting a shell script that contains one of these 2 lines:

Rscript script.R
R CMD BATCH script.R

What’s the difference? (Someone say something here…)
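For reference, a wrapper script of the kind qsub expects could look like the sketch below. The file name job.sh, the scratch directory, and the -cwd directive (run the job from the submission directory) are assumptions for illustration, not something the slides prescribe.

```shell
# Write a minimal wrapper script into a scratch directory.
jobdir=$(mktemp -d)
cat > "$jobdir/job.sh" <<'EOF'
#!/bin/bash
#$ -cwd
# Run the analysis; script.R is assumed to sit next to job.sh.
Rscript script.R
EOF

cat "$jobdir/job.sh"
# On the cluster, this file would then be submitted with: qsub job.sh
```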

Submitting an R script job

Not great if you want to just run an R script without creating a shell script
If you have a script.R file, you can execute it a few ways:
- Put #!/usr/bin/env Rscript as the first line of your file
- Use R functions to help.
Rscript allows for command-line arguments; before R 3.5.0 it also did not load the methods package by default

What is an array job?

The `Sys.getenv` function

Sys.getenv gets system environment variables
These are often set in .bash_profile or .bashrc with a line like:
- export VARIABLE=something

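On the JHPCE cluster, SGE_TASK_ID is the task identifier for an array job. It is set when you use the -t option in qsub (options go before the script name, otherwise they are passed to the script as arguments):

qsub -t 1-100 script.R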
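A rough way to picture an array job without a cluster: the scheduler starts the same script once per task and sets SGE_TASK_ID differently each time. The loop below only imitates that for tasks 1 to 3; the one-line task_script is a stand-in for a real job.

```shell
# Stand-in for the job script: report which task it was started as.
task_script='echo "task $SGE_TASK_ID"'

results=""
for id in 1 2 3; do
    # Set the variable for this one child process only, as the scheduler would.
    line=$(SGE_TASK_ID="$id" sh -c "$task_script")
    results="$results$line;"
done

echo "$results"
```

Inside R, Sys.getenv("SGE_TASK_ID") reads the same variable that $SGE_TASK_ID reads here.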

Task IDs and `expand.grid`

expand.grid: Creates a data frame from all combinations of the supplied vectors or factors.

eg = expand.grid(param1 = 1:10, param2 = c(3, 5, 6))
head(eg)

##   param1 param2
## 1      1      3
## 2      2      3
## 3      3      3
## 4      4      3
## 5      5      3
## 6      6      3

Task IDs and `expand.grid`

Grab SGE_TASK_ID:

sim_num = Sys.getenv("SGE_TASK_ID")
sim_num = as.numeric(sim_num) # Sys.getenv returns char
this_sim = eg[sim_num,]
param1 = this_sim$param1
param2 = this_sim$param2
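The row lookup relies on the order expand.grid uses: the first argument varies fastest. As a sketch, the same task-to-parameter mapping can be written with integer arithmetic; the function name row_params is made up, and the grid sizes are hard-coded to match the example above (10 values of param1, then 3, 5, 6 for param2).

```shell
# Map a 1-based task ID to the row of
# expand.grid(param1 = 1:10, param2 = c(3, 5, 6)).
row_params() {
    i=$1
    p1=$(( (i - 1) % 10 + 1 ))       # param1 cycles 1..10
    case $(( (i - 1) / 10 )) in      # param2 changes every 10 rows
        0) p2=3 ;;
        1) p2=5 ;;
        2) p2=6 ;;
    esac
    echo "$p1 $p2"
}

row_params 1
row_params 11
row_params 30
```

This grid has 30 rows, so a matching array job would use -t 1-30; task IDs beyond that would index past the end of eg.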

Task IDs and `expand.grid`

SGE_TASK_ID is not set during an interactive job
So if you’re testing, add something like

sim_num = Sys.getenv("SGE_TASK_ID")
sim_num = as.numeric(sim_num) # Sys.getenv returns char
if (is.na(sim_num)) { # for testing in an interactive session
  sim_num = 1
}
this_sim = eg[sim_num,]
this_sim

##   param1 param2
## 1      1      3
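The same fallback can be done on the shell side, for example in a wrapper script. This is a sketch under two assumptions: the function name task_or_default is made up, and non-array jobs may see the literal value undefined rather than an unset variable, so anything that is not a plain number falls back to 1.

```shell
# Print the task ID if it is a plain number, otherwise fall back to 1.
task_or_default() {
    case "${SGE_TASK_ID:-}" in
        ''|*[!0-9]*) echo 1 ;;            # unset, empty, or non-numeric
        *)           echo "$SGE_TASK_ID" ;;
    esac
}

unset SGE_TASK_ID
interactive=$(task_or_default)

SGE_TASK_ID=7
array=$(task_or_default)

SGE_TASK_ID=undefined
plain=$(task_or_default)

unset SGE_TASK_ID
echo "interactive=$interactive array=$array plain=$plain"
```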

Name your jobs

The -N argument names your job: qsub -N NAME job.sh
This allows you to delete a job by name: qdel NAME
The output is now NAME.o1234234 instead of job.sh.o1234234
Array jobs will have NAME.o1234234.1, NAME.o1234234.2, etc.

Holding jobs/dependent jobs

If you want a job to run only after another job is done, you can use -hold_jid:

qsub -N JOB1 job1.sh
qsub -N JOB2 -hold_jid JOB1 job2.sh

This only guarantees that JOB1 is done, not that it executed without error.

Holding jobs/dependent jobs

Array jobs can depend on specific tasks with -hold_jid_ad:

qsub -N JOB1 -t 1-100 job1.sh
qsub -N JOB2 -hold_jid_ad JOB1 -t 1-100 job2.sh

So JOB2 with task = 2 will run when JOB1 with task = 2 is done. It doesn’t have to wait for all of the JOB1 tasks.

They must have the same -t specification

Holding jobs/dependent jobs

With -hold_jid instead, the JOB2 array job will wait for all JOB1 tasks to finish before starting:

qsub -N JOB1 -t 1-100 job1.sh
qsub -N JOB2 -hold_jid JOB1 -t 1-100 job2.sh

Log into cluster, sign onto node

http://lcolladotor.github.io/2013/12/11/quick-cluster-login-to-interactive-session/

## In cluster .bashrc file

## change dir automatically when using qrsh
## Details: https://github.com/rkostadi/BiocHopkins/wiki/Useless-Tips-&-Code-Snippets
if [ -f ~/.bash_pwd ]; then
    source ~/.bash_pwd
    rm ~/.bash_pwd
fi
alias qr='echo "cd $PWD" > ~/.bash_pwd; history -w; qrsh'

Log into cluster, sign onto node

In your local file:

## In local .bashrc file
alias qr="ssh -t enigma 'source /etc/profile; echo \"cd \$PWD\" > ~/.bash_pwd; qrsh'"
## In local .bashrc file
user=jmuschel
alias enigma='ssh -Y -X ${user}@jhpce01.jhsph.edu'

BONUS: qrsh with memory

Example with mem_free=30G,h_vmem=31G:

qrgig 30 31

## In local .bashrc file
qrgig(){ 
    x="${1}G";
    y="${2}G";
    ell="-l mem_free=$x,h_vmem=$y${3:+,$3}";  # append extra requests only if a 3rd argument is given
    echo "qrsh requests were: $ell";
    cmd="source /etc/profile; echo \"cd \$PWD\" > ~/.bash_pwd; history -w; qrsh $ell";
    enigma -t $cmd
}

What are some other cluster questions