What is make?

Make

GNU Make is a tool which controls the generation of executables and other non-source files of a program from the program’s source files.

  • set of instructions
  • builds “executables”, could be:
    • a pdf
    • a markdown document
  • can incorporate dependencies

Karl Broman and Make: http://kbroman.org/minimal_make/

How do you use make?

  • Have a Makefile (named makefile or Makefile) made up of rules like this:
target: dependencies
TAB instruction 1
TAB instruction 2

What is a target? Example Makefile

  • A target is something you want to build. Here is an example:
all: index.html index.R 
    
index.html: index.Rmd 
    Rscript -e "rmarkdown::render('index.Rmd')"

index.R: index.Rmd
    Rscript -e "knitr::purl('index.Rmd')"

clean: 
    rm index.html index.R
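  • all is a convention, not a keyword: make with no arguments builds the first target in the Makefile, which here is all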

How do you run this?

In the terminal:

make # will run all
make all
make clean
make index.html
make index.R

So it’s like a script, so what?

  • Make decides whether a target needs to be regenerated by comparing file modification times. (Wikipedia)

  • This solves the problem of avoiding the building of files which are already up to date, but it fails when a file changes but its modification time stays in the past.

  • Almost like caching, can save you a lot of time
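The timestamp comparison can be sketched in shell. This is only an illustration of the idea, not how make is implemented: needs_rebuild is a hypothetical helper name, and -nt is the shell's "newer than" file test (assumed available, as it is in bash and dash).

```shell
# needs_rebuild TARGET DEP...  (hypothetical helper, not part of make)
# Succeeds (exit status 0) when TARGET should be rebuilt.
needs_rebuild() {
    target=$1
    shift
    # A target that does not exist yet always needs building.
    [ -e "$target" ] || return 0
    for dep in "$@"; do
        # -nt is true when $dep was modified more recently than $target.
        if [ "$dep" -nt "$target" ]; then
            return 0
        fi
    done
    # Every dependency is older than the target: nothing to do.
    return 1
}
```

For example, `needs_rebuild index.html index.Rmd && echo "rebuild"` prints "rebuild" only when index.html is missing or older than index.Rmd.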

Some catches

  • Make requires tabs, not spaces. Make sure your text editor for your Makefile is configured properly.

  • Changing directories needs to be done on the same line as the command that relies on it, because make runs each recipe line in its own shell:

index.html: R/somefile.R
    cd R
    Rscript somefile.R

will not work.
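The reason is that make hands each recipe line to a separate shell, so a cd on one line is forgotten before the next line runs. A minimal shell sketch of that behaviour, using sh -c to stand in for the shells make would start (the temporary directory and variable names are made up for the demonstration):

```shell
# Scratch directory containing an R/ subdirectory, like the example above.
workdir=$(mktemp -d)
mkdir "$workdir/R"

# Two recipe lines = two shells: the cd in the first shell is lost.
separate=$(cd "$workdir" && sh -c 'cd R' && sh -c 'basename "$(pwd)"')

# One recipe line = one shell: the cd is still in effect for the command.
together=$(cd "$workdir" && sh -c 'cd R; basename "$(pwd)"')

echo "separate lines ran in: $separate"
echo "same line ran in:      $together"

rm -rf "$workdir"
```

Only the second form reports R as the working directory.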

Single line executions

index.html: R/somefile.R
    cd R; Rscript somefile.R

If you end a line with a backslash, the next line is treated as a continuation of it.
Note that the next line still needs to be indented so make knows it belongs to the same recipe.

index.html: R/somefile.R 
    cd R; \
    Rscript somefile.R

Single line executions

If statements can be broken up, but again only with continuation lines:

index.html: R/somefile.R
    if [ -e ${name}.aux ]; \
    then \
    rm ${name}.aux; \
    fi;

If there are any spaces after the \, it doesn’t work: the backslash must be the last character on the line.
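Trailing whitespace is invisible in most editors, so it can help to search for it. A small sketch using grep; find_bad_continuations is a hypothetical helper name, not a make feature:

```shell
# find_bad_continuations FILE  (hypothetical helper)
# Prints "line-number:line" for every line where a backslash is followed
# by trailing spaces or tabs instead of ending the line.
find_bad_continuations() {
    grep -nE '\\[[:space:]]+$' "$1"
}
```

Running `find_bad_continuations Makefile` lists the offending lines. It prints nothing and returns a non-zero status when every backslash is the last character on its line.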

Splitting up dependencies

Dependencies can be broken up into multiple lines:

target: targetA targetB \
targetC targetD
  recipe

or defined on multiple lines

target: targetA targetB
target: targetC targetD

target:
  recipe

JHPCE computing and R

Setting up a passwordless login for your local machine

How do you submit an R script job to the cluster?

Submitting an R script job

  • qsub submits a job to the queue
  • By default, it assumes you’re submitting a shell script that contains one of these 2 lines:
Rscript script.R
R CMD BATCH script.R

What’s the difference? (Someone say something here…)

Submitting an R script job

  • Not great if you want to just run an R script without creating a shell script
  • If you have a script.R file, you can execute this a few ways:
  • Rscript allows for command-line arguments (read them in R with commandArgs(trailingOnly = TRUE)); in R versions before 3.5.0 it also did not load the methods package by default
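A minimal sketch of the shell-script route, assuming the R code lives in script.R next to the job script. The file name job.sh and the -cwd directive (run the job from the submission directory) are illustrative choices, not JHPCE requirements:

```shell
# Write a small job script; the quoted 'EOF' keeps the shell from
# expanding anything inside the here-document.
cat > job.sh <<'EOF'
#!/bin/bash
# Lines starting with "#$" are read by qsub as options.
#$ -cwd
Rscript script.R
EOF

# Show what would be submitted.
cat job.sh
```

It would then be submitted with `qsub job.sh`. With Rscript, anything the script prints goes to standard output, which the scheduler collects in the job's .o file. With R CMD BATCH script.R, the output is written to script.Rout instead.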

What is an array job?

The Sys.getenv function

  • Sys.getenv gets system environment variables
  • These are often set in .bash_profile or .bashrc, with a line like:
    • export VARIABLE=something

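On the JHPCE cluster, SGE_TASK_ID is the task identifier within an array job. It is set when you use the -t option in qsub (options go before the script name; anything after the script name is passed to the script as an argument):

qsub -t 1-100 script.R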
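Inside the job script itself, the task ID is an ordinary environment variable. The sketch below reads it with a fallback so the same script also runs outside an array job. It assumes, per the Grid Engine documentation as I understand it, that the variable is either unset or set to the literal string undefined in that case. task_id_or_default is a hypothetical helper name.

```shell
# task_id_or_default  (hypothetical helper)
# Prints SGE_TASK_ID, or 1 when the script is not running as an array task.
task_id_or_default() {
    case "${SGE_TASK_ID:-}" in
        ''|undefined) echo 1 ;;          # unset, empty, or non-array job
        *)            echo "$SGE_TASK_ID" ;;
    esac
}

task=$(task_id_or_default)
echo "Running task $task"
```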

Task IDs and expand.grid

  • expand.grid: Creates a data frame from all combinations of the supplied vectors or factors.
eg = expand.grid(param1 = 1:10, param2 = c(3, 5, 6))
head(eg)
##   param1 param2
## 1      1      3
## 2      2      3
## 3      3      3
## 4      4      3
## 5      5      3
## 6      6      3

Task IDs and expand.grid

Grab SGE_TASK_ID:

sim_num = Sys.getenv("SGE_TASK_ID")
sim_num = as.numeric(sim_num) # Sys.getenv returns char
this_sim = eg[sim_num,]
param1 = this_sim$param1
param2 = this_sim$param2
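The row lookup is just arithmetic on the task ID, because expand.grid varies the first argument fastest. A shell sketch of the same mapping for the grid above (param1 = 1:10, param2 = 3, 5, 6); task_to_params is a hypothetical helper written only to show the indexing:

```shell
# task_to_params ID  (hypothetical helper)
# Prints "param1 param2" for row ID of the 10 x 3 grid built above.
task_to_params() {
    id=$1
    # The grid has 10 * 3 = 30 rows; reject anything outside 1..30.
    if [ "$id" -lt 1 ] || [ "$id" -gt 30 ]; then
        echo "task id $id is outside 1..30" >&2
        return 1
    fi
    idx1=$(( (id - 1) % 10 ))   # position within param1, changes every row
    idx2=$(( (id - 1) / 10 ))   # position within param2, changes every 10 rows
    param1=$(( idx1 + 1 ))
    set -- 3 5 6                # the param2 values, in grid order
    shift "$idx2"
    param2=$1
    echo "$param1 $param2"
}

task_to_params 6
```

Since this grid has 30 rows, an array job over it would use -t 1-30. A task ID beyond the number of rows would select a row of NA values in the R code above.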

Task IDs and expand.grid

  • SGE_TASK_ID is not set during an interactive job
  • So if you’re testing, add something like
sim_num = Sys.getenv("SGE_TASK_ID")
sim_num = as.numeric(sim_num) # Sys.getenv returns char
if (is.na(sim_num)) { # for testing in an interactive session
  sim_num = 1
}
this_sim = eg[sim_num,]
this_sim
##   param1 param2
## 1      1      3

Name your jobs

  • The -N argument names your job: qsub -N NAME job.sh

  • This allows you to delete a job by the name qdel NAME
  • The output is now NAME.o1234234 instead of job.sh.o1234234
  • Array jobs will have NAME.o1234234.1, NAME.o1234234.2, etc.

Holding jobs/dependent jobs

  • If you want a job to run only after another job is done, you can use -hold_jid:
qsub -N JOB1 job1.sh
qsub -N JOB2 -hold_jid JOB1 job2.sh
  • The hold is released when JOB1 finishes, whether or not it executed without error.
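Since the hold says nothing about success, one common workaround is for the first job to leave a marker file only when it succeeds and for the second job to check for it. The sketch below is one way to do that; the helper names and marker file names are made up, and this is not a scheduler feature:

```shell
# run_and_mark MARKER CMD...  (hypothetical helper, used in the first job)
# Runs CMD and creates MARKER only if CMD exits successfully.
run_and_mark() {
    marker=$1
    shift
    rm -f "$marker"          # clear any marker left by an earlier run
    "$@" && touch "$marker"
}

# require_marker MARKER  (hypothetical helper, used in the dependent job)
# Fails with a message when the upstream job did not leave its marker.
require_marker() {
    [ -e "$1" ] || {
        echo "upstream job did not finish cleanly: $1 is missing" >&2
        return 1
    }
}
```

In job1.sh this could be used as `run_and_mark job1.done Rscript script.R`, and job2.sh could begin with `require_marker job1.done || exit 1`.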

Holding jobs/dependent jobs

  • Array jobs can depend on specific tasks
qsub -N JOB1 -t 1-100 job1.sh
qsub -N JOB2 -hold_jid_ad JOB1 -t 1-100 job2.sh

So JOB2 with task = 2 will run when JOB1 with task = 2 is done. It doesn’t have to wait for all of JOB1’s tasks.

  • They must have the same -t specification

Holding jobs/dependent jobs

  • With plain -hold_jid, the JOB2 array job will wait for all JOB1 tasks to finish before starting
qsub -N JOB1 -t 1-100 job1.sh
qsub -N JOB2 -hold_jid JOB1 -t 1-100 job2.sh

Log into cluster, sign onto node

In your local file:

## In local .bashrc file
alias qr="ssh -t enigma 'source /etc/profile; echo \"cd \$PWD\" > ~/.bash_pwd; qrsh'"
## In local .bashrc file
user=jmuschel
alias enigma='ssh -Y -X ${user}@jhpce01.jhsph.edu'

BONUS: qrsh with memory

Example with mem_free=30G,h_vmem=31G:

qrgig 30 31
## In local .bashrc file
qrgig(){ 
    x="${1}G";
    y="${2}G";
    ell="-l mem_free=$x,h_vmem=$y,$3";
    echo "qrsh requests were: $ell";
    cmd="source /etc/profile; echo \"cd \$PWD\" > ~/.bash_pwd; history -w; qrsh $ell";
    enigma -t $cmd
}

What are some other cluster questions?