Appendix L: Running MAD Simulations on HTCondor

HTCondor is a workload management system for high-throughput computing on a cluster of networked computers. A pool of computers can be configured so that batch processes run whenever a machine's computing resources are not being used by its owner. This allows a large series of simulations to be split into smaller jobs that run on many machines at the same time, reducing the total time needed to complete the simulation. This section does not describe HTCondor in detail; it presents only the information relevant to using HTCondor for MAD simulations. For more information on HTCondor, visit the HTCondor website.

R packages on HTCondor

The R-based drivers require packages that are not always installed on every executing machine in the HTCondor pool. To ensure that each executing machine has all the packages needed to run the R driver, the user must create a .7z file containing the required packages. If machines in the pool have different versions of R installed, the user must create a separate .7z file for each version of R in the pool.

How to create a .7z file for an R-based driver:

R creates a folder in Documents/R/win-library/X.XX for each version of R installed on the machine. Each X.XX folder contains one directory per installed package. The user should select the directories of the required packages and create a .7z file from them using 7-Zip. R packages usually depend on other packages in order to run; those packages should also be included in the .7z archive.
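For users who prefer to script this step, the following is a minimal Python sketch rather than part of MAD. It assumes the 7-Zip command-line executable is available on the PATH as 7z, and it assumes the package folders should sit at the root of the archive, as they would when selected directly in the 7-Zip file manager. The function names and the archive name are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_7z_command(packages, archive_path):
    """Return the 7-Zip command that adds the named package folders to an archive."""
    # "a" is the 7-Zip command for adding files to an archive.
    return ["7z", "a", str(Path(archive_path).resolve()), *packages]


def create_package_archive(library_dir, packages, archive_path):
    """Archive the listed package folders found inside an R library folder."""
    library = Path(library_dir)
    missing = [name for name in packages if not (library / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"package folders not found in {library}: {missing}")
    if shutil.which("7z") is None:
        raise FileNotFoundError("7z executable not found on PATH")
    # Run inside the library folder so the package folders sit at the archive root.
    subprocess.run(build_7z_command(packages, archive_path), cwd=library, check=True)


# Show the command for three example packages without running 7-Zip.
# The archive name is a placeholder.
print(build_7z_command(["mvtnorm", "msm", "tmvtnorm"], "r_base_packages.7z"))
```

Calling create_package_archive with the X.XX folder of the relevant R version, the package list, and an output path would then produce one archive for that version of R.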

List of packages required for R-based drivers:

  • R-Gstat driver: gstat
  • R_Base driver: mvtnorm, msm, tmvtnorm

Tip: To determine which other packages are needed to run those listed above, move all packages out of the X.XX folder into a temporary folder. Then install the required package (e.g. gstat) in R, using Packages/Install package(s). R will locate and install any other packages that the requested package needs. Finally, check the X.XX folder again to see which additional packages were installed.
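The comparison described in the tip can also be done with a short script. This is a minimal Python sketch, not a MAD tool: it only lists the package folders present in a library folder and reports which ones appeared between two snapshots. The function names are hypothetical, and pkgA, pkgB and pkgC are placeholder package names.

```python
from pathlib import Path


def installed_packages(library_dir):
    """Return the names of the package folders inside an R library folder."""
    return {entry.name for entry in Path(library_dir).iterdir() if entry.is_dir()}


def newly_installed(before, after):
    """Return, sorted, the packages present in `after` but not in `before`."""
    return sorted(set(after) - set(before))


# Example with placeholder names: one package was present before the install,
# and two more appeared afterwards.
print(newly_installed({"pkgA"}, {"pkgA", "pkgB", "pkgC"}))
```

In practice, installed_packages would be called on the X.XX folder once before installing the required package and once after. If the folder was emptied first, as the tip suggests, everything in the second snapshot belongs in the archive.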

For convenience, a .7z archive containing all packages necessary to run the R-Gstat driver in MAD# with R version 3.1 is available for download from the Downloads section of the CodePlex site.

Running MAD jobs on multiple HTCondor pools:

MAD simulations can be run on multiple HTCondor pools and then combined afterward for postprocessing. The information entered under Submit project to HTCondor determines the names of the samples produced by the HTCondor pool. For example, if one enters

Initial sample per job: 1

Final sample per job: 2

Number of jobs: 100

then 200 samples will be produced (2 samples per job × 100 jobs), named sample1 through sample200. Because of this, running simulations on multiple HTCondor pools requires the user to choose the sample numbers so that the results from different pools do not overlap and can be combined afterward. Staying with the above example, the user could submit jobs to another HTCondor pool with the following specified:

Initial sample per job: 201

Final sample per job: 202

Number of jobs: 100

The first pool would create sample1 – sample200. The second pool would create sample201 – sample400. When Check output is clicked after all samples are returned, the user will have samples 1-400.
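The arithmetic behind this example can be sketched in Python. The numbering rule used here (each job produces final minus initial plus one samples, and jobs are numbered consecutively starting from the initial sample) is inferred from the two examples above and is an assumption, not taken from the MAD source. The function names are hypothetical.

```python
def sample_range(initial_sample, final_sample, number_of_jobs):
    """Return (first, last) sample numbers produced by one submission."""
    per_job = final_sample - initial_sample + 1
    if per_job < 1 or number_of_jobs < 1:
        raise ValueError("need at least one sample per job and one job")
    last = initial_sample + per_job * number_of_jobs - 1
    return initial_sample, last


def plan_pools(samples_per_job, jobs_per_pool):
    """Return non-overlapping submission settings for consecutive pools."""
    settings = []
    next_initial = 1
    for number_of_jobs in jobs_per_pool:
        final = next_initial + samples_per_job - 1
        first, last = sample_range(next_initial, final, number_of_jobs)
        settings.append({
            "initial_sample_per_job": next_initial,
            "final_sample_per_job": final,
            "number_of_jobs": number_of_jobs,
            "samples": (first, last),
        })
        # The next pool starts right after the last sample of this one.
        next_initial = last + 1
    return settings


# Two pools, 2 samples per job, 100 jobs each (the example above).
for pool in plan_pools(2, [100, 100]):
    print(pool)
```

For the two pools in the example, this yields initial and final values of 1 and 2 for the first pool (samples 1 to 200) and 201 and 202 for the second (samples 201 to 400).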

Requirements of the pool:

Although HTCondor pools can support a variety of architectures and operating systems, the current version of MAD requires that computers be running 64-bit Windows in order to be used for simulations. Machines in the pool not meeting these requirements will be ignored.

It is important that each 64-bit Windows computer on the pool has Microsoft Visual C++ 2010 Redistributable Package (x86) installed. If any machine does not have it installed, it can be found here.

Authenticating submitters on the pool:

In order to submit MAD simulations to a pool, the submitting user must first be authenticated on the pool. To perform a one-time user authentication:

  • Enter condor_status at the command line of the submitting machine and confirm that the machine is properly connected to the pool.
  • Enter condor_store_cred add at the command line of the submitting machine.
  • When prompted, type the Windows password of the user.

This process must be completed for each user who wishes to submit jobs to the pool. It only needs to be done once per user, as the user then remains authenticated on the pool.

Helpful condor commands:

HTCondor does not have a GUI and user interactions are all through the command line. Here are some helpful commands for monitoring and managing MAD simulations running on HTCondor:

  • condor_status: Check the state of all resources in the pool. “Unclaimed” slots are available to execute jobs. “Claimed” slots are already being used by HTCondor to run a batch process. “Owner” indicates that the resource is being used locally by the machine owner.
  • condor_q: Check the list of jobs that have been submitted to the HTCondor pool from the current machine.
  • condor_q -global: Check the list of jobs that have been submitted to the HTCondor pool from all submitters in the pool.
  • condor_q -submitter [submitter name]: Check the list of jobs that have been submitted to the HTCondor pool from a specific submitter in the pool.
  • condor_history: Check the list of all jobs that have been completed for the current machine.
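As an illustration of how these commands can be used from a script, the following Python sketch counts how many slots are in each state. It assumes an HTCondor version whose condor_status supports the -autoformat option, which prints the requested attribute (here State) on one line per slot. The function names are hypothetical, and the sample text stands in for real pool output.

```python
import subprocess
from collections import Counter


def count_slot_states(autoformat_output):
    """Count slots per state from `condor_status -autoformat State` output."""
    states = [line.strip() for line in autoformat_output.splitlines() if line.strip()]
    return Counter(states)


def query_slot_states():
    """Run condor_status on this machine and count the slot states."""
    result = subprocess.run(
        ["condor_status", "-autoformat", "State"],
        capture_output=True, text=True, check=True,
    )
    return count_slot_states(result.stdout)


# Made-up output for four slots, used here instead of querying a real pool.
sample_output = "Unclaimed\nClaimed\nOwner\nUnclaimed\n"
print(count_slot_states(sample_output))
```

On a machine connected to the pool, query_slot_states() would return the same kind of summary for the live pool, which gives a quick view of how many slots are currently available for MAD jobs.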

Last edited Aug 18, 2014 at 11:38 PM by segej87, version 5
