Gentleman's Agreement

From WakeDEAC

Terminology and Conventions

While most folks think in terms of the number of jobs, most limits on job submission really concern the amount of computing resources consumed (in most cases, the number of nodes or processors). Although this page refers to jobs in general, please keep in mind that the real concern is having resources available on a reasonable timetable so that people can start new jobs.

Fairshare apportions a target average cluster utilization among the various user groups of the cluster. However, fairshare is enforced only when the cluster is overutilized. In an ideal cluster world, we have far more work to do than we have cluster resources, resulting in a lot of jobs in the queue waiting to be run. Fairsharing is a configuration parameter that helps determine which of those queued jobs to run next when a cluster resource becomes available. In an underutilized cluster environment, jobs run as soon as resources are available (which, in most cases, is immediately).
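
To make the idea concrete, here is a minimal sketch of how a fairshare target could influence which queued job runs next. It is an illustration only and not the cluster's actual scheduler configuration: the group names, the target fractions, and the `pick_next_job` helper are all hypothetical, and a real scheduler weighs fairshare alongside other factors.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    job_id: str
    group: str


def pick_next_job(queue, targets, recent_usage):
    """Return the queued job whose group is furthest below its fairshare target.

    targets and recent_usage map a group name to a fraction of the cluster.
    This is a simplification: it looks at fairshare alone (an assumption).
    """
    if not queue:
        return None

    def deficit(job):
        # Positive when the group has used less than its target share.
        return targets[job.group] - recent_usage.get(job.group, 0.0)

    return max(queue, key=deficit)


# Hypothetical groups and numbers, for illustration only.
queue = [QueuedJob("a1", "physics"), QueuedJob("b1", "chemistry")]
targets = {"physics": 0.40, "chemistry": 0.30}
recent_usage = {"physics": 0.55, "chemistry": 0.10}

next_job = pick_next_job(queue, targets, recent_usage)
print(next_job.job_id)
```

In this sketch the chemistry job is chosen because that group is under its target while physics is over. With an empty queue nothing is selected, which mirrors the underutilized case where fairshare never comes into play.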

Use it if you need it! The cluster is there to be used. However, there are two situations where a little common courtesy may be required:

  • If no one else is using it and you need it, you should be using it! However, in order to ensure a productive environment for all users, please follow the restrictions below as they mostly apply to our typical, non-overutilized cluster environment.
  • If everyone else is using it, submit your jobs anyway! The cluster software will make sure you get your Fairshare.

Job submission

  • Jobs should not exceed 96 hours (4 days).
Exception: If your research isn't meaningful without running longer, you may run significantly longer but only a limited number of jobs on a subset of the cluster (< 100% of your fairshare allocation, for example).
  • No group should exceed their group's fair share usage by more than 25% for longer than 24 hours at a time.
Exception: During periods of extremely low cluster utilization (e.g. 10-25%), a single user may exceed this limit but the jobs over this 25% threshold should not exceed 96 hours of running. In addition, jobs MUST be submitted in a staggered fashion so that a quarter of the jobs exceeding the threshold complete every 24 hours.
Example: Your fairshare is 8. A single person can run on 10 nodes without restriction under this item. However, to run on 14 nodes for more than 24 hours, the user must stagger submission of those jobs on all 14 nodes so that 1 node is freed by job completion every 24 hours.
  • No single user should consume 75% of the available computational resources of the cluster for any period of time.
No Exceptions
  • Excessive usage of the cluster (primarily large numbers of jobs, in this instance) requires staggered submission of jobs so that the scheduling routines can work in jobs from other users within a reasonable amount of time.
Exception: You do not have to stagger submission if the number of jobs you are submitting is less than the total number of processors on the cluster (remember, I said large numbers of jobs).
Side note: You are encouraged to do so anyway if the cluster has jobs queued in order to allow the fairsharing software to work as it should.
  • Any job that will run longer than 2 days must explicitly request Ethernet nodes if Myrinet is not to be used.
No Exceptions
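
The arithmetic behind the 25% rule and its staggering exception can be sketched as follows. The function names are hypothetical, and rounding the threshold down to a whole node is an assumption, since the rule above does not say how fractional nodes are handled.

```python
import math


def overage_threshold(fairshare_nodes):
    """Nodes usable without staggering: fairshare plus 25%.

    Rounding down to a whole node is an assumption.
    """
    return fairshare_nodes + fairshare_nodes // 4


def nodes_freed_per_day(requested_nodes, fairshare_nodes):
    """Nodes that must be released every 24 hours under the staggering rule.

    A quarter of the jobs above the threshold must complete each day,
    treated here as one job per node (an assumption).
    """
    excess = max(0, requested_nodes - overage_threshold(fairshare_nodes))
    return math.ceil(excess / 4)


# The example from the text: a fairshare of 8 and a request for 14 nodes.
print(overage_threshold(8))        # 10
print(nodes_freed_per_day(14, 8))  # 1
```

With a fairshare of 8, the threshold is 10 nodes. Running on 14 nodes puts 4 nodes over the threshold, so one node has to be freed every 24 hours, which also keeps each of the excess jobs within the 96 hour limit.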

Head Node Usage

  • Users are not allowed to run production jobs on the cluster head nodes (deach0*). Exception: certain visualization and interactive programs can be run on the head nodes, but not more than one instance of this program class per head node. If you are uncertain whether your program falls into this class, ASK!
  • Test jobs are restricted to less than two hours of processing time and only one instance per head node. Exception: again, if you need several hours to ensure the program is generating the correct science, you can run the test job under the following conditions:
  1. Your program is "niced",
  2. Only one instance runs on any one (but not all) of the four head nodes,
  3. You do not negatively impact other users and their access to /home, /gpfs*, or /archive0.
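
One way to honor the "niced" condition and the two hour limit for a test job is sketched below. It assumes a POSIX system with the standard `nice` utility on the PATH. The `run_niced` helper is hypothetical, and its timeout counts wall-clock time rather than processing time, so it is only an approximation of the limit above.

```python
import subprocess
import sys

TWO_HOURS = 2 * 60 * 60  # seconds


def run_niced(command, timeout_seconds=TWO_HOURS, niceness=19):
    """Run command at low CPU priority and stop it when the timeout expires.

    Raises subprocess.TimeoutExpired if the command runs too long.
    """
    return subprocess.run(
        ["nice", "-n", str(niceness), *command],
        timeout=timeout_seconds,
        check=False,
    )


# Stand-in for a real test program: a Python process that exits immediately.
result = run_niced([sys.executable, "-c", "pass"])
print(result.returncode)
```

This only covers priority and run time. It is still up to you to keep to one instance and to avoid heavy I/O against the shared file systems.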

Global Exceptions

While these limits and restrictions may look like hard and fast rules (simply because they are written down), they are "gentleman's agreements" and are flexible PROVIDED that advance warning and collaboration with your fellow users are employed. We are a friendly bunch and are willing (and have in the past) to work with each other to ensure everyone has access to the cluster (and as much as they need) for their research efforts.

Requests arising from a significant and urgent need for a large percentage of cluster resources can be granted pending approval of the cluster community. Simply petition the osiris-users list and make sure no one else has a similar urgent need. You can submit some of your jobs while waiting for everyone to chime in, BUT you have to make sure everyone has spoken up before proceeding with all of your jobs.