= Server job scheduling =

This document describes proposed changes to server scheduling policies:
 * Feeder enumeration order
 * Of the jobs in shared memory, which to send to a host
 * Which app versions to use
 * What deadlines to assign

These policies work in conjunction with [PortalFeatures batch scheduling policies].

== Quality of service types ==

We seek to handle the following types of QoS:
 * Non-batch, throughput-oriented, with a fixed latency bound.
 * Batches with long deadlines (greater than the turnaround time of almost all hosts).
 * Batches to be completed "as fast as possible" (AFAP), with no a priori deadline.
 * Batches with a short deadline.

== Goals ==

The goals of the policies include:
 * Support the QoS types listed above.
 * Avoid assigning "tight" deadlines unnecessarily, because:
   * doing so may make it impossible to assign tight deadlines to jobs that actually need them;
   * tight deadlines often force the client to preempt other jobs, which irritates some volunteers.
 * Avoid long delays between the completion of a job instance and its validation. These delays irritate volunteers and increase server disk usage.
 * Minimize server configuration.

== Host statistics ==

We need a way to identify hosts that can turn around jobs quickly and reliably. Notes:
 * This is a property of (host, app version), not of the host alone.
 * This is not the same as processor speed. A host may have a high turnaround time for various reasons:
   * a large minimum work buffer size;
   * attachment to lots of other projects;
   * long periods of unavailability or network disconnection.

We propose the following. For each app A:
 * For each (host, app version), let X be the percentile of turnaround time.
 * For each (host, app version), let Y be the percentile of "consecutive valid results" (or +infinity if > 10), over all active hosts and all current app versions.
 * Let P(H, AV) = min(X, Y).

This will be computed periodically (say, every 24 hours) by a utility program.
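The P(H, AV) computation could be sketched as follows. This is a minimal sketch, not the proposed utility program: the field names and in-memory layout are assumptions (the real program would read host_app_version records from the database), and the orientation of the percentiles (higher = faster / more reliable) is inferred from the policy below, which sends work to hosts with high P values.

```python
def p_score(pairs, me):
    """Sketch of the proposed P(H, AV) = min(X, Y) over all active
    (host, app version) pairs.

    X: turnaround-time percentile, oriented so a faster host scores higher.
    Y: percentile of consecutive valid results, treating > 10 as +infinity
       per the proposal.  Field names are illustrative only.
    """
    def cap(v):
        # The proposal treats more than 10 consecutive valid results
        # as "reliable enough" (+infinity).
        return float("inf") if v > 10 else v

    n = len(pairs)
    # X: percentage of pairs whose turnaround is at least as large as ours.
    x = 100.0 * sum(p["turnaround"] >= me["turnaround"] for p in pairs) / n
    # Y: percentage of pairs with at most as many consecutive valid results.
    y = 100.0 * sum(cap(p["consec_valid"]) <= cap(me["consec_valid"])
                    for p in pairs) / n
    return min(x, y)
```

A host that is both fast and has a long run of valid results scores near 100; a host that is fast but error-prone (or reliable but slow) is pulled down by the min().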
Notes:
 * When a new app version is deployed, the host_app_version records for the previous version should be copied, on the assumption that hosts reliable for one version will be reliable for the next.

== Batch completion estimation ==

The proposed policies require estimates C(B) of batch completion time. I'm not sure exactly how to compute these, but:
 * They should be based on completed and validated jobs rather than on a priori FLOPs estimates.
 * They should reflect (host, app version) information (e.g. turnaround and elapsed time statistics) for the hosts that have completed jobs, and for the host population as a whole.
 * They should be computed by a daemon process, triggered by the passage of time and by the validation of jobs in the batch.

Notes:
 * C(B) is different from the "logical end time" of the batch used in batch scheduling.
 * For long-deadline batches, C(B) should probably be at least the original delay bound plus the greatest dispatch time of the first instance of a job. I.e., if it takes a long time to dispatch the first instances, adjust the deadline accordingly to avoid creating a deadline crunch.

== Proposed feeder policy ==

Proposed enumeration order:
{{{
(LET(J) asc, nretries desc)
}}}
where LET(J) is the logical end time of the job's batch.

== Proposed scheduler policy ==

For each processor type T (CPU and GPU) we have a "busy time" BT: the time already committed to high-priority jobs. For a given job J we can compute the estimated runtime R. The earliest we can expect to finish J is then BT + R, so that's the earliest deadline we can assign. Call this MD(J, T).

For each app A and processor type T, compute the best app version BAV(A, T) at the start of handling each request.
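The deadline floor MD(J, T) amounts to one addition; a sketch, assuming BT is expressed as seconds of committed work measured from the current time (the names are hypothetical, not the scheduler's actual variables):

```python
import time

def min_deadline(busy_time, est_runtime, now=None):
    """MD(J, T): the earliest feasible deadline for job J on processor
    type T.  busy_time (BT) is the work, in seconds, already committed
    to high-priority jobs on T; est_runtime (R) is J's estimated
    runtime.  Treating BT as a duration from `now` is an assumption.
    """
    if now is None:
        now = time.time()
    return now + busy_time + est_runtime
```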
The rough policy then is:
{{{
for each job J in the array, belonging to batch B
    for each usable app version AV of type T
        if B is AFAP and there's no estimate yet
            if P(H, AV) > 50%
                send J using AV, with deadline BT + R
        else
            x = MD(J, T)
            if x < C(B)
                send J using AV, with deadline C(B)
            else if P(H, AV) > 90%
                send J using AV, with deadline x
}}}
Make an initial pass through the array, sending only jobs that have a percentile requirement.

Notes:
 * The 50% and 90% thresholds can be parameterized.
 * Retries are not handled differently at this level, although we could add a restriction such as sending them only to top-50% hosts.
 * In the startup case (e.g. a new app) no hosts will be high-percentile. How do we avoid starvation?
 * I think that score-based scheduling is now deprecated. The feasibility and/or desirability of a job may depend on what other jobs we're sending, so it doesn't make sense to assign it a score in isolation. It's simpler to scan jobs and make a final decision for each one.

There are a few job properties we need to give priority to:
 * limited locality scheduling
 * beta jobs

We can handle these in separate passes, as we're doing now.
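The decision step of the rough policy above can be expressed as a small function. This is a sketch under stated assumptions: the names are hypothetical, the thresholds are parameters as noted above, and a non-AFAP batch is assumed to always have a completion estimate C(B).

```python
def choose_deadline(is_afap, host_score, md, c_b,
                    afap_pct=50.0, tight_pct=90.0):
    """Decide whether to send a job instance, and with what deadline,
    per the rough policy.  Returns the deadline, or None if the job
    should not be sent with this (host, app version).

    is_afap:    whether the batch B is "as fast as possible"
    host_score: P(H, AV) for this host and app version
    md:         MD(J, T), the earliest feasible deadline (BT + R)
    c_b:        C(B), the batch completion estimate, or None if there
                is no estimate yet
    """
    if is_afap and c_b is None:
        # AFAP batch with no estimate yet: only use reliable hosts,
        # letting them finish at their natural speed (BT + R).
        return md if host_score > afap_pct else None
    # Otherwise a completion estimate C(B) exists (assumed for
    # non-AFAP batches).
    if md < c_b:
        # We can finish before the batch's estimated completion; assign
        # the loose deadline C(B) to avoid unnecessary preemption.
        return c_b
    if host_score > tight_pct:
        # Only highly reliable hosts get a tight deadline.
        return md
    return None
```

Note how the structure realizes the goals above: loose deadlines are preferred whenever they don't delay the batch, and tight deadlines go only to top-percentile hosts.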