Changes between Version 36 and Version 37 of CreditNew


Timestamp:
May 11, 2012, 9:16:19 AM
Author:
davea
Comment:

--

  • CreditNew

    v36 v37  
    22
    33== Terminology ==
     4
     5 * The '''runtime''' (or '''elapsed time''') of a job is the
     6   amount of time it runs.
     7 * '''FLOPs''' (lower case s) means number of floating-point operations.
     8 * '''FLOPS''' (upper case S) means FLOPs per second.
    49
    510BOINC estimates the '''peak FLOPS''' of each processor.
     
    2126 * For our purposes, the peak FLOPS of a device
    2227   is based on single or double precision, whichever is higher.
    23 
    24 == Credit system goals ==
    25 
    26 Some goals in designing a credit system:
    27  * Device neutrality: similar jobs should get similar credit
    28    regardless of what processor or GPU they run on.
    29  * Project neutrality: different projects should grant
    30    about the same amount of credit per host, averaged over all hosts.
    31  * Gaming-resistance: there should be a bound on the
    32    impact of faulty or malicious hosts.
     28 * BOINC's estimate of the peak FLOPS of a device may be wrong,
     29   e.g. because the manufacturer's formula is incomplete or wrong.
    3330
    3431== The first credit system ==
     
    9087   (This means that projects with efficient GPU apps will
    9188   grant more credit than projects with inefficient apps.  That's OK).
     89 * Cheat-resistance.
    9290
    9391== ''A priori'' job size estimates and bounds ==
     
    9593For each job, the project supplies
    9694 * an estimate of the FLOPs used by a job (wu.fpops_est)
    97  * a limit on FLOPS, after which the job will be aborted
     95 * a limit on FLOPs, after which the job will be aborted
    9896  (wu.fpops_bound).
    9997
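A minimal sketch of how the two per-job values might be used. The `Workunit` class, the `should_abort` helper, and the way the FLOPs done so far are measured are all hypothetical and only illustrate the estimate/bound distinction; they are not the actual BOINC data structures.

```python
from dataclasses import dataclass


@dataclass
class Workunit:
    """Hypothetical stand-in for the per-job fields named above."""
    fpops_est: float    # project's a priori estimate of the job's FLOPs
    fpops_bound: float  # FLOPs limit after which the job is aborted


def should_abort(wu: Workunit, flops_done: float) -> bool:
    """Return True once the job has exceeded its FLOPs limit.

    `flops_done` is assumed to be supplied by the caller; how it is
    measured is outside the scope of this sketch.
    """
    return flops_done > wu.fpops_bound
```

For example, a job with `fpops_est = 1e12` and `fpops_bound = 1e13` would keep running at `5e12` FLOPs and be aborted at `2e13`.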
     
    104102Averages of FLOP count and elapsed time
    105103are normalized by fpops_est (see below),
    106 and if fpops_est is correlated with actual size,
     104and if fpops_est is correlated with runtime,
    107105these averages will converge more quickly.
    108106
     
    122120based on the resources used by the job and their peak speeds.
    123121
    124 When the job is finished in elapsed time T,
     122When a client finishes a job and reports its elapsed time T,
    125123we define peak_flop_count(J), or PFC(J) as
    126124
     
    175173   is above a '''sample threshold'''.
    176174
    177 == Data ==
    178 
    179 We maintain the following estimates:
    180 
    181  app.min_avg_pfc:: an estimate of the average actual FLOPS for the app
    182    (normalized by wu.fpops_est)
    183  app_version.pfc_avg:: the average of PFC(J)/wu.fpops_est for an app version.
    184  app_version.pfc_scale:: a PFC scale factor for the app version
     175== Statistics maintained by the server ==
     176
     177The server maintains the following statistics:
     178
    185179 host_app_version.pfc_avg:: for each app version V and host H,
    186180   the average of PFC(J)/wu.fpops_est for jobs completed by H using V.
    187  host_app_version.scale_probation::
    188    if set, the host is suspected of cherry-picking (see below)
    189    and we don't use host normalization
     181 app_version.pfc_avg:: the average of PFC(J)/wu.fpops_est for all jobs
     182   completed by the app version.
    190183
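Both statistics are averages of PFC(J)/wu.fpops_est. The sketch below shows one way such a normalized average could be kept incrementally. The `PfcAverage` class is hypothetical, and a plain arithmetic mean is assumed here; the text above does not say which kind of average the server uses.

```python
class PfcAverage:
    """Hypothetical running average of PFC(J) / wu.fpops_est."""

    def __init__(self) -> None:
        self.n = 0      # number of jobs included so far
        self.avg = 0.0  # current mean of the normalized PFC values

    def update(self, pfc: float, fpops_est: float) -> None:
        """Fold one completed job into the average."""
        if fpops_est <= 0:
            raise ValueError("fpops_est must be positive")
        x = pfc / fpops_est              # normalize by the size estimate
        self.n += 1
        self.avg += (x - self.avg) / self.n  # incremental mean update
```

One instance per app version, and one per (host, app version) pair, would correspond to the two statistics listed above.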
    191184== Sanity check ==
    192185
    193 If PFC(J) is infinite or is > wu.fpops_bound,
    194 J is assigned a "default PFC" D and other processing is skipped.
     186If PFC(J) is > wu.fpops_bound,
     187J is assigned a "default PFC" D and it's not used to update statistics.
    195188D is determined as follows:
    196189
     
    202195
    203196   D = wu.fpops_est
    204 
    205 We also set host_app_version.scale_probation to true
    206 (ensuring that the host scale factor isn't used for a while)
    207 and host_app_version.error_rate to an initial value
    208 (ensuring that jobs sent to this host are replicated for a while).
    209197
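The sanity check can be sketched as follows. The function name and the returned flag are hypothetical. The default PFC D is passed in by the caller, because this sketch does not try to reproduce the rules for choosing D.

```python
def sanity_check(pfc: float, fpops_bound: float, default_pfc: float):
    """Apply the sanity check to one job.

    Returns (pfc_to_use, use_for_stats):
      * if pfc exceeds fpops_bound, the job gets the default PFC D
        and must not be used to update statistics;
      * otherwise the job keeps its own PFC and counts toward statistics.
    """
    if pfc > fpops_bound:
        return default_pfc, False
    return pfc, True
```

For instance, with `fpops_bound = 1e13` and D taken to be `wu.fpops_est = 1e12`, a job reporting a PFC of `5e13` would be assigned `1e12` and left out of the averages.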
    210198== Cross-version normalization ==
     
    243231
    244232Notes:
    245  * Doesn't host normalization (see below) subsume version normalization?
    246    Not if there are both CPU and GPU versions, because of the "min".
    247233 * Version normalization is only applied if at least two
    248234   versions are above sample threshold.
     
    260246   One solution is to create separate apps for separate types of jobs.
    261247 * Cheating or erroneous hosts can influence app_version.pfc_avg to some extent.
    262    This is limited by the Sanity Check mechanism,
     248   This is limited by the "sanity check" mechanism,
    263249   and by the fact that only validated jobs are used.
    264250   The effect on credit will be negated by host normalization
     
    277263
    278264 app_version.pfc_avg / host_app_version.pfc_avg
     265
     266This scaling is only done if both statistics are above sample threshold.
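
A minimal sketch of this scale factor. The function name, the use of sample counts to decide whether a statistic is "above sample threshold", and the fallback of 1.0 (no scaling) are assumptions made for illustration.

```python
def host_scale(app_version_pfc_avg: float, app_version_n: int,
               host_pfc_avg: float, host_n: int,
               sample_threshold: int) -> float:
    """Return app_version.pfc_avg / host_app_version.pfc_avg.

    The ratio is used only if both statistics are above the sample
    threshold; otherwise 1.0 is returned, meaning no host scaling.
    """
    enough_samples = (app_version_n > sample_threshold
                      and host_n > sample_threshold)
    if not enough_samples or host_pfc_avg <= 0:
        return 1.0
    return app_version_pfc_avg / host_pfc_avg
```

A host whose average normalized PFC is twice the app version's average gets a factor of 0.5, so its claimed PFC is halved.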
    279267
    280268There are some cases where hosts are not sent jobs uniformly:
     
    309297If app.min_avg_pfc is defined,
    310298host_app_version.pfc_avg is above sample threshold,
    311 and host_app_version.scale_probation is not set,
    312299we normalize PFC by the factor
    313300
     
    562549(from which job durations are estimated).
    563550
    564 == Job runtime estimates ==
    565 
    566 Unrelated to the credit proposal, but in a similar spirit.
    567 The server will maintain host_app_version.et,
    568 the statistics (mean and variance) of
    569 job runtimes (normalized by wu.fpops_est) per
    570 host and application version.
    571 
    572 The server's estimate of a job's runtime is then
    573 
    574  R(J, H) = wu.fpops_est * host_app_version.et.avg
    575 
    576551== Implementation ==
    577552
     
    653628 * If we're the "main feeder" (mod = 0, or mod not used),
    654629   update app_version.pfc_scale and app.min_avg_pfc every 10 minutes.
     630