
A job sent to a client is associated with an app version,
which uses some number (possibly fractional) of CPUs
and some number of instances of a particular coprocessor type.

== Scheduler request and reply message ==

New fields in the scheduler request message:

'''double cpu_req_secs''':: number of CPU seconds requested
'''double cpu_req_instances''':: send enough jobs to occupy this many CPUs

And for each coprocessor type:

'''double req_secs''':: number of instance-seconds requested
'''double req_instances''':: send enough jobs to occupy this many instances

The semantics: a scheduler should send jobs for a resource type
only if the request for that type is nonzero.

For compatibility with old servers, the message still has '''work_req_seconds''',
which is the max of the per-resource req_secs values.

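The backward-compatibility rule can be sketched as follows. The struct and function names here are illustrative, not the actual BOINC types:

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch of the request-message semantics described above;
// the struct and field names are assumptions, not the real BOINC types.
struct CoprocRequest {
    double req_secs;       // instance-seconds requested
    double req_instances;  // enough jobs to occupy this many instances
};

// For old servers, work_req_seconds is the max of the per-resource
// req_secs values (CPU and each coprocessor type).
double work_req_seconds(double cpu_req_secs,
                        const std::vector<CoprocRequest>& coprocs) {
    double m = cpu_req_secs;
    for (const CoprocRequest& c : coprocs) {
        m = std::max(m, c.req_secs);
    }
    return m;
}
```

An old server sees a single scalar request, so a large GPU request still triggers a work fetch even if the CPU request is small.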
== Per-resource-type backoff ==

We need to handle the situation where, e.g., there's a GPU shortfall
but no projects are supplying GPU work
(for either permanent or transient reasons).
We don't want an overall work-fetch backoff from those projects.

Instead, we maintain a separate backoff timer per (project, resource type).
The backoff interval is doubled, up to a limit of one day, whenever we ask for work of that type and get none;
it is cleared whenever we ask for work of that type and get a job.
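A minimal sketch of this timer; the 1-minute initial interval is an assumption (the text doesn't specify the starting value), while the 1-day cap and clear-on-success behavior are as described above:

```cpp
#include <algorithm>

// Per-(project, resource type) backoff timer, as described above.
// The 1-minute initial interval is an assumption.
struct RscProjectBackoff {
    double backoff_interval = 0;  // seconds; 0 = not backed off
    double backoff_time = 0;      // back off until this time

    // We asked for this type of work and got none: double the
    // interval, up to one day.
    void request_failed(double now) {
        const double max_interval = 86400;  // cap: 1 day
        const double min_interval = 60;     // assumed initial value
        backoff_interval = (backoff_interval == 0)
            ? min_interval
            : std::min(backoff_interval * 2, max_interval);
        backoff_time = now + backoff_interval;
    }

    // We asked for this type of work and got a job: clear the backoff.
    void request_succeeded() {
        backoff_interval = 0;
        backoff_time = 0;
    }

    bool backed_off(double now) const { return now < backoff_time; }
};
```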
Currently there are two resource types: CPU and NVIDIA GPUs.

Summary of the new policy: it's like the old policy,
but with a separate copy for each resource type,
and scheduler requests can now ask for work for particular resource types.


== Client data structures ==

=== RSC_WORK_FETCH ===

Work-fetch state for a particular resource type.
There are instances for CPU ('''cpu_work_fetch''') and NVIDIA GPUs ('''cuda_work_fetch''').
Data members:

'''ninstances''':: number of instances of this resource type

Used/set by rr_simulation():

'''double shortfall''':: shortfall for this resource
'''double nidle''':: number of currently idle instances

Member functions:

'''rr_init()''':: called at the start of RR simulation. Compute project shares for this PRSC, and clear overall and per-project shortfalls.
'''set_nidle()''':: called by RR sim after initial job assignment.
Set nidle to # of idle instances.
'''accumulate_shortfall()''':: called by RR sim for each time interval during the work-buffering period.
{{{
shortfall += dt*(ninstances - instances in use)
for each project p not backed off for this PRSC
    p->PRSC_PROJECT_DATA.accumulate_shortfall(dt)
}}}
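
A rough C++ transcription of the pseudocode above, with the structures pared down to the fields the formula needs. The per-project accrual is assumed to use the same idle-capacity term, since PRSC_PROJECT_DATA::accumulate_shortfall() isn't spelled out here:

```cpp
#include <vector>

// Rough transcription of accumulate_shortfall(); names are illustrative.
struct ProjectRscState {
    bool backed_off = false;  // backed off for this PRSC?
    double shortfall = 0;
};

struct RscWorkFetch {
    double ninstances = 0;
    double shortfall = 0;

    // One RR-sim interval of length dt with `in_use` instances busy:
    // idle capacity accrues as shortfall, overall and per project
    // (only projects not backed off for this resource share in it).
    void accumulate_shortfall(double dt, double in_use,
                              std::vector<ProjectRscState>& projects) {
        double idle_capacity = dt * (ninstances - in_use);
        shortfall += idle_capacity;
        for (ProjectRscState& p : projects) {
            if (!p.backed_off) p.shortfall += idle_capacity;
        }
    }
};
```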

'''select_project()''':: select the best project from which to request this type of work: the project that is not backed off for this PRSC and for which LTD + p->shortfall is largest, also taking into account overworked projects, etc.

'''accumulate_debt(dt)''':: for each project p:
{{{
x = instances of this device used by p's running jobs
y = p's share of this device
update p's LTD
}}}
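
The LTD update isn't fully specified above; one plausible reading, sketched here as an assumption, is that debt grows when a project receives less than its share of the device:

```cpp
// Assumed reading of the LTD update in accumulate_debt(dt):
// debt changes by dt * (share - usage). The actual client may
// normalize differently.
struct ProjectDebt {
    double long_term_debt = 0;

    // x = instances of this device used by the project's running jobs
    // y = the project's share of this device
    void accumulate_debt(double dt, double x, double y) {
        long_term_debt += dt * (y - x);
    }
};
```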

=== RSC_PROJECT_WORK_FETCH ===

State for a (resource type, project) pair.
It has the following "persistent" members (i.e., saved in the state file):

'''backoff_interval''':: how long to wait before asking this project for work specifically for this PRSC;
doubled (up to a maximum of 24 hours) any time we ask for work for this resource and get none. Cleared when we ask for work for this PRSC and get a job.
'''backoff_time''':: back off until this time
'''debt''':: long-term debt

And the following transient members (used by rr_simulation()):

'''double runnable_share''':: # of instances this project should get, based on its resource share
relative to the set of projects not backed off for this PRSC.
'''instances_used''':: # of instances currently being used

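For example, runnable_share might be computed like this (a sketch; the function shape is an assumption):

```cpp
#include <vector>

// Sketch of runnable_share: this project's slice of the resource's
// instances, weighted by resource share among the projects not backed
// off for this PRSC. The signature is illustrative.
double runnable_share(double my_resource_share, double ninstances,
                      const std::vector<double>& active_shares) {
    // active_shares holds the resource shares of all projects not
    // backed off for this PRSC, including this one.
    double total = 0;
    for (double s : active_shares) total += s;
    if (total <= 0) return 0;
    return ninstances * my_resource_share / total;
}
```
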
=== PROJECT_WORK_FETCH ===

Per-project work-fetch state.
Members:
'''overall_debt''':: weighted sum of per-resource debts

=== WORK_FETCH ===

Overall work-fetch state.

'''PROJECT* choose_project()''':: choose a project from which to fetch work.

* Do a round-robin simulation.
* If a GPU is idle, choose a project to ask for that type of work (using RSC_WORK_FETCH::select_project()).
* If a CPU is idle, choose a project to ask for CPU work.
* If a GPU has a shortfall, choose a project to ask for GPU work.
* If a CPU has a shortfall, choose a project to ask for CPU work.
In the case where a resource type was idle, ask for only that type of work.
== Summary of the new policy ==

Every 60 seconds, and when various events happen (e.g., jobs finish),
the following is done.
CI is the "connect interval" preference;
AW is the "additional work" preference.

Auxiliary functions:

'''get_major_shortfall(resource)'''

If the resource will have an idle instance before CI,
return the greatest-overall-debt non-backed-off project P
(P may be overworked). Otherwise return NULL.

'''get_minor_shortfall(resource)'''

If the resource will have an idle instance between CI and CI+AW,
return the greatest-overall-debt non-backed-off, non-overworked project P.

'''get_starved_project(resource)'''

If any project is not overworked, not backed off, and has no runnable jobs
for any resource, return the one with greatest overall debt.

Main logic:
* Do a round-robin simulation of currently queued jobs.
* p = get_major_shortfall(NVIDIA GPU); if p != NULL, ask it for work and return.
* ... same for other coprocessor types (we assume that coprocessors are faster, hence more important, than the CPU).
* ... same for the CPU.
* p = get_minor_shortfall(NVIDIA GPU); if p != NULL, ask it for work and return.
* ... same for other coprocessor types, then the CPU.
* p = get_starved_project(NVIDIA GPU); if p != NULL, ask it for work and return.
* ... same for other coprocessor types, then the CPU.

In the get_major_shortfall() case, ask only for work of that resource type.
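
The main logic above amounts to three passes over the resource types, coprocessors before CPU in each pass. A sketch, with types and helper signatures as assumptions:

```cpp
#include <vector>

// Sketch of the main logic: try get_major_shortfall for every resource
// (coprocessor types first, CPU last), then get_minor_shortfall, then
// get_starved_project. Types and signatures are illustrative.
struct Project { int id; };
using Getter = Project* (*)(int rsc);

// `resources` lists resource-type ids, coprocessors first, CPU last.
Project* choose_project(const std::vector<int>& resources,
                        Getter get_major_shortfall,
                        Getter get_minor_shortfall,
                        Getter get_starved_project) {
    // (A round-robin simulation of queued jobs would run first.)
    for (Getter get : {get_major_shortfall, get_minor_shortfall,
                       get_starved_project}) {
        for (int rsc : resources) {
            if (Project* p = get(rsc)) return p;  // ask p for work
        }
    }
    return nullptr;  // no work fetch this time
}
```

Note that a major shortfall anywhere beats a minor shortfall everywhere: the passes are ordered by urgency, not by resource.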
== Scheduler changes ==