Any way to manually change the deadline of a task?
Joined: 25 May 09 · Posts: 1287

Since I'm running single-core tasks, four at a time, I can't answer your question until those have finished (another 8 to start), so you will have to wait. The figure I'm talking about is the "p_flops" described by Richard the other day; again, until I'm running multi-core tasks we won't see what's going on in a controlled situation.

One thing that may be clouding your situation is your inverted use of the cache. Using zero for "store at least" may well be causing some issues: in the past, when I used zero, I had some very strange behaviour with the cache going into a feeding frenzy, so I reverted to a positive number. I've been using a one- or two-day "store at least" figure with a very small "store additional" (currently 0.01 days).

The way the cache system works is quite simple. The "store x days" is the amount of work you want in your cache. The "store additional y days" is, when y is very small, effectively how often your computer will check to see if you need any more work; as y approaches and exceeds x it behaves more as an additional store, but still determines how often work will be called for (unless the main cache is approaching empty). I've got a feeling there is a divide by x in the way the size of the work call is established, and a divide by zero overflows unless there is adequate trapping to set some default value. Personally I'd rather see the work-fetch size under my control rather than that of an unknown other who has no idea what I am trying to achieve.

While we are trying to get to the bottom of this, just leave things as they are and wait for my computer to start running 2-core jobs for a couple of days, then 4-core jobs for another couple of days.
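The two-setting cache behaviour described above can be sketched roughly as follows. This is a deliberate simplification, not the actual BOINC client algorithm: the function name and the "top up to x + y whenever the buffer falls below x" rule are my own assumptions based on the post.

```python
# Hypothetical sketch of cache-driven work fetch (NOT the real BOINC
# client code): request nothing while at least 'store_at_least' days of
# work are buffered; otherwise top the buffer up to
# store_at_least + store_additional days.

def work_request_seconds(buffered_days, store_at_least, store_additional):
    """Return how many seconds of work to request (0 if the cache is full enough)."""
    target = store_at_least + store_additional
    if buffered_days >= store_at_least:
        return 0.0
    shortfall_days = target - buffered_days
    return shortfall_days * 86400  # seconds per day

# With "store at least" = 1 day and "store additional" = 0.01 days, an
# empty cache triggers a request of just over one day of work; a small y
# means frequent small checks rather than one huge fetch.
print(work_request_seconds(0.0, 1.0, 0.01))   # ≈ 87264 seconds (1.01 days)
print(work_request_seconds(1.5, 1.0, 0.01))   # 0.0 — cache already full
```

Note how a "store at least" of zero makes the request fire on every check, which matches the "feeding frenzy" behaviour described in the post.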
Joined: 5 Oct 06 · Posts: 5094

I've been trying to make more sense of it, but I haven't got completely to the bottom of it. It's clear that the jobs issued by PrimeGrid are 'too small' - their server thinks they will take less time than they really will. Those 170 MT jobs that started all this running - the PrimeGrid server sends them out with an estimate of 92 seconds, and Peter's computer 'corrects' that to 11:55.

The tools for doing that - DCF, or 'duration correction factor' - are old. They're kept in the client as a legacy, for use when talking to an old server. Old servers know how to handle them. But PrimeGrid is using a relatively modern server: it reports 'server version 713', which dates it to about the middle of 2018. The old DCF legacy code has been stripped out of the current server code, and I suspect that happened longer ago than 2018 - possibly as long ago as 2012. I'll keep looking.

Since I doubt we're ever going to get that old code back, it would be better if the PrimeGrid staff adjusted the sizes of their tasks to make them more realistic.
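The DCF correction described above is just a multiplier the client learns and applies to the server's estimate. This sketch checks the numbers quoted in the post (92 seconds corrected to about 11:55, i.e. roughly 715 seconds):

```python
# The legacy client multiplies the server's runtime estimate by its
# measured duration correction factor (DCF). Figures from the post:
# server says 92 s, the client has learned tasks really take ~11:55.

def corrected_estimate_s(server_estimate_s, dcf):
    return server_estimate_s * dcf

observed_s = 11 * 60 + 55        # 11:55 = 715 seconds
dcf = observed_s / 92            # ≈ 7.77 — the factor the client learned
print(round(dcf, 2), corrected_estimate_s(92, dcf))
```

A DCF near 7.8 is the client quietly compensating for a server estimate that is almost eight times too small.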
Joined: 25 May 09 · Posts: 1287

Do you know anyone at PrimeGrid to pass this on to? At the foot of PrimeGrid's home page (in very small text) is this message: [Return to PrimeGrid main page] and "(contact)" - a hyperlink which expands to http://www.primegrid.com/contact.php - so I would try that as the first port of call. Rytis Slatkevičius is described elsewhere on the site as the project administrator, so if nothing else he should know who to pass your comments on to. (Unlike a number of other projects, PrimeGrid doesn't appear to have a list of project scientists in an obvious place.)

btw - A quick scan of the various forums and message boards that PrimeGrid offers suggests that a number of other users are "enjoying" the same, or similar, problems as we are, with excessive numbers of tasks being sent, and the level of help, compared with what Richard has proffered, is "minimal".......
Joined: 5 Oct 06 · Posts: 5094

Rytis is a familiar name from BOINC mailing lists, and is definitely a good person to approach. And yes, by "adjusted the sizes of their tasks" I was referring to the estimated size conveyed in <rsc_fpops_est>, but I was trying to be less technical in that post.
Joined: 5 Oct 06 · Posts: 5094

Well, my plucky little Celeron finished its four tasks tonight, very close to the target time. Racked up a DCF of 8.8362. I've now got a batch of 49 Sophie Germain LLR MTs, reckoned to use all four cores for 13 minutes each. Of course, they're showing up at 8.8 times that, or 01:55:13 each - nearly four days in total, for a work fetch of 0.01 + 0.1 days. But it looks like the first one will be done in about 20 minutes. I'll see how many I have left in the morning.
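Checking the arithmetic in that post (a plain calculation from the quoted figures, nothing BOINC-specific):

```python
# A DCF of 8.8362 inflates a 13-minute server estimate to roughly 1:55
# per task, and 49 such tasks come to nearly four days of estimated work.

dcf = 8.8362
server_estimate_min = 13
per_task_min = server_estimate_min * dcf      # ≈ 114.9 min, about 1:55
batch_days = 49 * per_task_min / 60 / 24      # ≈ 3.9 days of estimated work
print(round(per_task_min, 1), round(batch_days, 1))
```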
Joined: 25 May 09 · Posts: 1287

Provided the project has set things up correctly, the work delivered to the user should be the same as the work requested. BUT:

08/12/2020 14:49:33 | PrimeGrid | work fetch resumed by user

The work request is about what I would expect (1 day's worth of work, with four cores in use, but for unknown reasons still only using 1 core per task). So what did I get? 16 tasks, each with an estimated runtime of about 20 hours, on one core per task - that means about four days of work in hand. Not as big a multiplier as you've been seeing, but still an excess of work. (And the initial estimated runtime was 509 hours, so that's well out of order.)

To put it simply - PrimeGrid are being "somewhat generous with the truth" (other phrases might be used).
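The excess described above can be checked with simple arithmetic from the post's own figures (16 tasks at roughly 20 hours each, one task per core across 4 cores, against a 1-day request):

```python
# Work in hand = (tasks * hours each) spread over the available cores.
# Comes out in the region of the "about four days" quoted in the post,
# against a request for just one day.

tasks, hours_each, cores = 16, 20, 4
days_in_hand = tasks * hours_each / cores / 24   # ≈ 3.3 days
print(round(days_in_hand, 2))
```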
Joined: 5 Oct 06 · Posts: 5094

And last night's work finished here, too. I asked for a smaller amount, but I've switched to MT work, using all four cores at once - so only one task can run at a time. The replacement fetch was:

08/12/2020 14:34:50 | PrimeGrid | [sched_op] CPU work request: 38016.00 seconds; 4.00 devices

The work fetch request was for 0.11 days (9,504 seconds, times 4 cores, equals 38,016 - correct). The reply was:

(speed) <flops>8339247311
(size) <rsc_fpops_est>6525083998674

= 782.45 seconds per task
times 49 new tasks = 38,340 seconds (first task to go above the request - correct)
times 1.5503 client DCF = 59,439 (close enough to the client estimate - correct)

Each task is estimated to run for 1,213 seconds, or just over 20 minutes - that's what the manager shows, and that's what's happening in reality. Correct. So far, so good.

BUT: the client request was expressed in cache per core - so four times the requested cache in wall-time. The reply is also expressed as estimated time per core, and would be right if each separate core were working on a separate task: the 49 tasks would finish in 12 cycles of 20 minutes, or four hours. But the estimated speed is four times greater than the speed of a single core - it's the aggregated speed of the whole CPU. And the tasks are running singly, using all four cores. So they'll run for 49 cycles of 20 minutes, or over 16 hours.

I call ***bug*** in the MT part of the server calculation. That old 'size / speed' is in mixed units - '(size per core) / (speed per CPU)'. Now to find it in the code... (or perhaps run it past Rytis first)
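The mixed-units problem described above can be reproduced numerically from the figures in the post. This is a sketch of the claimed arithmetic, not the actual server code:

```python
# Per-task estimate = rsc_fpops_est / host_flops, then * client DCF.
# The catch: host_flops is the aggregate speed of the whole 4-core CPU,
# while each MT task occupies all four cores, so tasks run one at a time.

rsc_fpops_est = 6525083998674     # (size) from the scheduler reply
flops_whole_cpu = 8339247311      # (speed) - aggregate of all 4 cores
dcf = 1.5503                      # client duration correction factor
tasks, cores = 49, 4

per_task_s = rsc_fpops_est / flops_whole_cpu       # ≈ 782.45 s
client_est_s = per_task_s * dcf                    # ≈ 1,213 s, ~20 min

# Server's implicit assumption: one task per core, 49 tasks on 4 cores.
wall_assumed_h = tasks / cores * client_est_s / 3600   # ≈ 4 hours

# Reality: MT tasks run singly, each using all 4 cores.
wall_actual_h = tasks * client_est_s / 3600            # ≈ 16.5 hours

print(round(per_task_s, 2), round(wall_assumed_h, 1), round(wall_actual_h, 1))
```

The factor-of-four gap between the two wall-time figures is exactly the '(size per core) / (speed per CPU)' mismatch the post identifies.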
Joined: 5 Oct 06 · Posts: 5094

As an aside, and keeping it separate from the problems at PrimeGrid, the old server code at Einstein is useful for showing how it should be done.

2020-12-08 14:35:42.7720 [PID=4585 ] [send] Intel GPU: req 4338.97 sec, 0.00 instances; est delay 0.00

That's an extract from the server log visible at Einstein. Some terms need explaining:

'req' is the request from my machine - how much it wanted.
'DCF' we've come across before. It's calculated by the client, and reported to the server.
'delay bound' is how long the deadline is going to be - 14 days, for these tasks.
'duration' is the server's calculation of how long the task will take: 'unscaled' is the raw estimate from speed, 'scaled' takes account of all the fractions, including DCF.

The 'unscaled' figure is less than my request, so on that basis it would have gone on to find another task. But the 'scaled' figure is more than enough, so it didn't need to.

The modern version of that second line is at https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L1498

    log_messages.printf(MSG_NORMAL,
        "[send] on_frac %f active_frac %f gpu_active_frac %f\n",
        g_reply->host.on_frac, g_reply->host.active_frac,
        g_reply->host.gpu_active_frac
    );

No mention of DCF. Likewise, the 'scaled' runtime estimate ignores DCF.
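A rough sketch of the 'unscaled' versus 'scaled' distinction described above. This is a hypothetical simplification with made-up numbers, not the actual Einstein or BOINC server code; the availability fractions and DCF handling follow the post's description:

```python
# 'unscaled' = raw size/speed; 'scaled' additionally accounts for the
# host's availability fractions (less availability -> more wall-clock
# time) and, in the old server code, the client-reported DCF.

def unscaled_duration_s(rsc_fpops_est, host_flops):
    return rsc_fpops_est / host_flops

def scaled_duration_s(rsc_fpops_est, host_flops, on_frac, active_frac, dcf):
    return unscaled_duration_s(rsc_fpops_est, host_flops) * dcf / (on_frac * active_frac)

# Made-up illustrative numbers: a 1e12-flop task on a 5-GFLOPS host
# that is on 90% of the time and crunching 80% of that, with DCF 1.5.
print(unscaled_duration_s(1e12, 5e9))                  # 200.0 s raw
print(round(scaled_duration_s(1e12, 5e9, 0.9, 0.8, 1.5), 1))
```

Against a 300-second request, the unscaled 200 s would trigger another fetch while the scaled ~417 s would not - matching the decision described in the post.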
Joined: 5 Oct 06 · Posts: 5094

Now that everyone has returned refreshed after the holidays (??!!), I've opened communications with PrimeGrid and had some preliminary discussion with Rytis. They've made some small changes to their private code which allow people to specify a lower level of participation in the MT tasks than "100% of available CPUs". I'm pretty sure that there is a generic flaw in the basic server code used by every BOINC project, but PrimeGrid's is so customised that it's hard to provide convincing proof. Does anyone know of a project which (1) uses close-to-standard server code, and (2) provides a reliable supply of consistent MT tasks?
Joined: 28 Jun 10 · Posts: 2579

MT tasks = multi-threaded?
Joined: 10 May 07 · Posts: 1372

MT tasks = multi-threaded?
Yes.
Joined: 29 Aug 05 · Posts: 15507

Milkyway@Home maybe?
Joined: 28 Jun 10 · Posts: 2579

MT tasks = multi-threaded?
Thanks - mind went blank for a while and I just wanted to confirm.
Joined: 5 Oct 06 · Posts: 5094

Milkyway@Home maybe?
It's worth a thought. Their N-body tasks (the only MT ones) were in total chaos a few years ago, but I remember someone saying they'd cleaned up their act. All my other projects have given up on me (GPUGrid - end of research run; NumberFields - crashed hard disk; SETI - went to meet its maker; ...)

Oh, ********. The server status page says "Upstream server release: 1.0.4", but the scheduler reply says "[sched_op] Server version 713". So which is it, and from when? I thought I'd asked CERN to sort that one out.
Joined: 29 Aug 05 · Posts: 15507

I don't think they've mapped out the whole Milkyway yet.

Edit: and their server version looks pretty much bog-standard: https://github.com/BOINC/boinc/tree/server_release/1.0/1.0.4
Joined: 5 Oct 06 · Posts: 5094

See edit!
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.