Any way to manually change the deadline of a task?
Joined: 25 May 09 · Posts: 1287

Since I'm running single-core tasks, four at a time, I can't answer your question until those have finished (another 8 to start), so you will have to wait. The figure I'm talking about is the "p_flops" described by Richard the other day; again, until I'm running multi-core tasks we won't see what's going on in a controlled situation.

One thing that may be clouding your situation is your inverted use of the cache. Using zero for "store at least" may well be causing some issues: in the past, when I used zero, I had some very strange behaviour with the cache going into a feeding frenzy, so I reverted to a positive number. I've been using a one- or two-day "store at least" figure with a very small "store additional" (currently 0.01 days).

The way the cache system works is quite simple. The "store x days" is the amount of work you want in your cache. The "store additional y days" is, when y is very small, effectively how often your computer will check to see if you need any more work; as y approaches and exceeds x it behaves more as an additional store, but still determines how often work will be called for (unless the main cache is approaching empty). I've got a feeling there is a divide by x in the way the size of the work call is established, and a divide by zero overflows unless there is adequate trapping to set some default value. Personally I'd rather see the work-fetch size under my control rather than that of an unknown other who has no idea what I am trying to achieve.

While we are trying to get to the bottom of this, just leave things as they are and wait for my computer to start running 2-core jobs for a couple of days, then 4-core jobs for another couple of days.
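The two-setting cache behaviour described above can be sketched roughly as follows. This is a deliberate simplification, not the actual BOINC client algorithm: the function name and the "top up to x + y whenever the buffer falls below x" rule are my own assumptions based on the post.

```python
# Hypothetical sketch of cache-driven work fetch (NOT the real BOINC
# client code): request nothing while at least 'store_at_least' days of
# work are buffered; otherwise top the buffer up to
# store_at_least + store_additional days.

def work_request_seconds(buffered_days, store_at_least, store_additional):
    """Return how many seconds of work to request (0 if the cache is full enough)."""
    target = store_at_least + store_additional
    if buffered_days >= store_at_least:
        return 0.0
    shortfall_days = target - buffered_days
    return shortfall_days * 86400  # seconds per day

# With "store at least" = 1 day and "store additional" = 0.01 days, an
# empty cache triggers a request of just over one day of work; a small y
# means frequent small checks rather than one huge fetch.
print(work_request_seconds(0.0, 1.0, 0.01))   # ≈ 87264 seconds (1.01 days)
print(work_request_seconds(1.5, 1.0, 0.01))   # 0.0 — cache already full
```

Note how a "store at least" of zero makes the request fire on every check, which matches the "feeding frenzy" behaviour described in the post.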
Joined: 5 Oct 06 · Posts: 5094

I've been trying to make more sense of it, but I haven't got completely to the bottom of it. It's clear that the jobs issued by PrimeGrid are 'too small' - their server thinks they will take less time than they really will. Those 170 MT jobs that started all this running - the PrimeGrid server sends them out with an estimate of 92 seconds, and Peter's computer 'corrects' that to 11:55.

The tools for doing that - DCF, or 'duration correction factor' - are old. They're kept in the client as a legacy, for use when talking to an old server. Old servers know how to handle them. But PrimeGrid is using a relatively modern server: it reports 'server version 713', which dates it to about the middle of 2018. The old DCF legacy code has been stripped out of the current server code, and I suspect that happened longer ago than 2018 - possibly as long ago as 2012. I'll keep looking.

Since I doubt we're ever going to get that old code back, it would be better if the PrimeGrid staff adjusted the sizes of their tasks to make them more realistic.
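The DCF correction described above is just a multiplier the client learns and applies to the server's estimate. This sketch checks the numbers quoted in the post (92 seconds corrected to about 11:55, i.e. roughly 715 seconds):

```python
# The legacy client multiplies the server's runtime estimate by its
# measured duration correction factor (DCF). Figures from the post:
# server says 92 s, the client has learned tasks really take ~11:55.

def corrected_estimate_s(server_estimate_s, dcf):
    return server_estimate_s * dcf

observed_s = 11 * 60 + 55        # 11:55 = 715 seconds
dcf = observed_s / 92            # ≈ 7.77 — the factor the client learned
print(round(dcf, 2), corrected_estimate_s(92, dcf))
```

A DCF near 7.8 is the client quietly compensating for a server estimate that is almost eight times too small.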
Joined: 25 May 09 · Posts: 1287

Do you know anyone at PrimeGrid to pass this on to? At the foot of PrimeGrid's home page (in very small text) is this message: [Return to PrimeGrid main page] and "(contact)" - a hyperlink which expands to http://www.primegrid.com/contact.php - so I would try that as the first port of call. Rytis Slatkevičius is described elsewhere on the site as the project administrator, so if nothing else he should know who to pass your comments on to. (Unlike a number of other projects, PrimeGrid doesn't appear to have a list of project scientists in an obvious place.)

btw - A quick scan of the various forums and message boards that PrimeGrid offers suggests that a number of other users are "enjoying" the same, or similar, problems as we are, with excessive numbers of tasks being sent, and the level of help, compared with what Richard has proffered, is "minimal".......
Joined: 5 Oct 06 · Posts: 5094

Rytis is a familiar name from BOINC mailing lists, and is definitely a good person to approach. And yes, by "adjusted the sizes of their tasks" I was referring to the estimated size conveyed in <rsc_fpops_est>, but I was trying to be less technical in that post.
Joined: 5 Oct 06 · Posts: 5094

Well, my plucky little Celeron finished its four tasks tonight, very close to the target time. Racked up a DCF of 8.8362. I've now got a batch of 49 Sophie Germain LLR MTs, reckoned to use all four cores for 13 minutes each. Of course, they're showing up at 8.8 times that, or 01:55:13 each - nearly four days in total, for a work fetch of 0.01 + 0.1 days. But it looks like the first one will be done in about 20 minutes. I'll see how many I have left in the morning.
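Checking the arithmetic in that post (a plain calculation from the quoted figures, nothing BOINC-specific):

```python
# A DCF of 8.8362 inflates a 13-minute server estimate to roughly 1:55
# per task, and 49 such tasks come to nearly four days of estimated work.

dcf = 8.8362
server_estimate_min = 13
per_task_min = server_estimate_min * dcf      # ≈ 114.9 min, about 1:55
batch_days = 49 * per_task_min / 60 / 24      # ≈ 3.9 days of estimated work
print(round(per_task_min, 1), round(batch_days, 1))
```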
Joined: 25 May 09 · Posts: 1287

Provided the project has set things up correctly, the work delivered to the user should be the same as the work requested. BUT:

08/12/2020 14:49:33 | PrimeGrid | work fetch resumed by user

The work request is about what I would expect (1 day's worth of work, with four cores in use, but for unknown reasons still only using 1 core per task). So what did I get? 16 tasks, each with an estimated runtime of about 20 hours, on one core per task - that means about four days of work in hand. Not as big a multiplier as you've been seeing, but still an excess of work. (And the initial estimated runtime was 509 hours, so that's well out of order.)

To put it simply - PrimeGrid are being "somewhat generous with the truth" (other phrases might be used).
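The excess described above can be checked with simple arithmetic from the post's own figures (16 tasks at roughly 20 hours each, one task per core across 4 cores, against a 1-day request):

```python
# Work in hand = (tasks * hours each) spread over the available cores.
# Comes out in the region of the "about four days" quoted in the post,
# against a request for just one day.

tasks, hours_each, cores = 16, 20, 4
days_in_hand = tasks * hours_each / cores / 24   # ≈ 3.3 days
print(round(days_in_hand, 2))
```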
Joined: 5 Oct 06 · Posts: 5094

And last night's work finished here, too. I asked for a smaller amount, but I've switched to MT work, using all four cores at once - so only one task can run at a time. The replacement fetch was:

08/12/2020 14:34:50 | PrimeGrid | [sched_op] CPU work request: 38016.00 seconds; 4.00 devices

The work fetch request was for 0.11 days (9,504 seconds, times 4 cores, equals 38,016 - correct). The reply was:

(speed) <flops>8339247311
(size) <rsc_fpops_est>6525083998674

= 782.45 seconds per task
times 49 new tasks = 38,340 seconds (first task to go above the request - correct)
times 1.5503 client DCF = 59,439 (close enough to the client estimate - correct)

Each task is estimated to run for 1,213 seconds, or just over 20 minutes - that's what the manager shows, and that's what's happening in reality. Correct. So far, so good.

BUT: the client request was expressed in cache per core - so four times the requested cache in wall-time. The reply is also expressed as estimated time per core, and would be right if each separate core were working on a separate task: the 49 tasks would finish in 12 cycles of 20 minutes, or four hours. But the estimated speed is four times greater than the speed of a single core - it's the aggregated speed of the whole CPU. And the tasks are running singly, using all four cores. So they'll run for 49 cycles of 20 minutes, or over 16 hours.

I call ***bug*** in the MT part of the server calculation. That old 'size / speed' is in mixed units - '(size per core) / (speed per CPU)'. Now to find it in the code... (or perhaps run it past Rytis first)
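The mixed-units problem described above can be reproduced numerically from the figures in the post. This is a sketch of the claimed arithmetic, not the actual server code:

```python
# Per-task estimate = rsc_fpops_est / host_flops, then * client DCF.
# The catch: host_flops is the aggregate speed of the whole 4-core CPU,
# while each MT task occupies all four cores, so tasks run one at a time.

rsc_fpops_est = 6525083998674     # (size) from the scheduler reply
flops_whole_cpu = 8339247311      # (speed) - aggregate of all 4 cores
dcf = 1.5503                      # client duration correction factor
tasks, cores = 49, 4

per_task_s = rsc_fpops_est / flops_whole_cpu       # ≈ 782.45 s
client_est_s = per_task_s * dcf                    # ≈ 1,213 s, ~20 min

# Server's implicit assumption: one task per core, 49 tasks on 4 cores.
wall_assumed_h = tasks / cores * client_est_s / 3600   # ≈ 4 hours

# Reality: MT tasks run singly, each using all 4 cores.
wall_actual_h = tasks * client_est_s / 3600            # ≈ 16.5 hours

print(round(per_task_s, 2), round(wall_assumed_h, 1), round(wall_actual_h, 1))
```

The factor-of-four gap between the two wall-time figures is exactly the '(size per core) / (speed per CPU)' mismatch the post identifies.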
Joined: 5 Oct 06 · Posts: 5094

As an aside, and keeping it separate from the problems at PrimeGrid, the old server code at Einstein is useful for showing how it should be done.

2020-12-08 14:35:42.7720 [PID=4585 ] [send] Intel GPU: req 4338.97 sec, 0.00 instances; est delay 0.00

That's an extract from the server log visible at Einstein. Some terms need explaining:

'req' is the request from my machine - how much it wanted.
'DCF' we've come across before. It's calculated by the client, and reported to the server.
'delay bound' is how long the deadline is going to be - 14 days, for these tasks.
'duration' is the server's calculation of how long the task will take: 'unscaled' is the raw estimate from speed, 'scaled' takes account of all the fractions, including DCF.

The 'unscaled' figure is less than my request, so on that basis it would have gone on to find another task. But the 'scaled' figure is more than enough, so it didn't need to.

The modern version of that second line is at https://github.com/BOINC/boinc/blob/master/sched/sched_send.cpp#L1498

    log_messages.printf(MSG_NORMAL,
        "[send] on_frac %f active_frac %f gpu_active_frac %f\n",
        g_reply->host.on_frac, g_reply->host.active_frac,
        g_reply->host.gpu_active_frac
    );

No mention of DCF. Likewise, the 'scaled' runtime estimate ignores DCF.
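A rough sketch of the 'unscaled' versus 'scaled' distinction described above. This is a hypothetical simplification with made-up numbers, not the actual Einstein or BOINC server code; the availability fractions and DCF handling follow the post's description:

```python
# 'unscaled' = raw size/speed; 'scaled' additionally accounts for the
# host's availability fractions (less availability -> more wall-clock
# time) and, in the old server code, the client-reported DCF.

def unscaled_duration_s(rsc_fpops_est, host_flops):
    return rsc_fpops_est / host_flops

def scaled_duration_s(rsc_fpops_est, host_flops, on_frac, active_frac, dcf):
    return unscaled_duration_s(rsc_fpops_est, host_flops) * dcf / (on_frac * active_frac)

# Made-up illustrative numbers: a 1e12-flop task on a 5-GFLOPS host
# that is on 90% of the time and crunching 80% of that, with DCF 1.5.
print(unscaled_duration_s(1e12, 5e9))                  # 200.0 s raw
print(round(scaled_duration_s(1e12, 5e9, 0.9, 0.8, 1.5), 1))
```

Against a 300-second request, the unscaled 200 s would trigger another fetch while the scaled ~417 s would not - matching the decision described in the post.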
Joined: 5 Oct 06 · Posts: 5094

Now that everyone has returned refreshed after the holidays (??!!), I've opened communications with PrimeGrid and had some preliminary discussion with Rytis. They've made some small changes to their private code which allow people to specify a lower level of participation in the MT tasks than "100% of available CPUs". I'm pretty sure that there is a generic flaw in the basic server code used by every BOINC project, but PrimeGrid's is so customised that it's hard to provide convincing proof. Does anyone know of a project which (1) uses close-to-standard server code, and (2) provides a reliable supply of consistent MT tasks?
Joined: 28 Jun 10 · Posts: 2579

MT tasks = multi-threaded?
Joined: 10 May 07 · Posts: 1372

MT tasks = multi-threaded?
Yes.
Joined: 29 Aug 05 · Posts: 15507

Milkyway@Home maybe?
Joined: 28 Jun 10 · Posts: 2579

MT tasks = multi-threaded?
Thanks - mind went blank for a while and I just wanted to confirm.
Joined: 5 Oct 06 · Posts: 5094

Milkyway@Home maybe?
It's worth a thought. Their N-body tasks (the only MT ones) were in total chaos a few years ago, but I remember someone saying they'd cleaned up their act. All my other projects have given up on me (GPUGrid - end of research run; NumberFields - crashed hard disk; SETI - went to meet its maker; ...)

Oh, ********. The server status page says "Upstream server release: 1.0.4", but the scheduler reply says "[sched_op] Server version 713". So which is it, and from when? I thought I'd asked CERN to sort that one out.
Joined: 29 Aug 05 · Posts: 15507

I don't think they've mapped out the whole Milkyway yet.

Edit: and their server version looks pretty much bog-standard: https://github.com/BOINC/boinc/tree/server_release/1.0/1.0.4
Joined: 5 Oct 06 · Posts: 5094

See edit!
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.