Message boards : GPUs : One Nvidia GPU unable to process after a couple of days.
Author | Message |
---|---|
Joined: 28 Oct 21 Posts: 7 |
I have the following issue: one of the two NVIDIA GPUs no longer computes after a couple of days. The only way to get work again is to remove all the tasks or reset the project, but I have to do this every 2-3 days. This happens on a couple of machines. Here are the specs:

Project: Einstein@Home
GPUs: 2x RTX 2070
CPU: Intel i7 (8th gen), with 8 GB RAM
BOINC version: 7.16.20

There are plenty of GPU tasks enabled. I have turned on the coproc_debug, cpu_sched_debug and work_fetch_debug options (the matching cc_config.xml is sketched after this post). Here is the event log. I noticed that the device is not able to run because the CPU is committed, but I have set the CPU limit to 70%, so I thought there would be plenty of CPU headroom. I tried setting it lower; it makes no difference.

5/17/2022 9:39:12 AM | | Re-reading cc_config.xml
5/17/2022 9:39:12 AM | | Config: GUI RPCs allowed from:
5/17/2022 9:39:12 AM | | 172.16.0.23
5/17/2022 9:39:12 AM | | Config: use all coprocessors
5/17/2022 9:39:12 AM | | log flags: file_xfer, task, coproc_debug, cpu_sched_debug, work_fetch_debug
5/17/2022 9:39:12 AM | | [cpu_sched_debug] Request CPU reschedule: Core client configuration
5/17/2022 9:39:12 AM | | [work_fetch] Request work fetch: Core client configuration
5/17/2022 9:39:12 AM | | [cpu_sched_debug] schedule_cpus(): start
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] reserving 1.000000 of coproc NVIDIA
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: LATeah3012L08_796.0_0_0.0_32509764_0 (NVIDIA GPU, FIFO) (prio -1.000000)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] reserving 1.000000 of coproc NVIDIA
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: LATeah3012L08_796.0_0_0.0_32507847_0 (NVIDIA GPU, FIFO) (prio -1.020764)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3800_0 (CPU, EDF) (prio -1.041527)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3552_0 (CPU, EDF) (prio -1.041562)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3560_0 (CPU, EDF) (prio -1.041597)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3088_0 (CPU, EDF) (prio -1.041632)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1752_0 (CPU, EDF) (prio -1.041667)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] add to run list: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1896_0 (CPU, EDF) (prio -1.041702)
5/17/2022 9:39:12 AM | | [cpu_sched_debug] enforce_run_list(): start
5/17/2022 9:39:12 AM | | [cpu_sched_debug] preliminary job list:
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 0: LATeah3012L08_796.0_0_0.0_32509764_0 (MD: no; UTS: yes)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 1: LATeah3012L08_796.0_0_0.0_32507847_0 (MD: no; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 2: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3800_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 3: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3552_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 4: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3560_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 5: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3088_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 6: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1752_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 7: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1896_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | | [cpu_sched_debug] final job list:
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 0: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3800_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 1: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3552_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 2: p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3560_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 3: p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3088_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 4: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1752_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 5: p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1896_0 (MD: yes; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 6: LATeah3012L08_796.0_0_0.0_32509764_0 (MD: no; UTS: yes)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] 7: LATeah3012L08_796.0_0_0.0_32507847_0 (MD: no; UTS: no)
5/17/2022 9:39:12 AM | Einstein@Home | [coproc] NVIDIA instance 0; 1.000000 pending for LATeah3012L08_796.0_0_0.0_32509764_0
5/17/2022 9:39:12 AM | Einstein@Home | [coproc] NVIDIA instance 0: confirming 1.000000 instance for LATeah3012L08_796.0_0_0.0_32509764_0
5/17/2022 9:39:12 AM | Einstein@Home | [coproc] Assigning NVIDIA instance 1 to LATeah3012L08_796.0_0_0.0_32507847_0
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3800_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3552_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.44-01.82.S.b4s0g0.00000_3560_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.31-01.59.S.b6s0g0.00000_3088_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1752_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling p2030.20180616.G55.31-01.59.S.b1s0g0.00000_1896_0 (high priority)
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] scheduling LATeah3012L08_796.0_0_0.0_32509764_0
5/17/2022 9:39:12 AM | Einstein@Home | [cpu_sched_debug] skipping GPU job LATeah3012L08_796.0_0_0.0_32507847_0; CPU committed
5/17/2022 9:39:12 AM | | [cpu_sched_debug] enforce_run_list: end
5/17/2022 9:39:14 AM | | choose_project(): 1652805554.692413
5/17/2022 9:39:14 AM | | [work_fetch] ------- start work fetch state -------
5/17/2022 9:39:14 AM | | [work_fetch] target work buffer: 86400.00 + 86400.00 sec
5/17/2022 9:39:14 AM | | [work_fetch] --- project states ---
5/17/2022 9:39:14 AM | Einstein@Home | [work_fetch] REC 607807.647 prio -0.104 can't request work: scheduler RPC backoff (14.56 sec)
5/17/2022 9:39:14 AM | | [work_fetch] --- state for CPU ---
5/17/2022 9:39:14 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 1317305.84 busy 1072572.89
5/17/2022 9:39:14 AM | Einstein@Home | [work_fetch] share 0.000
5/17/2022 9:39:14 AM | | [work_fetch] --- state for NVIDIA GPU ---
5/17/2022 9:39:14 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 278830.81 busy 0.00
5/17/2022 9:39:14 AM | Einstein@Home | [work_fetch] share 0.000
5/17/2022 9:39:14 AM | | [work_fetch] ------- end work fetch state -------
5/17/2022 9:39:14 AM | Einstein@Home | choose_project: scanning
5/17/2022 9:39:14 AM | Einstein@Home | skip: scheduler RPC backoff
5/17/2022 9:39:14 AM | | [work_fetch] No project chosen for work fetch
5/17/2022 9:39:29 AM | | [work_fetch] Request work fetch: Backoff ended for Einstein@Home
5/17/2022 9:39:29 AM | | choose_project(): 1652805569.760385
5/17/2022 9:39:29 AM | | [work_fetch] ------- start work fetch state -------
5/17/2022 9:39:29 AM | | [work_fetch] target work buffer: 86400.00 + 86400.00 sec
5/17/2022 9:39:29 AM | | [work_fetch] --- project states ---
5/17/2022 9:39:29 AM | Einstein@Home | [work_fetch] REC 607807.647 prio -1.104 can request work
5/17/2022 9:39:29 AM | | [work_fetch] --- state for CPU ---
5/17/2022 9:39:29 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 1317242.70 busy 1072565.50
5/17/2022 9:39:29 AM | Einstein@Home | [work_fetch] share 1.000
5/17/2022 9:39:29 AM | | [work_fetch] --- state for NVIDIA GPU ---
5/17/2022 9:39:29 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 278828.81 busy 0.00
5/17/2022 9:39:29 AM | Einstein@Home | [work_fetch] share 1.000
5/17/2022 9:39:29 AM | | [work_fetch] ------- end work fetch state -------

Does anyone have suggestions to fix this? I have tried talking to the Einstein@Home people and didn't get too far with them. Thanks, Bob |
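For reference, the "use all coprocessors" and log flags lines at the top of the event log above correspond to a cc_config.xml roughly like the following. This is a minimal sketch, not necessarily the exact file on the machine; cc_config.xml lives in the BOINC data directory and is re-read via Options > Read config files:

<cc_config>
  <options>
    <!-- let the client use every GPU, not just the "best" one -->
    <use_all_gpus>1</use_all_gpus>
  </options>
  <log_flags>
    <file_xfer>1</file_xfer>
    <task>1</task>
    <!-- the three debug flags enabled in the post above -->
    <coproc_debug>1</coproc_debug>
    <cpu_sched_debug>1</cpu_sched_debug>
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>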
Joined: 25 May 09 Posts: 1295 |
Lines like:

5/17/2022 9:39:12 AM | Einstein@Home | [coproc] NVIDIA instance 0; 1.000000 pending for LATeah3012L08_796.0_0_0.0_32509764_0

I've highlighted the key section of the line. A few things:

Are both GPUs actually working? Use something like GPU-Z to check that.
Is BOINC actually seeing both GPUs? In the BOINC log, when you first start BOINC you should see lines a bit like these:

18/05/2022 07:44:19 | | Starting BOINC client version 7.16.20 for windows_x86_64

Again, the key bits are highlighted.

Reasons for not running???
GPUs not seated properly
Thermal - GPUs do tend to get very hot when doing computational work - you may need to de-dust them.
Power supply not working properly |
Joined: 28 Oct 21 Posts: 7 |
Yes, BOINC sees both GPUs:

5/18/2022 10:38:41 AM | | CUDA: NVIDIA GPU 0: NVIDIA GeForce RTX 2070 (driver version 470.99, CUDA version 11.4, compute capability 7.5, 4096MB, 3968MB available, 7465 GFLOPS peak)
5/18/2022 10:38:41 AM | | CUDA: NVIDIA GPU 1: NVIDIA GeForce RTX 2070 (driver version 470.99, CUDA version 11.4, compute capability 7.5, 4096MB, 3968MB available, 7465 GFLOPS peak)
5/18/2022 10:38:41 AM | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce RTX 2070 (driver version 470.103.01, device version OpenCL 3.0 CUDA, 7981MB, 3968MB available, 7465 GFLOPS peak)
5/18/2022 10:38:41 AM | | OpenCL: NVIDIA GPU 1: NVIDIA GeForce RTX 2070 (driver version 470.103.01, device version OpenCL 3.0 CUDA, 7982MB, 3968MB available, 7465 GFLOPS peak)

Both GPUs are working, as I run the distributed.net OpenCL application, which really pushes the GPUs. |
Joined: 25 May 09 Posts: 1295 |
Something strange: it appears you have two versions of the driver running. In the first pair of lines driver version 470.99 is reported, while in the second pair 470.103.01 is reported. If you look at the lines from my computer you will see that the version is the same for all four lines, 511.65. Having mixed driver versions has given people some problems over the years. It may be that at some time in the past you didn't do a "clean" driver update (or Windows decided that your drivers needed updating and only did half the job). Version 470.103.01 was distributed with one of the toolkits, which aren't really needed for (most) BOINC projects. I'd head over to the Nvidia site, get the current drivers, then do a "clean installation" (this is an option often buried in small text when you start the installation). |
Joined: 24 Dec 19 Posts: 229 |
I would go even further: boot into Safe Mode, DDU the driver install, check the option to prevent Windows from installing its own driver, then boot back into normal mode and install the driver from the latest Nvidia package. This gets rid of every last bit of the previous drivers. Simply doing a clean install with the new driver package doesn't wipe out everything. |
Joined: 28 Oct 21 Posts: 7 |
So the machine in question is running Ubuntu, so I will have to do a clean install of the NVIDIA drivers. However, the other machine (Windows) that I have issues with has the same driver versions, and it appears the task CPU/GPU scheduler is causing the issue of not running the task on the other GPU. I suspect this because of the following (the preference file involved is sketched after this post):
1) If I play around with the computing preferences CPU usage limit (% of CPUs), I can get the other GPU to start processing.
2) If I delete all tasks, I can get the other GPU to run for a couple of days. |
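For context on that "% of CPUs" experiment: the local Computing preferences are written to global_prefs_override.xml in the BOINC data directory, and there are two separate CPU limits that are easy to conflate. A sketch with illustrative values (not taken from the machines in this thread):

<global_preferences>
  <!-- "Use at most N% of the CPUs": caps how many cores the scheduler will commit -->
  <max_ncpus_pct>70.000000</max_ncpus_pct>
  <!-- "Use at most N% of CPU time": throttles running tasks but does not free up cores -->
  <cpu_usage_limit>100.000000</cpu_usage_limit>
</global_preferences>

Only the first of these changes how many CPUs the scheduler thinks it can commit, which is what the "CPU committed" message in the earlier log is about.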
Joined: 14 Aug 19 Posts: 55 |
Older BOINC versions had a problem with Einstein: under some circumstances they would download more work than could be completed by the deadline. This might be your problem; it would explain why things work correctly for a couple of days after you delete tasks and then the problem comes back. I'm certain the excessive-work problem affects the 7.16 series. The easiest thing to do is update BOINC to a newer version and see if that fixes it. Keep in mind that some of those Einstein GPU tasks actually require more than one thread, so your CPU usage settings might not match the workload like you think (see the app_config.xml sketch below). Team USA forum Follow us on Twitter Help us #crunchforcures! |
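On the "more than one thread" point: an app_config.xml in the Einstein@Home project directory lets you declare how much CPU each GPU task actually needs, so the scheduler's CPU budget matches reality. A minimal sketch; the app name below (hsgamma_FGRPB1G, the gamma-ray pulsar GPU search that produces the LATeah... tasks) is an assumption, so verify the real name in client_state.xml or on the project's Applications page:

<app_config>
  <app>
    <!-- assumed app name; verify against client_state.xml -->
    <name>hsgamma_FGRPB1G</name>
    <gpu_versions>
      <!-- run one task per GPU -->
      <gpu_usage>1.0</gpu_usage>
      <!-- budget a full CPU core per GPU task so the client accounts for it up front -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

After saving the file, Options > Read config files (or restarting the client) makes the client pick it up.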
Joined: 28 Oct 21 Posts: 7 |
The Linux client is not running 7.16.20 as reported earlier. It's running:

5/29/2022 10:00:29 PM | | Starting BOINC client version 7.18.1 for x86_64-pc-linux-gnu
5/29/2022 10:00:29 PM | | This a development version of BOINC and may not function properly.

I will dig into the issue of GPU tasks requiring more than one thread, and play around with the CPU settings. Thanks, BoincSpy |