Thread 'Specifications for NVidia RTX 30x0 range?'

Author	Message
Jord Volunteer tester Help desk expert Send message Joined: 29 Aug 05 Posts: 15552	Message 100865 - Posted: 26 Sep 2020, 18:30:46 UTC Errors may also be due to cheap components causing internal corruption in the 3080s made by third-party manufacturers. ID: 100865 ·

Keith Myers Volunteer tester Help desk expert Send message Joined: 17 Nov 16 Posts: 888	Message 100866 - Posted: 26 Sep 2020, 18:52:00 UTC - in response to Message 100865. The error at GPUGrid has nothing to do with card hardware. The problem is that the apps don't understand the new architecture and its Compute Capability of SM_8.6, which the apps proclaim "out of range" when the app is compiled at run time by the nvrtc module in the drivers. ID: 100866 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5124	Message 100867 - Posted: 26 Sep 2020, 19:11:55 UTC - in response to Message 100866. Agreed. They've used a funny sort of CUDA app development which requires explicit prior knowledge of the card characteristics. The exact error message (on an A100, a cc8.0 datacentre GPU) is: # Engine failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch) ID: 100867 ·
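One way an app could avoid the "invalid value for --gpu-architecture" failure is sketched below, assuming the app builds the NVRTC option string itself. The function name and the list of supported targets are hypothetical; GPUGrid's actual code does not appear in this thread. The idea is to fall back to the newest virtual architecture the bundled NVRTC knows about, and let the driver JIT-compile the resulting PTX for the newer card.

```python
from typing import Iterable, Tuple

ComputeCapability = Tuple[int, int]  # (major, minor), e.g. (8, 6)


def nvrtc_arch_option(device_cc: ComputeCapability,
                      supported_ccs: Iterable[ComputeCapability]) -> str:
    """Pick an NVRTC --gpu-architecture value the compiler will accept.

    supported_ccs is a hypothetical list of the targets known to the
    NVRTC version shipped with the app. We choose the newest one that
    does not exceed the device's compute capability.
    """
    usable = [cc for cc in supported_ccs if cc <= device_cc]
    if not usable:
        raise ValueError("device is older than every supported target")
    major, minor = max(usable)
    # compute_XY is a virtual (PTX) target, so the driver can JIT it
    # for a newer GPU than the compiler itself knows about.
    return f"--gpu-architecture=compute_{major}{minor}"
```

With a pre-Ampere compiler that only knows targets up to 7.5, an 8.6 card would then be compiled as compute_75 rather than rejected, at the cost of not using any Ampere-specific features.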

Keith Myers Volunteer tester Help desk expert Send message Joined: 17 Nov 16 Posts: 888	Message 100868 - Posted: 26 Sep 2020, 21:54:35 UTC - in response to Message 100867. Last modified: 26 Sep 2020, 22:04:18 UTC Maybe they are waiting on the CUDA 11.1 drivers to be made available in the distros and PPAs. From a Phoronix news article: "CUDA 11.1 also brings a new PTX compiler static library, version 7.1 of the Parallel Thread Execution (PTX) ISA, support for Fedora 32 and Debian 10.3, new unified programming models, hardware-accelerated sparse texture support, multi-threaded launch to different CUDA streams, improvements to CUDA Graphs, and various other enhancements. GCC 10.0 and Clang 10.0 are also now supported as host compilers." That PTX module seems interesting. I wonder if that will allow apps to make use of the dormant FP32 pipeline. [Edit] From the CUDA 11.1 Ampere docs: "Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput." ID: 100868 ·
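The "compiled for 8.0 will run as is on 8.6" statement follows the general CUDA binary compatibility rule: a binary built for compute capability X.y runs on devices of capability X.z where z is at least y. A minimal sketch of that rule (the function name is made up for illustration, and PTX forward compatibility is not modelled):

```python
from typing import Tuple


def cubin_runs_on(built_for: Tuple[int, int], device: Tuple[int, int]) -> bool:
    """True if a binary built for `built_for` can execute on `device`.

    Binary compatibility holds only within one major revision, and only
    from a lower minor revision to an equal or higher one.
    """
    same_major = built_for[0] == device[0]
    minor_ok = device[1] >= built_for[1]
    return same_major and minor_ok
```

So an 8.0 binary runs on an 8.6 card, but it was generated without knowledge of the doubled FP32 datapath, which is why the docs recommend compiling for 8.6 explicitly.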

ProDigit Send message Joined: 8 Nov 19 Posts: 718	Message 100885 - Posted: 29 Sep 2020, 5:29:53 UTC - in response to Message 100841. Last modified: 29 Sep 2020, 5:31:34 UTC "How do you want to set that experiment up? What parameters are you looking for? This is just for baseline BOINC users, not fancy optimisers. Ideally a single 30x0 card, in a host with plenty of power and cooling (so nothing gets throttled). Run a known - preferably CUDA - app for long enough to get a good idea of performance. Slap in an app_config.xml file with <gpu_usage>.5</gpu_usage>, and record what happens." OK, I will ask Till to run his RTX 3080 at PrimeGrid with an app_config with 0.5 gpu usage. That is a CUDA application. On my 2080Ti I run some Einstein@home WUs at 0.333; I can imagine that the 3080 would be able to run them at 0.25. However, when doing so, I'd be interested in what the minimum PCIe bandwidth should be. PCIe 3.0 x8 'should' be fine, but I'd be interested in someone testing x16 vs x8 vs x4 on those cards... ID: 100885 ·
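For anyone wanting to repeat the experiment, here is a rough sketch that writes out an app_config.xml with a given gpu_usage. The app name "example_app" is a placeholder (the real value must match the project's application name), and the cpu_usage default is an assumption, not something specified in this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float = 1.0) -> str:
    """Return the text of a BOINC app_config.xml for one GPU application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


def tasks_per_gpu(gpu_usage: float) -> int:
    """How many tasks fit on one GPU if each claims `gpu_usage` of it."""
    return int(1 / gpu_usage + 1e-9)
```

With this, gpu_usage values of 0.5, 0.333 and 0.25 correspond to 2, 3 and 4 concurrent tasks per card, matching the settings mentioned above.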

Ian&Steve C. Send message Joined: 24 Dec 19 Posts: 229	Message 102139 - Posted: 14 Dec 2020, 4:49:41 UTC Last modified: 14 Dec 2020, 4:51:11 UTC As an added data point: peak_flops for my 3070 gets detected in BOINC as roughly 10 TFLOPS, when it's really about 20 TFLOPS. It's definitely under-reporting by half, due to the change in cores per SM. ID: 102139 ·
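The factor of two can be reproduced with the usual peak-throughput arithmetic, sketched below. This assumes peak FP32 is estimated as SMs x cores per SM x 2 (one fused multiply-add per core per clock) x clock; the SM count (46) and boost clock (1.725 GHz) are the published RTX 3070 figures, used here as assumptions, and BOINC's actual detection code is not shown.

```python
def peak_fp32_flops(sm_count: int, cores_per_sm: int, clock_hz: float) -> float:
    """Theoretical peak FP32 rate: each core retires one FMA (2 FLOPs) per clock."""
    return sm_count * cores_per_sm * 2 * clock_hz


# Assumed RTX 3070 figures: 46 SMs at a 1.725 GHz boost clock.
RTX_3070_SMS = 46
RTX_3070_CLOCK_HZ = 1.725e9

# A client that still assumes 64 cores per SM gets half the real figure.
assumed_old = peak_fp32_flops(RTX_3070_SMS, 64, RTX_3070_CLOCK_HZ)
actual = peak_fp32_flops(RTX_3070_SMS, 128, RTX_3070_CLOCK_HZ)
```

Here `assumed_old` comes out at roughly 10 TFLOPS and `actual` at roughly 20 TFLOPS, which is the discrepancy described above.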

salvador77 Send message Joined: 31 Jan 22 Posts: 1	Message 106921 - Posted: 31 Jan 2022, 4:02:19 UTC - in response to Message 100834. Hi! I'm not an expert on the configuration of BOINC projects, nor on the architecture. However, I have both a 3080 and a Titan Volta, and the 3080 has almost the same efficiency as the Titan Volta... and BOINC doesn't report the difference between the "coprocessors"; it just says two 3080s. https://einsteinathome.org/es/host/12916614 Of course, I have several questions about a more efficient configuration. Regards. ID: 106921 ·

Ian&Steve C. Send message Joined: 24 Dec 19 Posts: 229	Message 106929 - Posted: 31 Jan 2022, 15:43:32 UTC - in response to Message 106921. "Hi! I'm not an expert on the configuration of BOINC projects, nor on the architecture. However, I have both a 3080 and a Titan Volta, and the 3080 has almost the same efficiency as the Titan Volta... and BOINC doesn't report the difference between the 'coprocessors'; it just says two 3080s. https://einsteinathome.org/es/host/12916614 Of course, I have several questions about a more efficient configuration. Regards." Yes, the Titan V and Ampere (GDDR6X versions) have about the same efficiency (performance per watt) on Einstein. This is mainly due to memory performance. The Titan V has HBM2 memory, which has very low latency, and the GDDR6X cards can achieve close to the same performance due to raw speed (19+ Gbps). I think the Titan V is a little more power efficient, while the faster 3080Ti and 3090 models are faster and more productive overall, but use more power. About the host reporting two 3080s: this is an idiosyncrasy in how BOINC works. It will decide the “best” GPU you have, and for Nvidia cards, the strongest determinant of “best” is the compute capability. The Ampere cards have CC 8.6 and the Volta card has CC 7.0. So the system chooses the Ampere card to display, and appends that you have 2 Nvidia GPUs in total. With the current LAT3000 series tasks being distributed, you'll probably find max production by running 3-4 tasks at a time on each GPU. However, if the project goes back to crunching LAT4000 series tasks, 1x task per GPU will be best. Just watch what tasks are being distributed and adapt as necessary. ID: 106929 ·
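The "two 3080s" display can be illustrated with a simplified sketch of the selection described above. It assumes compute capability is the only criterion; the class, function name and output format are invented for illustration and are not BOINC's real code, which considers further tie-breakers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gpu:
    name: str
    compute_capability: Tuple[int, int]  # (major, minor)


def describe_nvidia_gpus(gpus: List[Gpu]) -> str:
    """Report only the 'best' GPU, prefixed with the total GPU count."""
    best = max(gpus, key=lambda gpu: gpu.compute_capability)
    return f"[{len(gpus)}] {best.name}"


host = [
    Gpu("NVIDIA TITAN V", (7, 0)),
    Gpu("NVIDIA GeForce RTX 3080", (8, 6)),
]
summary = describe_nvidia_gpus(host)
```

For this host the summary names only the RTX 3080 with a count of 2, so the Titan V never appears in the listing even though it is in use.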

ProDigit Send message Joined: 8 Nov 19 Posts: 718	Message 107556 - Posted: 23 Mar 2022, 3:58:25 UTC - in response to Message 100797. Last modified: 23 Mar 2022, 3:59:59 UTC "...the GA102 (and above, but not the A100) benefit from both an increase from 64 to 128 cores per SM, and the ability to process two FP32 streams concurrently." It was my understanding that the RTX 3060 has CUDA cores (shaders) operating either at 32bit INT, OR 32bit FPP. It's not 2 streams concurrently. It's either/or. 32bit INT works faster if you don't need to have as precise numbers. 32bit FPP is more precise, and a bit slower (and less hardware supports it). It's funny, because it reminds me of my audio modelling days. I'd run samples and effects, and there was this effect that used 32 bit INT reverb, and the reverb sounded a bit more metallic. Meanwhile 32 bit float sounded like a 'perfect' reverb. So the human ear was able to differentiate between the two, much like the eye can see the difference between 32bit INT (255 values of RGB), and 32bit float (255 values of RGB, and 64 bit per pixel shaded). It's probably close to the ear and eye's maximum perceivable range of colors and sounds, which is why I never got why some digital stomp box pedals were sold with 24 bit reverbs. They sound like trash. Things like 3D polygons run just fine on INT. Not sure if 32bit float would work in a 3D game environment. Anyway, that's off topic. ID: 107556 ·

Ian&Steve C. Send message Joined: 24 Dec 19 Posts: 229	Message 107561 - Posted: 23 Mar 2022, 12:37:33 UTC - in response to Message 107556. Last modified: 23 Mar 2022, 12:45:16 UTC "...the GA102 (and above, but not the A100) benefit from both an increase from 64 to 128 cores per SM, and the ability to process two FP32 streams concurrently." "It was my understanding that the RTX 3060 has cuda cores (shaders) operating either at 32bit INT, OR 32bit FPP. It's not 2 streams concurrently. It's either/or. 32bit INT works faster if you don't need to have as precise numbers. 32bit FPP is more precise, and a bit slower (and less hardware supports it)." This is an incorrect understanding. Both Turing and Ampere have concurrent FP32/INT processing. Page 11 of the Turing whitepaper: "Turing implements a major revamping of the core execution datapaths. Modern shader workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler instructions such as integer adds for addressing and fetching data, floating point compare or min/max for processing results, etc. In previous shader architectures, the floating-point math datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second parallel execution unit next to every CUDA core that executes these instructions in parallel with floating point math." Ampere added to this by making that second datapath FP32-capable as well. It's two streams concurrently: one is FP32, the other is either FP32 or INT32. ID: 107561 ·
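The difference between the two designs can be shown with a toy throughput model. It is an idealised sketch assuming 64 lanes per datapath per SM and perfect scheduling; it ignores memory, dependencies and everything else a real GPU has to deal with, and the function names are made up.

```python
def cycles_turing(fp32_ops: int, int32_ops: int, lanes: int = 64) -> float:
    """Turing: one FP32-only datapath plus one INT32-only datapath."""
    return max(fp32_ops, int32_ops) / lanes


def cycles_ampere(fp32_ops: int, int32_ops: int, lanes: int = 64) -> float:
    """Ampere (GA10x): one FP32-only datapath plus one FP32-or-INT32 datapath.

    Integer work must use the shared datapath; FP32 work can be split
    across both, so the total is bounded by whichever limit is larger.
    """
    int_bound = int32_ops / lanes
    total_bound = (fp32_ops + int32_ops) / (2 * lanes)
    return max(int_bound, total_bound)
```

For pure FP32 work (128 operations, no integer work) the model gives 2 cycles on Turing and 1 on Ampere, while an even 64/64 mix takes 1 cycle on both. That is why the doubled FP32 rate only shows up fully in workloads with little integer work.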

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.