Thread 'Specifications for NVidia RTX 30x0 range?'

Message boards : GPUs : Specifications for NVidia RTX 30x0 range?

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15542
Netherlands
Message 100865 - Posted: 26 Sep 2020, 18:30:46 UTC

Errors may also be caused by cheap components leading to internal data corruption in 3080s made by third-party manufacturers.
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 885
United States
Message 100866 - Posted: 26 Sep 2020, 18:52:00 UTC - in response to Message 100865.  

The errors at GPUGrid have nothing to do with the card hardware. The problem is that the apps don't recognize the new architecture and its SM_8.6 compute capability, which they declare "out of range" when the app is run-time compiled by the nvrtc module in the drivers.
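As a rough sketch (not GPUGrid's actual code), the failure mode is that a run-time-compiled app only knows a fixed list of architectures, so a device reporting a newer compute capability gets rejected. The list below is illustrative of a pre-Ampere toolkit, not an exact inventory:

```python
# Hypothetical model of why runtime compilation fails on a new GPU:
# nvrtc is handed a --gpu-architecture value it has never heard of.

SUPPORTED_ARCHS = {"compute_35", "compute_50", "compute_60",
                   "compute_70", "compute_75"}  # pre-Ampere toolkit (assumed list)

def arch_flag(major: int, minor: int) -> str:
    """Build the --gpu-architecture value nvrtc would be given."""
    arch = f"compute_{major}{minor}"
    if arch not in SUPPORTED_ARCHS:
        # Mirrors the "invalid value for --gpu-architecture (-arch)" failure
        raise ValueError(
            f"nvrtc: error: invalid value for --gpu-architecture (-arch={arch})")
    return f"--gpu-architecture={arch}"

print(arch_flag(7, 5))   # Turing: accepted
# arch_flag(8, 6) raises: the SM_8.6 device is "out of range"
```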
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5121
United Kingdom
Message 100867 - Posted: 26 Sep 2020, 19:11:55 UTC - in response to Message 100866.  

Agreed. They've used an unusual style of CUDA app development that requires explicit advance knowledge of the card's characteristics.

The exact error message (on an A100, a CC 8.0 datacentre GPU) is

# Engine failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 885
United States
Message 100868 - Posted: 26 Sep 2020, 21:54:35 UTC - in response to Message 100867.  
Last modified: 26 Sep 2020, 22:04:18 UTC

Maybe they are waiting for the CUDA 11.1 drivers to become available in the distros and PPAs. From a Phoronix news article:

CUDA 11.1 also brings a new PTX compiler static library, version 7.1 of the Parallel Thread Execution (PTX) ISA, support for Fedora 32 and Debian 10.3, new unified programming models, hardware-accelerated sparse texture support, multi-threaded launch to different CUDA streams, improvements to CUDA Graphs, and various other enhancements. GCC 10.0 and Clang 10.0 are also now supported as host compilers.

That PTX compiler library seems interesting. I wonder if it will allow apps to make use of the dormant FP32 pipeline.

[Edit]From the CUDA 11.1 Ampere docs:
Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.
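The quoted "2x more FP32 operations per cycle per SM" follows directly from the core counts: CC 8.0 (A100) has 64 FP32 cores per SM, CC 8.6 has 128. A back-of-envelope calculation, using published SM counts and boost clocks as assumed inputs:

```python
# Rough theoretical FP32 throughput; SM counts and clocks are taken from
# published specs and used here only for illustration.

def fp32_tflops(sms: int, cores_per_sm: int, boost_ghz: float) -> float:
    # 2 ops per core per cycle: a fused multiply-add counts as two FLOPs
    return sms * cores_per_sm * 2 * boost_ghz / 1000.0

a100    = fp32_tflops(108, 64, 1.41)   # CC 8.0: 64 FP32 cores/SM
rtx3080 = fp32_tflops(68, 128, 1.71)   # CC 8.6: 128 FP32 cores/SM
print(f"A100 ~{a100:.1f} TFLOPS, RTX 3080 ~{rtx3080:.1f} TFLOPS")
```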
ProDigit
Joined: 8 Nov 19
Posts: 718
United States
Message 100885 - Posted: 29 Sep 2020, 5:29:53 UTC - in response to Message 100841.  
Last modified: 29 Sep 2020, 5:31:34 UTC

How do you want to set that experiment up? What parameters are you looking for?
This is just for baseline BOINC users, not fancy optimisers. Ideally a single 30x0 card, in a host with plenty of power and cooling (so nothing gets throttled). Run a known - preferably CUDA - app for long enough to get a good idea of performance. Slap in an app_config.xml file with <gpu_usage>.5</gpu_usage>, and record what happens.
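For reference, an app_config.xml along the lines described might look like this; the <name> value is a placeholder and must match the project's actual application name:

```xml
<!-- Sketch of the file described above: run two tasks per GPU.
     Place it in the project's directory under the BOINC data folder.
     "appname" is a placeholder, not a real application name. -->
<app_config>
  <app>
    <name>appname</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```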

Ok, I will ask Till to run his RTX 3080 at Primegrid with an app_config with 0.5 gpu usage. That is a CUDA application.

On my 2080Ti I run some Einstein@home WUs at 0.333
I can imagine that the 3080 would be able to run them at 0.25
However, doing so, I'd be interested in what the minimum PCIe bandwidth should be.
PCIe 3.0 x8 'should' be fine, but it would be worth someone testing x16 vs x8 vs x4 on those cards...
Ian&Steve C.
Joined: 24 Dec 19
Posts: 229
United States
Message 102139 - Posted: 14 Dec 2020, 4:49:41 UTC
Last modified: 14 Dec 2020, 4:51:11 UTC

As an added data point: peak_flops for my 3070 gets detected in BOINC as roughly 10 TFLOPS, when it's really about 20 TFLOPS.

It's definitely under-reporting by half, due to the change in cores/SM.
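A plausible sketch of the halving (an assumption about BOINC's internals, not its actual code): the peak_flops estimate is driven by a cores-per-SM table keyed on compute capability, and a table that predates CC 8.6 falls back to the 64 cores/SM figure instead of Ampere's 128:

```python
# Illustrative model of the under-reporting. The table and fallback are
# assumptions; the RTX 3070 figures (46 SMs, ~1.73 GHz boost) are from
# published specs.

CORES_PER_SM = {(7, 0): 64, (7, 5): 64, (8, 0): 64}  # stale table, no (8, 6) entry

def peak_gflops(cc: tuple, sms: int, clock_ghz: float) -> float:
    cores = CORES_PER_SM.get(cc, 64)     # unknown CC falls back to 64
    return sms * cores * 2 * clock_ghz   # 2 FLOPs/core/cycle (FMA)

reported = peak_gflops((8, 6), 46, 1.73)  # ~10.2 TFLOPS, as BOINC shows
actual   = 46 * 128 * 2 * 1.73            # ~20.4 TFLOPS with 128 cores/SM
print(reported, actual)
```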
salvador77
Joined: 31 Jan 22
Posts: 1
Message 106921 - Posted: 31 Jan 2022, 4:02:19 UTC - in response to Message 100834.  

Hi!

I'm not an expert on BOINC project configuration, nor on the architecture. However, I have both a 3080 and a Titan V, and the 3080 has almost the same efficiency as the Titan V...

And BOINC doesn't report differences between the different coprocessors; it just says two 3080s.

https://einsteinathome.org/es/host/12916614

Of course, I have several questions about a more efficient configuration.

Regards.
Ian&Steve C.
Joined: 24 Dec 19
Posts: 229
United States
Message 106929 - Posted: 31 Jan 2022, 15:43:32 UTC - in response to Message 106921.  

Hi!

I'm not an expert on BOINC project configuration, nor on the architecture. However, I have both a 3080 and a Titan V, and the 3080 has almost the same efficiency as the Titan V...

And BOINC doesn't report differences between the different coprocessors; it just says two 3080s.

https://einsteinathome.org/es/host/12916614

Of course, I have several questions about a more efficient configuration.

Regards.


Yes, the Titan V and Ampere (GDDR6X versions) have about the same efficiency (performance per watt) on Einstein. This is mainly down to memory performance: the Titan V has HBM2 memory with very low latency, while the GDDR6X cards get close to the same performance through raw speed (19+ Gbps). I think the Titan V is a little more power efficient, but the faster 3080 Ti and 3090 models are more productive overall, at the cost of more power.

About the host reporting two 3080s: this is an idiosyncrasy in how BOINC works. It decides which is the "best" GPU you have, and for Nvidia cards the strongest determinant of "best" is the compute capability. The Ampere cards have CC 8.6 and the Volta card has CC 7.0, so the system chooses the Ampere card to display and appends that you have 2 Nvidia GPUs in total.
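The described behaviour can be sketched like this; ranking purely by compute capability is the assumption here, since the real heuristic also weighs other factors like memory and estimated speed:

```python
# Toy model of BOINC showing one "best" GPU plus a count for a mixed host.

def describe_gpus(gpus: list) -> str:
    # Higher compute capability wins: (8, 6) Ampere beats (7, 0) Volta
    best = max(gpus, key=lambda g: g["cc"])
    return f"[{len(gpus)}] NVIDIA {best['name']}"

host = [{"name": "GeForce RTX 3080", "cc": (8, 6)},
        {"name": "TITAN V",          "cc": (7, 0)}]
print(describe_gpus(host))   # the Titan V disappears behind the 3080 entry
```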

With the current LAT3000 series tasks being distributed, you’ll probably find max production by running 3-4 tasks at a time on each GPU. However, if the project goes back to crunching LAT4000 series tasks, 1x task per GPU will be best. Just watch what tasks are being distributed and adapt as necessary.
ProDigit
Joined: 8 Nov 19
Posts: 718
United States
Message 107556 - Posted: 23 Mar 2022, 3:58:25 UTC - in response to Message 100797.  
Last modified: 23 Mar 2022, 3:59:59 UTC

...the GA102 (and above, but not the A100) benefit from both an increase from 64 to 128 cores per SM, and the ability to process two FP32 streams concurrently.

It was my understanding that the RTX 3060 has cuda cores (shaders) operating either at 32bit INT, OR 32bit FPP.
It's not 2 streams concurrently. It's either/or.
32bit INT works faster if you don't need to have as precise numbers. 32bit FPP is more precise, and a bit slower (and less hardware supports it).

It's funny, because it reminds me of my audio modelling days.
I'd run samples and effects, and there was one effect that used 32-bit INT reverb; it sounded a bit more metallic.
Meanwhile, 32-bit float sounded like a 'perfect' reverb.
So the human ear was able to differentiate between the two, much like the eye can see the difference between 32bit INT (255 values of RGB), and 32bit float (255 values of RGB, and 64 bit per pixel shaded).
It's probably close to the ear and eye's maximum perceivable range of colors and sounds; which is why I never got why some digital stomp box pedals were sold with 24 bit reverbs. They sound like trash.
Things like 3D polygons run just fine on INT. Not sure if 32bit Float would work in a 3D game environment..
Anyway, but that's off topic.
Ian&Steve C.
Joined: 24 Dec 19
Posts: 229
United States
Message 107561 - Posted: 23 Mar 2022, 12:37:33 UTC - in response to Message 107556.  
Last modified: 23 Mar 2022, 12:45:16 UTC

...the GA102 (and above, but not the A100) benefit from both an increase from 64 to 128 cores per SM, and the ability to process two FP32 streams concurrently.

It was my understanding that the RTX 3060 has cuda cores (shaders) operating either at 32bit INT, OR 32bit FPP.
It's not 2 streams concurrently. It's either/or.
32bit INT works faster if you don't need to have as precise numbers. 32bit FPP is more precise, and a bit slower (and less hardware supports it).


This is an incorrect understanding.

Both Turing and Ampere have concurrent FP32/INT processing.

Page 11 of the Turing whitepaper: (source)

Turing implements a major revamping of the core execution datapaths. Modern shader
workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler
instructions such as integer adds for addressing and fetching data, floating point compare or
min/max for processing results, etc. In previous shader architectures, the floating-point math
datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second
parallel execution unit next to every CUDA core that executes these instructions in parallel with
floating point math.


Ampere added onto this by making that second datapath FP32-capable as well. It's two streams concurrently: one is FP32, the other is either FP32 or INT32.
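The two datapath layouts can be summarised as a per-SM issue model; the per-cycle lane counts below follow the whitepaper descriptions, and the either/or choice on Ampere's second path is modelled per cycle:

```python
# Per-SM FP32 lanes available each cycle under the two designs described above.
# Turing: path A = 64 FP32, path B = 64 INT32 only.
# Ampere: path A = 64 FP32, path B = 64 lanes of FP32 *or* INT32.

def fp32_lanes_per_sm(arch: str, second_path_on_int: bool) -> int:
    if arch == "turing":
        return 64                               # second path can't do FP32
    if arch == "ampere":
        return 64 if second_path_on_int else 128
    raise ValueError(f"unknown arch: {arch}")

# Pure-FP32 workload: Ampere doubles throughput; mixed FP/INT: they match.
print(fp32_lanes_per_sm("turing", False),  # 64
      fp32_lanes_per_sm("ampere", False),  # 128
      fp32_lanes_per_sm("ampere", True))   # 64
```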


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.