The shader core in AMD’s Radeon HD 2900 XT is very similar to the Xenos chip’s shader core, in that it can process 5D instructions. However, unlike the Xenos (which combines a vec4 instruction – a vector instruction with four components – with a scalar instruction), R600 uses five-way superscalar shader processors.
These are arranged in clusters of 16 shaders, or 80 stream processing units counting the ALUs individually. Each shader unit can co-issue up to five FP MAD (multiply-add) instructions per clock with 32-bit floating point precision. In addition to the five scalar stream processing units inside each shader unit, there is also a branch execution unit that handles flow control and conditional operations. AMD says that this helps to “practically eliminate flow control performance overhead.”
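To put those figures together, here is a minimal sketch of the arithmetic. The count of four shader clusters is taken from the texture unit description elsewhere in this article; the constant and function names are our own, and counting one MAD as two floating point operations is the usual convention rather than anything AMD has specified here.

```python
# Back-of-the-envelope tally of R600's stream processing units.
# Names are illustrative; the figures come from the article text.
SHADER_CLUSTERS = 4        # the article refers to "four shader clusters"
SHADERS_PER_CLUSTER = 16   # shader units in each cluster
ALUS_PER_SHADER = 5        # five-way superscalar shader unit


def stream_processors_per_cluster() -> int:
    """Scalar ALUs in one cluster."""
    return SHADERS_PER_CLUSTER * ALUS_PER_SHADER


def total_stream_processors() -> int:
    """Scalar ALUs across the whole chip."""
    return SHADER_CLUSTERS * stream_processors_per_cluster()


def peak_flops_per_clock() -> int:
    """Peak operations per clock, counting a MAD as multiply + add."""
    return total_stream_processors() * 2


print(stream_processors_per_cluster())  # 80
print(total_stream_processors())        # 320
print(peak_flops_per_clock())           # 640
```

Multiplying the last figure by the core clock gives a theoretical peak rate; the clock is left as something to plug in rather than assumed here.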
Each shader unit also has its own array of dedicated general purpose registers that store input data, temporary values and output data. In addition, there is a single 64KB memory read/write cache that can be accessed by any of the shader clusters. Data inside this cache can be exported directly into the stream out buffer – a new feature in DirectX 10 that allows developers to write directly from the shader to memory without having to go out through the render backend.
R600’s stream processor configuration is quite different to the way Nvidia arranged the shader units in its G80 graphics processing unit, where there are 16 fully decoupled and fully generalised scalar units per cluster. Each G80 cluster has its own dedicated L1 cache, while each ROP partition has its own L2 cache that can be shared with any of the stream processor clusters via the crossbar.
R600 features four texture units that operate independently of the four shader clusters. The units can each do eight texture addresses per clock, of which four are used for unfiltered lookups and the other four are used for bilinear lookups. Also, each texture unit can fetch 20 FP32 textures for bilinear filtering and point sampling, while also being able to apply bilinear filtering to four FP16 textures every clock cycle. AMD says that bilinear filtering of FP32 textures is done at half speed. The units also support trilinear and anisotropic filtering on all formats.
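Scaled up to the whole chip, the per-unit figures work out as in the rough sketch below. The names are our own, and we are assuming that "half speed" means half the FP16 bilinear rate, which is our reading rather than something AMD spelled out.

```python
# Chip-wide texture rates per clock, derived from the per-unit figures above.
TEXTURE_UNITS = 4
ADDRESSES_PER_UNIT = 8        # four unfiltered + four bilinear lookups
FP16_BILINEAR_PER_UNIT = 4    # FP16 bilinear-filtered texels per clock
FP32_SLOWDOWN = 2             # assumption: "half speed" relative to FP16


def texture_rates_per_clock() -> dict:
    """Return chip-wide per-clock texture rates as a dictionary."""
    fp16 = TEXTURE_UNITS * FP16_BILINEAR_PER_UNIT
    return {
        "addresses": TEXTURE_UNITS * ADDRESSES_PER_UNIT,
        "fp16_bilinear": fp16,
        "fp32_bilinear": fp16 // FP32_SLOWDOWN,
    }


print(texture_rates_per_clock())
# {'addresses': 32, 'fp16_bilinear': 16, 'fp32_bilinear': 8}
```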
In comparison, R580 had to use its pixel shader units for FP16 and FP32 texture formats because its texture units didn’t fully support the formats in dedicated hardware. Taking this into account, AMD believes that R600 can filter FP16 textures – which are often used when HDR rendering is implemented – around seven times faster than its previous flagship, the R580-based Radeon X1950 XTX.
Each unit has access to both the 32KB vertex and L1 texture cache, which is said to improve throughput on unfiltered texture reads. AMD has also implemented a shared 256KB L2 texture cache – a first for ATI graphics products. It allows the chip to store very large textures and pixels (that are too large for the L1 cache) locally in order to save bandwidth.
Render Backends (ROPs) and Z:
In total, there are four ROP partitions in R600 that can each output four pixels per clock with colour and Z processing. With Z-only tests, R600's ROP partitions operate at double-speed, regardless of whether anti-aliasing is enabled or not. Speaking of Z, AMD has made lots of improvements to its depth, stencil and compression techniques.
Depth (or Z, as it’s more commonly known) and stencil compression has been increased to 16:1 in standard mode, compared to 8:1 in R580, and that scales to 128:1 with 8xMSAA. AMD has achieved this by compressing Z and stencil values separately, which is said to improve efficiency.
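The jump from 16:1 to 128:1 is consistent with the peak ratio multiplying by the number of samples per pixel. The sketch below assumes exactly that linear scaling; AMD only quoted the two end points, so anything in between is our extrapolation, and the names are our own.

```python
# Peak Z/stencil compression ratio as a function of MSAA sample count.
# Assumption: the ratio scales linearly with samples per pixel, which
# matches the two figures quoted (16:1 without MSAA, 128:1 at 8xMSAA).
BASE_RATIO_R600 = 16


def peak_compression_ratio(samples_per_pixel: int, base_ratio: int = BASE_RATIO_R600) -> int:
    """Return N for a peak compression ratio of N:1."""
    if samples_per_pixel < 1:
        raise ValueError("samples_per_pixel must be at least 1")
    return base_ratio * samples_per_pixel


print(peak_compression_ratio(1))  # 16
print(peak_compression_ratio(8))  # 128
```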
In the previous generation, Z and stencil compression was limited to five megapixels. AMD has removed this limit (which can cause a performance drop off at higher resolutions) by caching compression information in on-chip cache, and storing it in graphics memory.
AMD has further optimised its Z-buffer with the introduction of Re-Z, which basically checks the Z-buffer twice – once before the pixel shader and then again after the pixel has passed through the pipeline. This is similar to Nvidia’s early-Z implementation on G80, which tests and culls pixels before they enter the pixel shader. The Hierarchical Z-buffer has also been improved with the introduction of hierarchical stencil, which helps to improve stencil shadow performance.
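The double check can be pictured with a toy model like the one below. This is purely an illustration of the idea as described, not AMD's hardware logic: the function name, the "smaller depth is closer" convention and the notion of a shader that returns a possibly modified depth are all assumptions on our part.

```python
from typing import Callable, List, Optional, Tuple

# A hypothetical pixel shader: takes the incoming depth and returns
# (colour, output depth). The output depth may differ from the input.
Shader = Callable[[float], Tuple[str, float]]


def shade_with_re_z(depth_buffer: List[float], index: int,
                    frag_depth: float, shader: Shader) -> Optional[str]:
    """Test depth before and after shading; return the colour if visible."""
    # First check: reject hidden fragments before paying for the shader.
    if frag_depth >= depth_buffer[index]:
        return None
    colour, out_depth = shader(frag_depth)
    # Second check: the shader may have moved the fragment behind
    # what is already stored, so test again before writing.
    if out_depth >= depth_buffer[index]:
        return None
    depth_buffer[index] = out_depth
    return colour


buffer = [0.5]
passthrough = lambda d: ("red", d)
push_back = lambda d: ("blue", d + 0.4)

print(shade_with_re_z(buffer, 0, 0.7, passthrough))  # None (fails first check)
print(shade_with_re_z(buffer, 0, 0.3, push_back))    # None (fails second check)
print(shade_with_re_z(buffer, 0, 0.3, passthrough))  # red
```

The first rejection saves the shader work entirely, which is where the performance benefit would come from in this simplified picture.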
The ROPs also include improved performance in post processing effects – the effect that AMD specifically mentioned was render-to-texture.
The new ROP hardware also includes up to eight multiple render targets, with support for all of the common anti-aliasing formats. While on the subject of anti-aliasing, AMD has introduced a new 8x multisample anti-aliasing mode, which is done using AMD’s programmable sample grid. However, in a rather bizarre move, AMD has decided to drop the 6xMSAA mode in favour of the new 8xMSAA mode. Ideally, we would have liked to see both modes left intact, especially given that they’re programmable grid patterns.
Obviously, there is still support for both adaptive multisampling and adaptive supersampling for alpha tested textures and there’s still support for SuperAA in CrossFire (although it’s limited to only one pattern at the moment – 16x). As was the case with ATI's X1000-series, all of the anti-aliasing modes support simultaneous use of FP16 or FP32 render targets, meaning that the ROP hardware can do up to 128-bit HDR with anti-aliasing enabled.
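The "128-bit" figure follows from the per-channel precision, as this small sketch shows. It assumes the usual four-channel (red, green, blue, alpha) render target; the names are our own.

```python
# Bits per pixel for floating point render targets.
CHANNELS = 4  # assumption: red, green, blue and alpha


def bits_per_pixel(bits_per_channel: int) -> int:
    """Total bits stored per pixel for a four-channel render target."""
    return CHANNELS * bits_per_channel


print(bits_per_pixel(16))  # 64  (FP16 render target)
print(bits_per_pixel(32))  # 128 (FP32 render target)
```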
AMD has also introduced a new anti-aliasing mode, known as Custom Filter anti-aliasing, which is targeted at Nvidia’s Coverage Sampling anti-aliasing technique that it introduced with its GeForce 8800-series. It makes use of non-box filters and uses AMD’s programmable resolve stage in its new ROP hardware – I guess the fact it’s programmable means that we’re likely to see more filters, or improvements to the current filters, developed over time.
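To illustrate what a non-box filter means in practice, the sketch below contrasts a plain box resolve with a tent-weighted one, a tent being one common non-box shape. The sample positions, the radii and all of the names are hypothetical; AMD has not described its actual filter kernels to us in this detail.

```python
from typing import Callable, List, Tuple

# (distance from the pixel centre in pixels, sample value)
Sample = Tuple[float, float]


def box_weight(distance: float, radius: float = 0.5) -> float:
    """Box filter: every sample inside the pixel counts equally."""
    return 1.0 if distance <= radius else 0.0


def tent_weight(distance: float, radius: float) -> float:
    """Tent filter: weight falls off linearly to zero at the radius."""
    return max(0.0, 1.0 - distance / radius)


def resolve(samples: List[Sample], weight_fn: Callable[[float], float]) -> float:
    """Weighted average of the samples, normalised by the total weight."""
    weights = [weight_fn(distance) for distance, _ in samples]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * value for w, (_, value) in zip(weights, samples)) / total


# Two samples inside the pixel (value 1.0), two from its neighbours (0.0).
samples = [(0.0, 1.0), (0.25, 1.0), (0.75, 0.0), (1.0, 0.0)]

box_result = resolve(samples, box_weight)
tent_result = resolve(samples, lambda d: tent_weight(d, radius=1.5))
print(box_result, tent_result)
```

With a radius wider than the pixel, the tent filter pulls in neighbouring samples at reduced weight, so the edge is smoothed further at the cost of some sharpness.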
In the driver we’ve used for our testing here, there are currently two additional filters: narrow tent and wide tent. We’ve also been told that there is an edge detect filter on the way, too. We’ve got an alpha driver that uses the new filter, but we haven’t had the time to do any testing with it yet. We’ll be having a closer look at AMD’s ATI Radeon HD 2000-series image quality in a follow-up article later this week.