> Generally speaking they outperform GPUs by an order of magnitude in FLOPS/W*s [2]

The paper you link to is measuring MSPS/W (mega-samples per second per watt), and the algorithm they study relies on fixed point. It uses the FPGA's built-in DSP blocks, which are integer-only. There is no floating point involved, so it is incorrect to say this shows FPGAs give better FLOPS/W. It isn't all that surprising that the FPGAs do better here: GPUs are built around floating point, which isn't being used.

Their GPU implementations use floating point as well as int and short. The efficiency barely differs between them, showing that this particular GPU wasn't designed with integer power efficiency in mind (which an FPGA implementation relying on DSP48s very much is).



Altera is claiming they will have >10 TFLOPS next year. They designed floating point DSP blocks in the Arria 10 and Stratix 10 (due out 2016Q1).

https://www.altera.com/content/dam/altera-www/global/en_US/p...


It would be interesting to see the same experiment repeated with an Nvidia Tesla and an Intel Xeon Phi. They used AMD GPUs not targeted at HPC, so it's unsurprising the integer path is not power-efficient (desktop/mobile graphics is all floating point).


Repeating the experiment with a Tesla or Xeon Phi will show you the same thing: that GPUs are less efficient than FPGAs on this load. Their inferior efficiency has nothing to do with whether the polyphase channelization load is integer or floating point. A GPU consists of hundreds or thousands of microprocessors with a traditional architecture: instruction decoding block, execution engines, registers, etc. Decoding and executing instructions is inherently less power-efficient than having this logic hard-wired, as it can be in an FPGA.


> A GPU consists of hundreds or thousands of microprocessors that have a traditional architecture: instruction decoding block, execution engines, registers, etc.

Any example of a GPU with "hundreds or thousands of microprocessors"? Nvidia Titan X has 12 [1] microprocessors by your definition.

[1]: SM, Streaming Multiprocessor in Nvidia's terminology. Smallest unit that can branch, decode instructions, etc.


I am well aware of the technical details and that I used a liberal definition of "microprocessor". My wording was vague on purpose (I didn't want to delve into the details). I didn't mean to imply that each "microprocessor" has its own instruction decoding block (they don't).

An AMD Radeon R9 290X has 2816 stream processors (44 compute units of 64 stream processors) per their terminology. There is only 1 instruction decoder per compute unit, so a stream processor cannot completely branch off independently, but it can still follow a unique code path via branch predication. This is kind of comparable to an Nvidia GPU having "44 streaming multiprocessors".

But whether you call this 44 or 2816 processors is irrelevant to my main point: a processor that has to decode/execute 44 or 2816 instructions per cycle while supporting complex features like caching, branching, etc., is going to be less efficient than an FPGA with hard-wired logic (edit: "hard-wired" from the viewpoint of "once the logic has been configured").

gchadwick also said integer workloads were "not power efficient" on GPUs, but that's also false. Most single-precision floating-point and integer instructions on GPUs execute in one clock cycle, so they are equally optimized. And of course integer logic needs fewer transistors than floating-point logic, so an integer operation is going to consume less power than the corresponding floating-point operation.


FPGAs don't actually have "hard-wired logic" though: they have a configurable routing fabric that takes up a substantial proportion of the die area and has much worse propagation delays than actual hard-wired logic, leading to lower clocks than chips like GPUs. Being able to connect logic together into arbitrary designs at runtime is pretty expensive.


Thanks for pointing it out. I'm so used to FLOPS being used for benchmarks that I don't even question it anymore; "mega-samples" didn't tip me off that the workload was integer-only.


I think AMD GPUs are much better at integer operations. NVIDIA ones are good at floating point.



