I worked on a PCIe Gen3 retimer. The problem was that you were trying to fool both ends into thinking you weren't there, which was kind of violating the spec. PCIe Gen4 explicitly defines the terms that allow you to make a spec-compliant chip that retimes the signal.
> fool both ends into thinking you weren't there, which was kind of violating the spec.
Long-range USB (e.g. for conference rooms) has the same problem. The maximum one-way propagation delay is a hard restriction in the standard. You have to somehow cheat the host into thinking the device is busy before the peripheral at the far end is able to respond. To this day, I believe no standard solution exists; it seems everyone's reinventing the wheel in their FPGAs. Since it's only a niche application, standardization is unlikely [0]. BTW, FireWire did have a proper solution to this problem, because good networking was one of its design goals.
[0] And for relatively short runs it's not usually a problem - cascading USB hubs is good enough; that's how most active extension cables work, as one-port hubs.
USB sucks. Big time. Way, way more than PCIe. PCIe is great. PCIe will run over wet string*. USB... USB is a world of pain.
This is the same problem that any kind of USB virtualization/network tunnelling has. Also, anyone trying to build a USB2 to USB3 transaction translator (which the spec, unforgivably, omitted). AFAIK there is one USB2 to USB3 TT chip on the market and it's unobtainium.
It works if you lift things back to the URB level to some extent, and "virtualize" the lower protocol layers, but there are corner cases once you introduce a frame of latency or more, which you have to. The host polls for data, then you're polling the device for data. You get back some data, but the next frame the host has stopped polling for data. Now you have some data you have to drop on the floor. Not good. You could choose to delay a successful ACK until you get it from the other side (and thus pretend like the first time the data was sent it was corrupted, even though it was forwarded successfully), but now you've massively decimated your maximum throughput. The same problem happens in the other direction too.
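The race described above can be sketched as a toy frame-by-frame simulation. Everything here (the `run` helper, frame numbers, the one-frame latency model) is invented for illustration; real USB scheduling is far messier:

```python
# Toy frame-by-frame model of a store-and-forward USB proxy.
# run() returns which solicited data was delivered vs. dropped.

def run(host_polls, device_data, latency=1, frames=10):
    """host_polls: frames in which the host issues an IN poll.
    device_data: frames in which the device has data ready.
    latency: extra frames the proxy adds before data comes back."""
    delivered, dropped = [], []
    pending = {}  # arrival frame -> data the proxy already ACKed downstream
    for frame in range(frames):
        if frame in pending:
            if frame in host_polls:
                delivered.append(pending.pop(frame))
            else:
                # Host stopped polling, but the device was already ACKed:
                # the data has nowhere to go.
                dropped.append(pending.pop(frame))
        if frame in host_polls and frame in device_data:
            pending[frame + latency] = f"data@{frame}"
    return delivered, dropped

# Host gives up after frame 2; the device's frame-2 data arrives back at
# the proxy in frame 3 and gets dropped on the floor.
print(run(host_polls={0, 1, 2}, device_data={2}))  # ([], ['data@2'])
```

Had the host polled for one more frame, the same data would have been delivered cleanly; the proxy has no way to know which case it is in.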
Thankfully, in practice these race conditions often hit software anyway (i.e. if you cancel a submitted URB and data came back, it was already ACKed and it'll be dropped on the floor), so you can get away with it, since drivers should be designed not to screw this up. But it's still easier at the software level (e.g. virtualizing) because you actually know what the host is trying to do. I wrote the QEMU virtual xHCI code, and the original prototype, which ignored the QEMU USB subsystem and passed through straight to the host, worked very well (better than anything VMware/VirtualBox were doing) because xHCI is high level enough. But at the wire level, you don't even know when the host has cancelled a transaction other than just... because it chooses not to send packets in a given frame... and what if it just decided not to poll that frame but it will poll later? Now you need timeout heuristics... it's horrible.
* seriously, I've tunnelled PCIe over 115200 baud RS232. You can do all kinds of horrible things to PCIe and it will still work.
> This is the same problem any kind of USB virtualization/network tunnelling has.
+1.
I've been using QubesOS on my workstation for a while, and I found that the problems associated with USB virtualization/tunnelling also have a major impact on QubesOS's usability (USBility?). Long story short, QubesOS's security model depends on the ability to isolate individual peripherals at the hypervisor level. For PCI-E it's easy, but for USB it's a headache, with various security, compatibility and performance problems. When one of those problems hits, the only workaround is assigning an entire EHCI or xHCI USB host controller to a VM at the PCI-E level, and this is only feasible on machines with multiple USB controllers.
> I wrote the QEMU virtual xHCI code and the original prototype, which ignored the QEMU USB subsystem and passed through straight to the host, worked very well
I also got the idea to build something similar, but in hardware - put 4 individual PCI-E USB host controllers behind a PCI-E switch, so that each USB port can be seen as an individual PCI-E device. Now the USB problem in Qubes is solved forever...
Yes, this is exactly what I was describing, and it can be the solution. However, I was thinking of a downscaled USB 2.0 version - it costs only a fraction as much, and it doesn't eat all my PCI-E lanes (20 Gbps vs 1.92 Gbps; the latter only needs a PCI-E x1 Gen 2 link).
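For the curious, here is the arithmetic behind those figures (nominal signaling rates, protocol overhead ignored except PCIe's 8b/10b encoding):

```python
# The arithmetic behind the 20 Gbps vs 1.92 Gbps comparison.
ports = 4
usb3_gbps = 5.0    # one USB 3.0 SuperSpeed port
usb2_gbps = 0.48   # one USB 2.0 high-speed port

full_version = ports * usb3_gbps  # 20.0 Gbps aggregate
downscaled = ports * usb2_gbps    # 1.92 Gbps aggregate

# One PCIe Gen 2 x1 link: 5 GT/s with 8b/10b -> 4 Gbps of usable
# bandwidth, comfortably enough for the downscaled USB 2.0 version.
pcie_gen2_x1_gbps = 5.0 * 8 / 10
assert downscaled < pcie_gen2_x1_gbps < full_version
```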
A retimer is a much simpler device (thus cheaper) and can be tested using linear methods. To test a line with a redriver you really need a full on BERT system + a VNA whereas you can get by with just the VNA for the retimer. To accurately characterize PCIe Gen 4 and 5 you need to go to almost 50 GHz of bandwidth on your test equipment so you are talking big money on hardware for test, design, and validation.
Meanwhile, a redriver doubles your power budget on the driver since it literally is a second set of drivers in the middle of your PCIe line.
This is an ad from a company that sells redrivers. It's an informative ad, but still an ad, so it should be viewed with a critical eye.
You're right, I flipped the names in my head. I probably shouldn't be posting so late so I'm going to bed after this comment.
You can go a long way with careful design and quality redrivers, but at some point it will be better to go to a retimer. Which you should use is a matter of use case and engineering decisions. Just please be skeptical of the conclusion that "Usage [of redrivers] is highly discouraged; use at own risk" from a company that specifically designs and sells retimers.
From my experience any company that is developing interconnects either has the needed test equipment or uses a test house. It seems strange that access to test equipment should factor into an engineering decision.
The test equipment costs hundreds of thousands of dollars apiece, and everything used for PCIe 3 doesn't have the bandwidth to go to PCIe 4. The bigger companies have no problem scaling up, but for smaller players it's a serious consideration. Even at satellite offices of larger companies, it can be difficult to get the capital outlay for your lab to buy something like a BERT. We're talking millions of dollars at scale. Fortunately, most of the equipment that works for PCIe 4 will also work for PCIe 5.
I worked in a test house for seven years. The last thing I did before I left is automate PAM4 25GBaud transmitter and receiver testing. I am intimately familiar with the costs. I am also familiar with how necessary it is for every player to have access to this equipment.
And so if there is a need to increase bandwidth, there could be an engineering decision to either go with faster links or more links at the existing speed. Faster links carry the extra cost of more expensive test equipment; more links do not. Strictly speaking, perhaps engineering decisions shouldn't consider cost and that's one for the bean counters, but I don't think that's the reality for smaller teams, where those two worlds merge.
You're right that a design, verification and test bench can easily rack up several hundred thousand dollars worth of equipment.
But there are trade-offs that can lower the capital cost, like leasing the equipment, using a third party for aspects of the design/test, carefully evolving known good designs and simply budgeting a lot more for more PCB re-spins.
In the final analysis, those who want to make products in this domain aren't doing it on a shoestring budget; it's going to cost a lot no matter what.
You're lucky to get latency under a microsecond on PCIe to begin with. A buffer of 20 bits would be about 2.5 nanoseconds at Gen3 rates and completely negligible.
Though apparently retimers are allowed to spend up to 64 nanoseconds? What takes so long? Especially when the PCIe spec casually asserts the existence of <10ns retimers.
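Quick arithmetic on those numbers, using the per-lane signaling rates (8/16/32 GT/s for Gen3/4/5):

```python
# Back-of-envelope: what a 20-bit or 64 ns buffer means per lane.
rates_gt_s = {"Gen3": 8.0, "Gen4": 16.0, "Gen5": 32.0}

for gen, rate in rates_gt_s.items():
    ui_ns = 1.0 / rate  # one unit interval (bit time) in ns
    print(f"{gen}: UI = {ui_ns * 1000:.0f} ps, "
          f"20 bits = {20 * ui_ns:.2f} ns, "
          f"64 ns = {64 / ui_ns:.0f} bits of buffering")
# Gen3: UI = 125 ps, 20 bits = 2.50 ns, 64 ns = 512 bits of buffering
```

So the 64 ns allowance corresponds to hundreds of bits of buffering per lane, far more than a bare elastic buffer would need.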
Nice article. Didn't even consider you'd ever need something like this in a single system. Also never considered that a "redriver" would be able to help at all with multi-GHz signals.
(Note: I'm very unknowledgeable about this topic.)
Redrivers and retimers are also known as "signal conditioners", because that's what they do: improve the quality of a marginal or non-compliant signal. They are used in many different interfaces to accommodate long cable or board runs - USB, HDMI, DisplayPort, and SATA, to name a few [0]. USB is a relatively new market since USB 3.1 and USB 3.2, due to its signal integrity challenges at high data rates.
Retimers are digital repeaters. Redrivers are mostly analog, but they are not unity-gain amplifiers: they often apply equalization and emphasis internally to compensate for distortion at high frequencies.
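As a rough illustration of what emphasis does, here is a symbol-level sketch. The model and the 0.66 coefficient are invented for illustration, not taken from any datasheet:

```python
# Symbol-level sketch of transmit de-emphasis: after the first symbol of
# each run, the output swing is reduced, so transitions (the
# high-frequency content the channel attenuates most) come out
# relatively stronger. The 0.66 coefficient is illustrative only.

def de_emphasize(symbols, attenuation=0.66):
    """symbols: sequence of +1/-1 values."""
    out, prev = [], 0
    for s in symbols:
        out.append(float(s) if s != prev else s * attenuation)
        prev = s
    return out

print(de_emphasize([1, 1, 1, -1, -1, 1]))
# [1.0, 0.66, 0.66, -1.0, -0.66, 1.0] -- full swing only at transitions
```

After the lossy channel rolls off the high frequencies, the pre-distorted signal ends up closer to a clean eye at the receiver.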
> A retimer is a mixed signal analog/digital device that is protocol-aware and has the ability to fully recover the data, extract the embedded clock and retransmit a fresh copy of the data using a clean clock.
How does the retimer recover the underlying clock if the signal is (say) all zeroes for an extended period of time?
And why is clock recovery from the data signal necessary in the first place? Can't the clock be recovered from a clock signal itself? In other words, why isn't the clock signal one of the inputs to the retimer?
> And why is clock recovery from the data signal necessary in the first place? Can't the clock signal be recovered from the clock signal itself?
In many (if not most) serial communication systems, there's no clock signal. The clock is embedded in the data itself. The data is encoded in a way that always guarantees enough transitions (e.g. a long run of '0' or '1' is restricted), so that the link partner can use the data itself as a reference to recover the clock. This is true for USB, Ethernet, PCI-E, and most fiber-optic communication systems (installing another cable just for clocking would be rather awkward), among others.
Ideally, one data signal is all we need. This is known as a self-clocking signal [0].
Thanks for the explanation, but when I look at a typical PCIe card, I see many connections (of which the clock could easily be one). It seems strange to convert to a serial connection with only two wires.
Indeed, among the 18 pins in a PCI-E x1 slot, only four wires form the actual data link: two for RX and another two for TX [0]. In terms of space, a PCI-E connection is really a piece of cake.
And interestingly, to make a designer's life easier, PCI-E does have a reference clock signal, but it doesn't directly control the data transfer - it's only there to establish a stable frequency reference, so a card doesn't have to include its own clock (which remains a possible option). Since PCI-E Gen 2, its use is entirely optional, because clock recovery from the data is possible; see [1].
> How does the retimer recover the underlying clock if the signal is (say) all zeroes for an extended period of time?
Data is encoded such that long runs of all-zeros or all-ones can't happen. For Gen1 and Gen2, every eight bits are encoded as ten bits; look up "8b/10b encoding" for details. From Gen3 on, blocks of 128 bits are XORed with a pseudo-random (LFSR-generated) sequence and prefixed with a two-bit sync header; the scrambling keeps the numbers of one and zero bits roughly equal. This is "128b/130b encoding".
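The scrambling idea can be sketched with a toy additive scrambler. The 16-bit register and taps below are made up for illustration (PCIe actually specifies a 23-bit LFSR per lane), but the principle is the same: an all-zero payload still produces transitions on the wire, and applying the scrambler twice recovers the original:

```python
# Toy additive scrambler: XOR the data with an LFSR keystream that runs
# independently of the data. Register width and taps are illustrative.

def scramble(data: bytes, seed: int = 0xFFFF) -> bytes:
    state = seed
    out = bytearray()
    for byte in data:
        keystream = 0
        for _ in range(8):
            bit = state & 1
            keystream = (keystream << 1) | bit
            feedback = (bit ^ ((state >> 2) & 1)
                            ^ ((state >> 3) & 1)
                            ^ ((state >> 5) & 1))
            state = (state >> 1) | (feedback << 15)
        out.append(byte ^ keystream)
    return bytes(out)

wire = scramble(b"\x00" * 8)          # all-zero payload...
assert wire != b"\x00" * 8            # ...still toggles on the wire
assert scramble(wire) == b"\x00" * 8  # descrambling = scrambling again
```

Because the keystream depends only on the seed, both link partners stay in sync as long as they agree on where a block starts, which is what the sync header is for.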
> And why is clock recovery from the data signal necessary in the first place? Can't the clock signal be recovered from the clock signal itself?
Recovering the clock signal from the data avoids the need for an additional clock wire.
The clock signal is not sent because it would require another high-bandwidth wire and would introduce problems of clock skew within a lane. In other words, having a separate clock means you need to ensure that the clock and data signals are synchronized at the receiver, which is pretty difficult to achieve.
PCIe has this problem for lane-to-lane skew, but not within a lane. The limit for lane-to-lane skew is 1.6 ns. If each lane might also have similar skew between the data and clock, then the per-lane bandwidth would be limited to a few hundred megabits per second (instead of the achieved gigabits per second).
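A rough bound on what that skew would do to a hypothetical parallel bus (the half-bit-period budget used here is a simplification, not a spec figure):

```python
# If clock and data on a parallel bus could be skewed by up to 1.6 ns,
# the bit period must stay well above the skew. Using a half-period
# budget as a crude criterion:
skew_ns = 1.6
max_rate_mbps = 1000 / (2 * skew_ns)  # bit period >= 2 * skew
print(f"~{max_rate_mbps:.0f} Mbit/s per wire")  # ~312 Mbit/s per wire
```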
This article makes me realize how messy the underlying world of digital communication really is. As a software engineer, I send a file "into the cloud" and assume that it is going to be magically copied bit-by-bit with zero errors while going from my disk through the PCIe bus, the Ethernet cable, the fiber and similarly on the other side. Turns out "digital" is just an abstraction of messy, noisy, distorting, attenuating physical reality.
You may enjoy reading about the fiberglass weave effect in circuit board design [0][1]. At multi-GHz frequencies, it turns out the grid-like microscopic structure of the circuit board creates dielectric constant and impedance variations, which can cause distortion in high-speed signals such as PCI-E. The solution? Always route the signal traces at an angle, so that horizontal and vertical runs parallel to the microscopic grid are avoided.
At this point even the direction of the traces matters, and it's not even cutting-edge - every commercial server motherboard is designed with this effect in mind. Pretty amazing, isn't it?