Removal from Alder Lake was indeed regrettable, but AVX-512 is still included in several recent client (non-server) CPUs: Ice Lake, Rocket Lake, Tiger Lake.
The frequency scaling problems most prevalent in Skylake are significantly better as of e.g. Ice Lake. It's complex to summarize, but broadly speaking those instructions request a higher "power license" (i.e. more power delivery) when used. For single/dual-core workloads on my laptop there is no frequency scaling loss IIRC, and even at full bore on all 4 cores I think the drop is from 3.6 GHz to 3.5 GHz nominal max clock. Ice Lake Client only has 1x FMA unit, however; I don't know of any benchmarks of Ice Lake-SP, where there are 2x FMA units. You can reliably handle full-datapath AVX-512 usage in client workloads these days, on Ice Lake or later, IMO. (Of course, when you're on a laptop, 3.5 GHz vs 3.6 GHz doesn't make a noticeable difference to how fast your battery drains...)
Wide-datapath designs will generally have some tradeoffs, like needing to ramp up power delivery, so there will be things like extra initial latency for wide-datapath instructions, etc. That's kind of inevitable; I suspect the same will be true of the new fancy "variable length" vector ISAs too, if the underlying implementation and vector usage are wide enough.
Also: you don't need to use wide 512-bit vectors with AVX-512! You can use the instruction set with 128-bit or 256-bit vectors just fine.
The laptop chips have never had "downclocking problems" in the sense that Skylake-SP did. AFAIK they are actually some of the top performers for PS3 emulation, etc.
Actually, Skylake-X/Xeon-W supposedly had much lower AVX downclocking than Skylake-SP too... InstLat64 posted a tweet at one point showing it was 10-20% for workstation vs 30-40% for server, IIRC. The tweet has since been removed, unfortunately.
Intel really throttled it down on server chips, for whatever reason. Probably didn't want datacenter chips to run the Unlimited Voltage that was necessary for full-clock dual-unit AVX-512 on 14nm.
It depends on the CPU type (bronze/silver < gold/platinum). My workstation saw 1.4-1.6x end-to-end application speedups, throttling included (even on multiple cores), for JPEG XL decoding and vqsort.
[ General observation, not directed at parent comment: ]
Frequency throttling, even on the most affected Skylakes, has always been a non-issue if you run, say, 1 ms worth of continuous SIMD instructions. How could a 10-40% clock drop negate the speedup from 2x vector width, plus double the registers and a much more capable instruction set?
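To put rough numbers on that question, here is a back-of-envelope sketch. It assumes throughput scales linearly with both vector width and clock frequency, which is an idealization (real code is also bound by memory bandwidth, port pressure, etc.), and it ignores the extra registers and instructions entirely. The function name `net_speedup` is made up for illustration.

```python
def net_speedup(width_ratio: float, clock_drop: float) -> float:
    """Idealized speedup of wide-vector code over narrow-vector code.

    width_ratio: vector width ratio, e.g. 2.0 for 512-bit vs 256-bit.
    clock_drop:  fractional frequency loss while running the wide code,
                 e.g. 0.40 for a 40% drop.

    Assumption (not a measurement): throughput ~ width * frequency.
    """
    if not 0.0 <= clock_drop < 1.0:
        raise ValueError("clock_drop must be in [0, 1)")
    return width_ratio * (1.0 - clock_drop)


if __name__ == "__main__":
    # Sweep the 10-40% range quoted in the thread for a 2x width increase.
    for drop in (0.10, 0.20, 0.30, 0.40):
        print(f"{drop:.0%} clock drop -> {net_speedup(2.0, drop):.2f}x")
```

Under these assumptions the break-even point for a 2x width increase is a 50% clock drop, so anything in the 10-40% range still comes out ahead, as long as the SIMD section runs long enough for the one-off transition cost not to dominate.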
You can configure throttling in the BIOS if you have Skylake-X.
You can set it to whatever you want. The caveat being it won't necessarily be stable (depending on voltage) nor will your cooling necessarily be able to handle it.
My 10980XE runs AVX2 at 4.2 GHz all-core, and AVX-512 at 4 GHz.
Not entirely. With Skylake, there was a clock speed penalty for "heavy" 256b instructions and "light" 512b instructions, and a steep clock speed penalty for "heavy" 512b instructions. With Ice Lake, there is a very small single-core clock speed penalty for 512b instructions. There is no clock speed penalty after Ice Lake. (Which, for non-server CPUs, is currently a list containing one generation: Rocket Lake.)
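The Skylake tiers described above can be written down as a small lookup. This is only a sketch of the description in this comment: the tier numbers 0/1/2 and the name `skylake_clock_tier` are illustrative labels, deciding which instructions count as "heavy" is left to the caller, and treating the cases the comment does not mention (128b, and "light" 256b) as penalty-free is an assumption.

```python
def skylake_clock_tier(vector_bits: int, heavy: bool) -> int:
    """Classify an instruction into a Skylake clock-penalty tier.

    Returns 0 (no penalty), 1 (clock speed penalty) or 2 (steep penalty),
    following the description in the comment above. Tier numbers are
    illustrative labels, not official terminology.
    """
    if vector_bits not in (128, 256, 512):
        raise ValueError("vector_bits must be 128, 256 or 512")
    if vector_bits == 512:
        # "light" 512b -> penalty, "heavy" 512b -> steep penalty
        return 2 if heavy else 1
    if vector_bits == 256 and heavy:
        # "heavy" 256b shares the tier of "light" 512b
        return 1
    # Assumed: everything else runs at full clock.
    return 0


if __name__ == "__main__":
    for bits in (128, 256, 512):
        for heavy in (False, True):
            kind = "heavy" if heavy else "light"
            print(f"{bits}b {kind}: tier {skylake_clock_tier(bits, heavy)}")
```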
https://www.tomshardware.com/news/intel-nukes-alder-lake-avx...