SK Hynix Reveals DDR5 MCR DIMM, Up to DDR5-8000 Speeds for HPC (anandtech.com)
63 points by rbanffy on Dec 12, 2022 | hide | past | favorite | 43 comments


They really need to step up their game in making higher-capacity modules available to the consumer HEDT market. I think the current roadmap is to introduce 48 and possibly 64GB modules at the end of 2023. Without that, many folks who’ve used a 4x32GB DDR4 configuration for professional work are left out, since current memory controllers and modules can’t support that for DDR5 at reasonable speeds.

In other words, HEDT market doesn’t need faster RAM as much as it needs more of it.


> making higher capacity modules available to consumer HEDT market

One way I'm able to function is by getting tower servers. Selection isn't great, and they cost a lot of money, but you can get a machine with 8 or 12 memory slots that still fits under your desk (and that is surprisingly silent while you browse or read e-mail - it only comes to life when you start getting cores saturated by running things like `make -j 64`).


How noticeably slower is running 128GB of DDR5 on AM5? Is it perceptible? I’m building a 7950X rig and am perplexed by the RAM. The specs from AMD are:

Max Memory Speed: 2x1R DDR5-5200, 2x2R DDR5-5200, 4x1R DDR5-3600, 4x2R DDR5-3600

Does anyone know if CAS latency (CL) also drops when using 4 sticks? For example, if I use 4x32GB DDR5-5600 CL36-36-36-89, will it drop to 3600 while maintaining CL36, or will CL also slow to something like CL40?


Basically, on AM5, 4x32GB sticks nominally run at only 3,600 (i.e. no better than DDR4, and probably at slightly higher latency). On the Level1Techs forums, some folks got it working as high as 4,800, which seems to be the ceiling, but even that is far from guaranteed.

On Intel the situation isn’t really any better. On paper the stock speed is 4,000 (i.e. +400 MT/s compared to AMD), but I have yet to hear of anyone running 5,200 fully stable in this config.

Also, primary timings like CL36 don’t really matter much for DDR5. There’s a whole Actually Hardcore Overclocking YouTube video on it.


Those are the bare minimum speeds that AMD guarantees. You can find real world info about this on reddit. My memory of this is a bit old, but iirc the Zen 4 memory controller can pretty much always do 4800 MT/s on 4 sticks. And you might be able to do 5000/5200 or maybe even a bit more depending on your motherboard & RAM. The situation has been getting better with newer BIOS versions; speeds were much lower around the initial Zen 4 release.


Found some links from an old notepad. These are close to a month old, so stability at higher speeds should have improved since then.

ASRock Steel Legend runs at 4800(?): https://old.reddit.com/r/ASRock/comments/xtt3ol/x670e_steel_... His initial 5200 was actually unstable; a comment says he went down to 4800, though it’s not clear whether 4800 is stable.

Gigabyte board could run 128GB "stable" at 5200 (need to expand the comment thread near the bottom for that): https://old.reddit.com/r/gigabyte/comments/y1ns7j/aorus_x670... Expanded comment about 5200 stable, with a link to a benchmark: https://old.reddit.com/r/gigabyte/comments/y1ns7j/aorus_x670...

TUF X670E can only run 128GB at 4400, not 4800: https://old.reddit.com/r/Amd/comments/ya3gxx/psa_dont_buy_an...

On a ROG Strix X670, a commenter could run 4x16GB at 6000 (6400 was unstable). The commenter previously could only do 5200; a BIOS update made 6000 work. The OP with 4x32GB was stable at 4400, not 5000, and didn’t say about 4800. https://old.reddit.com/r/overclocking/comments/xzau6z/x670_m...

Many comments here: https://old.reddit.com/r/Amd/comments/xzj69v/is_anyone_runni...

He's running 128GB at 5200 on a Gigabyte Aorus Master X670E: https://old.reddit.com/r/Amd/comments/yggpr5/7950x6900xt128g...


Yeah - I think the key here is 4x32GB specifically, not just 4 sticks. 128GB is much, much harder on the memory controller, since those are dual-rank modules (or pseudo-quad-rank, due to how DDR5 is wired compared to DDR4), and there are four of them. So even 4,800 is far from guaranteed in this situation on any platform at the moment, and it’s unclear if and when the situation will improve, barring larger modules in 2x48GB and 2x64GB configurations (which would be much easier to run) or a next generation of chips with stronger memory controllers.


> consumer HEDT market

high-end desktop for anybody curious


Thanks for clarifying. HEDT is somewhat of a stretch here, since technically there are no current consumer-centric HEDT platforms like Threadripper or X299; in their place I’m talking about the next-best choices like the 7950X and 13900K.


Also, the line between HEDT and low-end tower server is quite blurry.


Why would this ever have needed an abbreviation??


The abbreviation makes it clear when you're talking about the specific, distinct product lines that are separate from the mainstream desktop hardware product lines, as opposed to when you are merely talking about a desktop that is high-end in some vaguely-defined way.

But the abbreviation is less useful when the continued existence of the HEDT product segment itself has been repeatedly called into question over the past few years. The death of multi-GPU gaming and explosion of CPU core counts available on mainstream desktop platforms have removed the two main reasons for having a separate CPU socket for high-end desktops. (Historically, HEDT products were usually at least artificially segmented from workstation products; the latter segment is not in danger of disappearing.)


I think Intel is in the process of rebranding HEDT into "Workstation" market.


Seems like you wanted Optane memory before it was cancelled:

>Intel® Optane™ persistent memory is available in capacities of 128 GiB, 256 GiB, and 512 GiB and is a much larger alternative to DRAM which currently caps at 128 GiB.


This would benefit sequential access, but it'd either be disabled for random-access or pollute the caches with unused lines.

But in cases where sequential memory bandwidth is required, this is pretty cool! (But I assume Intel only, which would also be a bummer)


RAM is the new tape...


Somewhat ironic that DDR5 just reduced RAM channel width from 64 bits down to 32 bits, with each DIMM carrying two independent subchannels. (Newsflash if you missed it: desktop DDR5 is quad-channel, but 32-bit, yes.)

SK Hynix: what if we increase the RAM width back to 64?


Yesterday I was shown an ad disguised as an educational YouTube video, but the content was really good and is worth watching if you have some time on your hands.

"How does Computer Memory Work?" [0]

Uploaded 1 month ago, 35 minutes playing time.

It even goes into explaining what the timings mean while showing an animation for it.

[0] https://www.youtube.com/watch?v=7J7X7aZvMXQ


Thanks, this video is amazing. I wonder if future students will just watch weeks of super high quality videos like this for the basic topics instead of listening to bored professors reciting the same slides for the 20th time. Would be a lot more time effective, you'd think!


Can we get modular GDDR6X VRAM for machine learning please?


The Xeon Max has HBM built-in. Kind of the same idea that came with Xeon Phis, but updated.


We seem to be getting to the point that CPUs are no longer I/O bound. This is sort of the end of an era, since starting as early as the 2000s we entered a time when I/O (RAM, disk, network) was the main bottleneck for most forms of compute.


I disagree. RAM is still a huge bottleneck. Even when there is sufficient bandwidth, latency is a performance killer.

I just spent most of the weekend trying to optimize a hash table lookup that is one of our biggest sources of cache misses (and CPU stalls). The CPI (cycles per instruction) in that function started at 13.9, and I have it down to 7.5 by reordering a few fields and cacheline-aligning the struct (so as to incur one cache miss per iteration rather than two). Now I need to figure out what's wrong with the hash function: the table should be big enough to hold everything without much pointer chasing on average, but I'm seeing at least one pointer dereference on average before we find the entry we're looking for.


SDRAM latency is definitely not keeping up with improvements in throughput: if you go back to the time of the PC100 SDRAM standard (late 90s?) it was not uncommon to see RAM with CAS latency of 2 clocks and overall timings compatible with a latency of around 20-25ns; now with DDR5-4800, CL 34 is standard and latencies have only come down to about 15ns.

This despite the number of MT/s increasing 48-fold.


Computer architects use the term 'memory wall' for this latency problem. CPU microarchitectures are constantly improved to raise IPC, and process technology helps increase CPU frequency, but memory access latency is not keeping up with the CPU improvements.


Right - I think that, at this point, L1 cache latency is worse on a 'per CPU clock' basis than main-memory latency was in the late 1990s!


Makes sense for a larger cache to take longer to decode an address.


I've had a devil of a time looking this up. Perhaps you'd know:

Back in the Amiga days, 60ns SIMMs were common. Some CPU accelerator boards could run faster if you upgraded to 50ns SIMMs instead. What did those numbers refer to? Latency? Time to fetch a byte? Something else?

I'd be interested in an apples-to-apples comparison of the old hardware with the new.


Those times referred to the speed of the actual ICs on the PCB. A “60 ns” DRAM IC takes up to 60 nanoseconds from the (input) address lines going stable to the (output) data lines going stable. If you sampled the data bus before those 60 nanoseconds were up, you might get incomplete data. Swapping in 50 ns modules means the ICs were verified to need less time before the data bus was valid.

It’s a bit like overclocking your memory nowadays. Basically, the 50 and 60 nanosecond parts might be the same silicon, but the 50 ns ones were validated to perform at that speed. Today, a 3200 MT/s DDR4 module doesn’t mean it won’t run at 3600 MT/s; just that it was only verified to run at the former.

---

The big difference between the two memory formats, however, is that DDR is pipelined. In the old days, you would present an address on the bus, hold it, and then wait for the data to come back. Only after sampling the data could you request a new address. DDR, being pipelined, allows you to request an address, and, before the data comes back, request a new address. After a while, the data from the first address would come back, followed by the data from the second.

That alone makes apples-to-apples comparisons hard.


With original DRAM you had the row strobe, the column strobe, and then you could read out your data. You then had to precharge the row again before doing another access. Page-mode DRAM improved on this by allowing you to keep the row open and read multiple columns, but you had to wait until the read completed before presenting the next column address. Extended-data-out (EDO) DRAM extended this by letting you present the next column address while the previous read's data was still being output.

So the ability to pipeline is much older than DDR. DDR SDRAM is an evolution of SDRAM, which was the first variety of DRAM that is actually clocked (and came after EDO); the main innovation is transferring on both edges of the clock (falling as well as rising) - hence double data rate.


Ah, thank you for that! So you could fetch 1/(60ns) bytes per second synchronously, then (assuming the CPU could run in a loop that tightly)?


No; because after the read, you need to precharge the row again (due to the design of dynamic RAM).

ref. https://en.wikipedia.org/wiki/Dynamic_random-access_memory#M...

The quoted "50 ns" DRAM has a read-cycle time of 84 ns.


Oh, interesting. I’ll read up more on that.


Any chance you can switch the hash table to probing rather than chaining? Probing is much better for locality, because on a collision you just look at the next element in RAM.


This table is not well suited for that, but it did get me interested in re-structuring it. Thanks for the suggestion!


Hash function quality varies a lot. Especially if you are memory-limited, it can help to use a round-reduced cryptographic hash to rule out unnecessary collisions. Example: Speck (= add/rotate/xor), but just reducing the number of rounds. And if that is again too expensive on the CPU side, we can always vectorize by inserting or looking up batches of keys at a time.


Small tangent. Would you mind sharing how you calculate CPI? Do you just do timing and use some base clock rate? With the CPU frequency being so variable (with boost clocks) I imagine the correct way is with some fancy instrumentation.


Perhaps cachegrind or Linux perf? Variable CPU frequency is not an issue, because you can disable turbo boost and limit the frequency to something the CPU will not throttle at. You can also pin the process to a core with taskset to avoid migration, and cset shield can keep the CPU from executing other tasks during latency measurement (although you will still see some jitter from interrupts).


> We seem to be getting to the point that CPUs are no longer I/O bound.

These memory modules are for server and HPC use, where memory bandwidth is still a major limitation for many workloads.

Your desktop CPU may not be memory I/O bound for your average single-task desktop use case, but a 128-core server running intense workloads or doing HPC can definitely be I/O bound.

Even desktop gaming applications show benefits from memory overclocking in most cases, so it’s not something that can be dismissed.


> Your desktop CPU may not be memory I/O bound for your average single-task desktop use case, but a 128-core server running intense workloads or doing HPC can definitely be I/O bound.

Indeed. From my measurements (very unscientific), processes stall first for IO, then for CPU resources, then from memory.


IO is often still the bottleneck. Although bandwidth is larger, main-memory access latency is still high; that's why AMD's 3D V-Cache can improve performance just by providing more cache.


RAM latency is still an issue though and often the actual bottleneck.


Deep learning training is usually I/O bound these days - definitely on the GPU side, and I believe the same on CPUs. It's not clear many people are using CPUs for DL training, though; CPUs are used mostly for inference, when response time is not critical and models are large (with large RAM requirements).



