As somebody who worked on SSDs, here is my commentary:
1. Most of the recommendations are things that will have little to no benefit on any modern SSD because all of this is handled for you in the firmware. When SSD firmware is not perfectly optimizing your performance, it is doing so to extend the life of the drive. Take this with a grain of salt though since many SSD manufacturers use really bad code provided by the controller manufacturer (SandForce notably provided awful stock firmware).
2. The commentary on SSD benchmarks is mostly accurate - since performance is very motherboard-chipset-dependent, we chose the best numbers to publish. Can't really blame anyone, but it means that if you bought the drive you are incredibly unlikely to achieve identical performance.
3. "Benchmarks are hard". No, not really. I wrote most of the benchmarks we used for Linux testing. The reason benchmarks could be considered difficult is that the internal benchmarks were written in-house and never released to the public. Reviewers never really had the right tools.
4. Enterprise SSDs and Consumer SSDs are drastically different. If you want consistent high performance, get an enterprise PCI-X drive. NAND quality is the major difference... and good NAND is hard to find. Generally an enterprise SSD will also have a significantly faster controller, and on larger drives, will contain more than one controller to avoid straining resources.
Pretty sure you meant "PCI-Express" (or "PCIe") not "PCI-X". PCI-X is the name of an obsolete bus type.
Totally agree with your view of the advice given.
For example, TRIM has surprisingly little effect because modern SSDs (Micron's, for example) will move cold data to worn cells so that all cells participate in wear leveling. TRIM still saves you something, just nowhere near what you might expect.
Unless you are optimizing for a specific SSD with known characteristics, there is no real way to do that.
The high-level optimizations are important:
* Choose a good SSD
* Read and Write in "page" multiples and "page" aligned
* Use lots of parallel IOs (high queue depth)
* Do not put unrelated data in the same "page"
A page used to be 4KB; SSDs are now switching to 8KB and will move to 16KB later on. Just pick a reasonable size in that range (16KB, if you can manage it, will last you a while). Don't sweat the page multiples too much: SSDs will most likely have to handle 4KB pages for a long while because of databases and the like, so they will keep some optimization around that size anyhow. Using a larger size just makes it easier for them.
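For what it's worth, the "page multiples and page aligned" rule from the list above can be sketched in a few lines. The 16KB page size here is an assumption (drives don't expose their real internal page size), following the sizing advice above:

```python
# Sketch: round an arbitrary byte range out to page-aligned boundaries
# before issuing the read/write. PAGE is an assumption (16 KB); real
# drives don't advertise their internal page size.
PAGE = 16 * 1024

def align_range(offset: int, length: int, page: int = PAGE) -> tuple:
    """Return (aligned_offset, aligned_length) covering [offset, offset+length)."""
    start = (offset // page) * page                 # round down to page start
    end = -(-(offset + length) // page) * page      # round up to page end
    return start, end - start

# Example: a 100-byte read at offset 20000 becomes one full 16 KB page.
print(align_range(20000, 100))  # (16384, 16384)
```

Issuing the aligned range and slicing out the bytes you wanted keeps every request on the drive's page grid.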
I wouldn't heed any of the advice on single-threading; the biggest performance boost comes from parallelism, and writes are buffered by the SSD anyway (a good SSD has a supercapacitor backing a good-sized write cache).
I work in an enterprise setting and get to talk to the SSD vendors and fish for details on how to best use their SSDs, and I still find it hard to get full information. Sometimes I wonder if a single compendium of knowledge about SSDs is even possible. There are so many moving parts in there that I doubt the SSD vendors themselves know what is really happening. Many times we hit strange behaviors, and then the vendors get to debug their SSDs and explain new things about what just happened.
For what it's worth, I know quite a bit about SSD inner workings (or at least how they appear to the outside world; I've never seen a line of SSD firmware source), but it would be very hard for me to sit down and write about all aspects. I can, however, answer questions when needed.
You can't really optimize for SSDs in general. You CAN optimize for a specific brand/model/controller on a certain chipset. If an optimization in general is possible, you can be sure that the SSD manufacturer already is utilizing it in firmware on higher-end drives.
What's interesting to me is that if you read the summary (http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-s...), most of the recommendations are exactly the same as when programming against a traditional spinning-oxide hard drive: read and write entire blocks if you can, combine block writes, &c.
It's nice to know that all that work on high-performance IO in databases doesn't need to be thrown away just yet.
SSDs do behave like complex block devices from the perspective of a database engine. However, SSDs are sufficiently different from spinning disk that optimal I/O patterns are not the same. Some high-performance database I/O schedulers have separate storage behavioral modes (approximately based on common SSD and HDD behavior) depending on the storage characteristics.
Everything on a computer approximates a block device. Even RAM is treated as a type of complex block device in sophisticated, high-performance databases, because it is. Scheduling operations to optimally match the characteristics of block devices is a (the?) primary optimization mechanism.
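As a hypothetical illustration of those per-device behavioral modes, a database I/O scheduler might carry something like the following. The names and numbers are invented for illustration, not taken from any real engine:

```python
# Hypothetical sketch: an I/O scheduler picks different batching
# parameters depending on the storage characteristics it detects.
from dataclasses import dataclass

@dataclass
class IOPolicy:
    queue_depth: int      # how many requests to keep in flight
    merge_adjacent: bool  # coalesce neighboring requests (helps HDDs most)

def policy_for(device_kind: str) -> IOPolicy:
    if device_kind == "hdd":
        # Spinning disk: low parallelism, aggressive merging to cut seeks.
        return IOPolicy(queue_depth=4, merge_adjacent=True)
    if device_kind == "ssd":
        # SSD: exploit internal parallelism with a deep queue.
        return IOPolicy(queue_depth=32, merge_adjacent=False)
    # Unknown device: conservative defaults.
    return IOPolicy(queue_depth=1, merge_adjacent=True)

print(policy_for("ssd").queue_depth)  # 32
```

The point is only that the scheduling decision is parameterized by device behavior rather than hard-coded for one medium.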
Exactly. Random 4K reads at a queue depth of 32 are much, much faster than at a queue depth of 1 on SSDs. On regular hard drives you might see a 200% speedup with a high queue depth; on SSDs you can see a 1000-3000% speedup.
SSDs live off their internal parallelism; if you aren't using a queue depth greater than 16, you are not making real use of an SSD. Q=32 or even Q=64 is usually the right setting.
It all depends, though, on whether you want throughput or latency. If you really, really care about latency, you should balance the queue depth, probably keeping it between 16 and 32. The reason is that with deeper queues you get more collisions on the same die, and then latency suffers. There are read-read, write-read, erase-write, and all the other combinations, but those three are the interesting ones.
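A rough user-space sketch of what a deep queue looks like, keeping up to 32 reads in flight with `os.pread` from a thread pool. The file and sizes are stand-ins; a real benchmark would use O_DIRECT, raw devices, and far larger data sets:

```python
# Sketch: approximating queue depth from user space by issuing many
# reads in parallel. os.pread is thread-safe (no shared file offset).
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

QD = 32          # target "queue depth": requests kept in flight
BLOCK = 4096     # 4 KB, the classic random-read unit

# Build a small scratch file of 256 blocks.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(256 * BLOCK))
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    with ThreadPoolExecutor(max_workers=QD) as pool:
        # One pread per block; the pool keeps up to QD requests in flight.
        results = list(pool.map(lambda i: os.pread(fd, BLOCK, i * BLOCK),
                                range(256)))
finally:
    os.close(fd)
    os.unlink(path)

print(all(len(r) == BLOCK for r in results))  # True
```

Tuning `QD` down toward 16 is the latency/throughput balancing act described above.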
This is true because the NAND has been crippled by imposing the FTL on it. If you could bypass the FTL, the programming advice would be quite different, and, properly applied, it would result in much better long-term performance.
Yeah, you can do some pretty wild things if you have direct access to the flash array.
But then you need to handle stuff like wear leveling, transaction management, ECC and other forms of recovery. And a lot of the stuff you need to do is probably flash-part specific (e.g., read disturbance, probably stuff around channel management and throughput, etc.).
I actually proposed allowing the firmware of a recent consumer product have such access to the flash (because I didn't trust the flash vendor's translation layer), but got shot down. I don't know how that turned out; probably they spent a bunch of time doing qualification (code for: "Fix your damned FTL bugs or we find another vendor. Wait. We don't have time for that. Fix as many as you can, or we'll be mad ... or something. Here, have some money.").
In what way would you do it differently from the FTL? Are you familiar with all the restrictions and limitations of using NAND and how they impact the FTL algorithms?
It's less that I'd do things not currently done in the FTL, and more that it would be deterministic. My algorithm is a cyclic cache that's write-heavy, at least initially targeted at low-end hardware. If I could bypass the FTL, I could ensure that my algorithm wasn't amplifying writes. But with an FTL, which varies from drive to drive, my usage pattern could result in a great deal of amplification.
I'm sure that for any given controller's FTL (and this article claims that there really are only a couple on the market), I could tweak my algorithm to work reasonably well. But that's a sign of a leaky abstraction.
I'd also like access to the small SLC portion of the drive, though I'm working around that for now with journaling.
I'm not an expert in flash memory. My model is basically a block device with larger erase blocks, where the number of erasures each block can handle is limited.
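That mental model can be written down as a toy simulation: erase blocks with a limited erase budget, written cyclically so wear spreads evenly. The block count and erase limit are invented, and real FTLs are far more involved:

```python
# Toy model: a device of erase blocks, each with a limited erase budget,
# written cyclically (log-structured). Entirely illustrative.
NUM_BLOCKS = 8
ERASE_LIMIT = 3000  # an assumed per-block endurance figure

erase_counts = [0] * NUM_BLOCKS
head = 0  # next block to write, advancing cyclically

def write_block():
    """Erase-then-program the block at the head, then advance the head."""
    global head
    erase_counts[head] += 1
    head = (head + 1) % NUM_BLOCKS

for _ in range(NUM_BLOCKS * 10):
    write_block()

# Cyclic writing keeps wear perfectly even across blocks.
print(erase_counts)  # [10, 10, 10, 10, 10, 10, 10, 10]
```

Under this access pattern a translation layer adds nothing; wear is already uniform, which is the determinism argument above.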
Part 6 in particular is a decent distillation of stuff to keep in mind that should apply across SSDs from multiple vendors. It would be nice to have benchmarks that validate the individual assertions, of course.
There's a lot of complexity going on behind the scenes of modern SSDs. Simplistic benchmarks as posted on most review sites often don't address real-world performance. The ones I pay most attention to are things like BootRacer; it's quite remarkable how often a drive will be slower at booting Windows than a competitor, even when it beats it on simple metrics like sequential/random reads and writes at different queue depths.
I can see the potential for firmware being optimized for access patterns that crop up often in benchmarks but less often elsewhere: the aforementioned sequential/random reads and writes with different block sizes and queue depths. Doing well on benchmarks probably drives a lot of sales. But detecting and switching modes may not be latency-free, and shortcuts taken to improve absolute performance in those modes may harm global performance in a real-world I/O mix.
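The mode detection being speculated about could, hypothetically, be as simple as classifying a short window of request offsets; the 80% threshold here is invented for illustration:

```python
# Hypothetical sketch of benchmark-pattern detection: label a window of
# request offsets "sequential" if most requests directly follow on.
def classify(offsets, block=4096):
    """Return 'sequential' or 'random' for a list of request offsets."""
    follows = sum(1 for a, b in zip(offsets, offsets[1:])
                  if b == a + block)
    # Invented threshold: 80% of transitions must be contiguous.
    return "sequential" if follows >= 0.8 * (len(offsets) - 1) else "random"

print(classify([0, 4096, 8192, 12288, 16384]))   # sequential
print(classify([0, 999424, 40960, 524288, 8192])) # random
```

Firmware that special-cases the "sequential" verdict would shine on benchmarks while the detector itself, and any mode switch, costs something on mixed workloads.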
BootRacer tests are really inaccurate. The problem is that boot times depend on many things outside the SSD, namely the chipset, drivers, and BIOS settings. The manufacturer's testing methodology would be to disable all features in the BIOS to reduce variables, which leads to unrealistically low numbers.
Do you have any links showing that an SSD was tuned solely for benchmark purposes? I know people have found such things in the GPU market, but I've never heard such claims in the SSD market.
"Back in the days many SSD vendors were only focusing on high peak performance, which unfortunately came at the cost of sustained performance. In other words, the drives would push high IOPS in certain synthetic scenarios to provide nice marketing numbers, but as soon as you pushed the drive for more than a few minutes you could easily run into hiccups caused by poor performance consistency. Once we started exploring IO consistency, nearly all SSD manufacturers made a move to improve consistency."
Sorry I can't provide links. But I work for a large tech company, currently qualifying SSDs for a new project. We ran into serious bugs with drives from a particular well-known SSD manufacturer (certain workloads cause drives to become unresponsive until power is cycled) and were able to demand detailed root cause analysis. In at least two cases they uncovered new bugs in code that had been inserted purely to improve the numbers for 4k random writes on certain benchmarks. These are consumer drives, not enterprise, and are current production.
I've had that much acknowledged by SSD vendors. Each one blames the others for the benchmark wars, but they all know the benchmarks are out there and that customers will choose an SSD based on published numbers, so they tune the firmware to make sure they don't look bad in the benchmarks.
I was even told of an SSD that was built to perform consistently over its life span, but some of that consistency work (i.e., slowdowns at the start of life) was held back and only kicked in after so many power-on hours, so that it would not look bad in benchmarks.
It's a fact of life that customers rely on benchmarks, and vendors cannot educate every customer, so the masses out there need to see a good benchmark; that's what gets optimized for.