I have, in a server I’ve built, some new Exos X16 drives. These drives are interesting in that they support dynamic switching between 512 byte sectors and 4096 byte sectors - which means that one can actually compare like-for-like performance across sector sizes!
Why am I doing this? Mostly, I can’t find the answers anywhere else - which means I ought to do a blog post after some benchmarking.
Sector Sizes? Huh?
I would expect some of my readers to be familiar with sector sizes and such (if you’ve been a sysadmin doing aggressive RAID tuning, or if you’ve been down in filesystem weeds for some reason or another), but if you’re not:
Virtually all computer storage works (at the implementation level) in “blocks” of data. Traditionally (setting aside a lot of weird systems I’m not going to cover here), computer hard drives (of the spinning rust variety) have used 512 byte sectors. You address the drive at the sector level, and can read or write one or more sectors, but you can’t change data within a sector without rewriting the whole thing. The OS handles all the abstractions here so you can change a byte in a file without worrying about it, but the disk only speaks blocks.
At some point, to allow for larger drive sizes without having to totally change the limits of the interfaces, some drives started using 4k sectors - 4096 bytes per block, or 8x the traditional size. However, quite a few filesystems in widespread use couldn’t understand the concept of 4k sectors, so the drives typically had some internal emulation for dealing with 512 byte sectors - the OS can talk to the drive with 512 byte sectors, and the disk translates to the actual media. Performance on this varies wildly depending on how closely aligned things are, but in theory, a filesystem should be able to work around this and offer pretty good performance, even if the drive is doing internal emulation.
But these drives support actually switching how they report themselves - they can either present 512 byte sectors to the OS and emulate internally, or they can present 4k native sectors. Does it actually matter? I didn’t know - so I did the work to find out! And, yes, it does.
Detecting Sector Sizes
Modern hard drives, nicely enough, give you the ability to query them for their sector sizes. There are a few ways to do it, but I like hdparm, so I’ll use that.
[email protected]:/home/rgraves# hdparm -I /dev/sda | grep Sector\ size
Logical Sector size: 512 bytes [ Supported: 256 2048 ]
Physical Sector size: 4096 bytes
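If hdparm isn’t handy, the kernel exposes the same information through lsblk and sysfs. A quick sketch (the device names are just whatever your drives happen to enumerate as):

lsblk -o NAME,PHY-SEC,LOG-SEC,MODEL
cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size

The lsblk columns show the logical (LOG-SEC) and physical (PHY-SEC) sector sizes for every block device at once; the sysfs files give the same numbers for a single drive.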
Out of the box, my SATA drives use 4k physical sectors, set up with 512 byte sector emulation. This means that 8 512-byte sectors are crammed into each 4k sector on the disk. If this sounds a tiny bit like flash, with small sectors crammed into large blocks? Well, yeah. That’s what it’s doing. If you write a 512 byte sector, the drive has to read the 4k sector, modify it in cache, and write it back to disk - twice the operations compared to just laying down a new 4k sector atomically. That should have some impact on performance. I’ve got massive cache on these drives (256MB per drive), and they’re 7200 RPM non-SMR, so… performance should be up there towards the top of what you can find on spinning rust. Let’s get to benchmarking!
I typically do my disk benchmarking with iozone, and this is no exception - I’m using stable version 490 here, built for 64-bit Linux (linux-AMD64).
Converting Sector Sizes
Switching between sector sizes is the sort of thing that requires manufacturer-specific tools. You’ll want SeaChest Lite to do the job here.
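One note before doing anything destructive: the SeaChest tools address drives by their SCSI generic node (/dev/sgN), which doesn’t obviously correspond to the usual /dev/sdX names. If you’re not sure which sg device is which, lsscsi can show the mapping (assuming it’s installed - it’s a separate utility, not part of SeaChest):

lsscsi -g

Each output line lists a disk with both its /dev/sdX node and its matching /dev/sgN node, so you can be sure you’re about to wipe the drive you meant to.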
Switching sector sizes is destructive. It totally destroys all data on the drive.
sudo ./SeaChest_Lite_101_1183_64 -d /dev/sg2 --setSectorSize 4096 --confirm I-understand-this-command-will-erase-all-data-on-the-drive
This will take a while to run. Once you’re done (and probably with a reboot in the middle), you should see 4k sector sizes!
ext4 Lazy Init: Watch Out!
If you create a clean ext4 filesystem and mount it, you may notice some write traffic to what should be an absolutely idle disk. It’s… a bit weird, because nothing is touching the disk. lsof shows nothing, so, what is it?
ext4lazyinit. This writes a bunch of inode table data out to the filesystem on the first mount, after creation. The advantage is that filesystem creation is a ton faster on large volumes. The disadvantage, here, is that it throws off benchmarking. You can see the background writes in iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.08 0.08 0.00 99.83
Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd
sda 10.00 0.00 5.00 0.00 0 5 0
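If you want to be certain it’s lazy init generating that traffic (and not some stray process), the worker shows up in the process list as a kernel thread, and iostat will show when the background writes finally stop. A rough sketch:

ps ax | grep '[e]xt4lazyinit'
iostat -m 5 sda

The first command should show the [ext4lazyinit] kernel worker while the deferred inode table writes are still in flight; the second lets you watch MB_wrtn/s until it settles at zero.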
For benchmarking, you might make your filesystems with a few incantations that say, “No, really, write the whole filesystem.”
sudo mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/sda1
Mount this after creation, and it shouldn’t be writing anything in the background. It does mean that making the filesystem on 12TB drives takes an awful lot longer…
How big a difference? Making the filesystem with defaults and lazy init takes 25 seconds. Making the same filesystem, with the same disk, writing everything, takes 8m46s - or 21x longer. Quite the difference on a large volume!
512e: Out-of-Box Defaults
For the first tests, I’ve created a single partition on a drive, formatted it ext4, and mounted it. Linux does recognize that the IO size ought to be 4096 bytes on this disk, which is helpful - the partitions are properly offset by a multiple of 8 sectors.
[email protected]:/home/rgraves# fdisk -l /dev/sda
Disk /dev/sda: 10.94 TiB, 12000138625024 bytes, 23437770752 sectors
Disk model: ST12000NM001G-2M
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 1CB60717-61D2-A24D-8434-4981BE8A0586
Device Start End Sectors Size Type
/dev/sda1 2048 23437770718 23437768671 10.9T Linux filesystem
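The alignment math checks out: the partition starts at sector 2048, which at 512 bytes per sector is 1MiB - evenly divisible by 4096. If you’d rather have a tool confirm it, parted can do the check (partition 1 on this disk):

sudo parted /dev/sda align-check optimal 1

It reports “1 aligned” when partition 1 starts on a boundary the kernel considers optimally aligned.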
And when you make the filesystem, ext4 defaults to 4k blocks. You can’t actually make an ext4 filesystem with 512 byte blocks - the smallest supported block size is 1024 bytes.
[email protected]:/home/rgraves# mkfs.ext4 /dev/sda1
mke2fs 1.45.5 (07-Jan-2020)
Creating filesystem with 2929721083 4k blocks and 366215168 inodes
Filesystem UUID: 8d1ff5b2-38e3-4dbc-99a3-ba44cc562a4a
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
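If you want to double-check what mkfs actually picked, tune2fs will report the block size of an existing ext4 filesystem:

sudo tune2fs -l /dev/sda1 | grep 'Block size'

On this filesystem that should come back as 4096, matching the drive’s physical sector size.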
So, in theory, there should be no difference between 512 byte blocks and 4k blocks, because the filesystem is treating blocks as 4k anyway. Right?
4kn: Native Sectors
After some fiddling about, we’ve got 4k sectors!
[email protected]:~$ sudo hdparm -I /dev/sdc | grep -i Sector
sectors/track 63 63
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 2929721344
Logical Sector size: 4096 bytes [ Supported: 256 2048 ]
Physical Sector size: 4096 bytes
Logical Sector-0 offset: 0 bytes
R/W multiple sector transfer: Max = 16 Current = 16
[email protected]:/home/rgraves# fdisk -l /dev/sdc
Disk /dev/sdc: 10.94 TiB, 12000138625024 bytes, 2929721344 sectors
Disk model: ST12000NM001G-2M
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 8CED002E-6A15-FC4F-A09F-92D494861787
Device Start End Sectors Size Type
/dev/sdc1 256 2929721338 2929721083 10.9T Linux filesystem
One filesystem later…
[email protected]:/home/rgraves# mkfs.ext4 /dev/sdc1
mke2fs 1.45.5 (07-Jan-2020)
Creating filesystem with 2929721083 4k blocks and 366215168 inodes
Filesystem UUID: 923ac932-55d8-4258-83a2-c4595b993150
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
Benchmarking: Basic Tests
For the first set of tests, I’m just going to do my regular iozone disk benchmark - test some write speeds, read speeds, random walk speeds - but with small record sizes.
./iozone -f /mnt/512/-e -I -a -s 2G -r 512 -r 4 -r 1M -i 0 -i 1 -i 2
And the first problem is that you can’t actually do 1k record block writes on 4k drives. Whoops! Something to be learned here, probably…
Error writing block 0, fd= 3
write: Invalid argument
iozone: interrupted
exiting iozone
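That “Invalid argument” is direct I/O at work: the -I flag makes iozone open its test file with O_DIRECT, and direct I/O requires the transfer size (and buffer alignment) to be a multiple of the drive’s logical sector size - so sub-4k records fail on a 4kn drive, while the 512e drive accepts them. You can reproduce the same behavior without iozone using dd; a sketch, with /mnt/4kn standing in for wherever the 4kn filesystem is mounted:

# /mnt/4kn is just an example mount point for the 4kn-formatted drive
dd if=/dev/zero of=/mnt/4kn/ddtest bs=1024 count=1 oflag=direct
dd if=/dev/zero of=/mnt/4kn/ddtest bs=4096 count=1 oflag=direct

The 1024-byte direct write fails with “Invalid argument” on the 4kn drive; the 4096-byte one goes through, since it lines up with the logical sector size.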
So, continuing on with the 4k and 1M tests… actually, letting them run. It’s painfully slow.
Benchmark Results
Remember: My theory going in was that the filesystem uses 4k blocks, and everything is properly aligned, so there shouldn’t be a meaningful difference.
Does that hold up? Well, no. Not at all. I wasn’t able to run the 1kb block benchmarks because they didn’t work on the 4k sector drive (reasonably), but using 4kb blocks… there’s an awfully big difference here. This is single threaded benchmarking, but there is consistently a huge lead going to the 4k sector drive here on 4kb block transfers. On all these charts, I’ve expanded the “random” results by 10x so they’re actually visible on the chart. Spinning disks still suck at random IO!
Once the block sizes got larger, to 32kb, performance between the two sector sizes was a lot closer - there’s less of a difference here, sometimes things go the other way (the 512 byte sectors being a hair faster), and in general it doesn’t matter nearly as much.
How about the multithreaded (4 threads) throughput benchmarks? Here, even with 4kb block sizes, there’s just not a huge difference - and, again, where there is a difference, it goes both ways depending on the test.
But those small block accesses… there’s a definite difference. And, of course, an awful lot of accesses to a standard filesystem are small block accesses.
A more comprehensive test suite might have benchmarked all the drives in both modes, averaging results, and might have tested RAID performance with both sector sizes - but as this is a server I was trying to get deployed, I did none of that. I got enough data for my conclusion and went on to deploying the server.
Conclusions: Use 4k Sectors!
As far as I’m concerned, the conclusions here are pretty clear. If you’ve got a modern operating system that can handle 4k sectors, and your drives support operating either as 512 byte or 4k sectors, convert your drives to 4k native sectors before doing anything else. Then go on your way and let the OS deal with it.
Your Own Server?
So… the other obvious question here would be about why on earth I’m benchmarking drives for my server, when, of course, there’s The Cloud. Yeah… anyway. About that.
After a glorious few years of having hosted stuff in Teh Cloudz, which is definitely not a cloud of wildfire smoke, I decided that I really needed to go back to hosting the stuff I cared about on a server I owned, in a local datacenter. This has been the case for the bulk of my adult life, so it’s nothing foreign to me. But the internet is going off in some very weird directions lately, and I’ve come to the conclusion that the cloud is simply more risk than I’m willing to tolerate for my own hosting. While I don’t expect myself to say anything that will get me summarily booted off the internet… I don’t actually keep up with what the whims of that crowd are, and I’d rather not have to deal with keeping up with it. Further, I’m simply no longer interested in routing a lot of funding to the cloud providers. So, local hosting it is once again. It was a nice few years of cloud!
I’ve been striving to use the internet less and less these days, and I’m pretty sure it’s a healthy thing. Most of my use lately is just reading - not engaging via social apps or websites. I communicate what I care to say via this blog, the forum I run (see link at the top of the blog), and a few other forums, though I find my engagement with them dropping over time as they’ve become less “discussion” forums and more “current thought orthodoxy enforcement chambers.” The latter just isn’t interesting, nor a good use of my time.
Twenty Years Retrospective
Since I know an awful lot of sites are going to be thinking back 20 years today, I figure I should as well. I was an undergrad. We had a lot of computers in our dorm room - and an awful lot of debate over what ought, properly, to be considered a computer. If you’re going to count the Compaq Portable, which clearly is a computer, then one must also count the Palm Pilot, which is far more powerful. How about a TI-83? The TI-89 has to count, because it’s also a good bit more powerful than the Compaq…
It was a different time, and our technology was a lot more DIY and hacked - which I certainly think is a good thing. Tux, on a K’Nex toilet? Just another Friday night! The Windows vs Linux wars were in full force, Mac OS X was just starting to be a thing (one that was often made fun of, as it was… really, really “lickable” in early versions), and virtualization was a mainframe thing, not a normal person thing.
This meant that if you wanted to run a variety of OSes, you needed a variety of computers. Fortunately, older computers were fairly cheap, and the local university surplus sale didn’t help matters any. CRTs were, of course, the only option for desktops. So I had a desk that ended up looking something like this:
Yes. The giant OSB box is a computer. I don’t think it was named “root” quite yet, but at some later point, it did manage the Windows domain name of “ROOT” on the local network, and since part of the size was due to the giant rack of hard drives in it, the number of gigabytes of shared content was impressive. I think at one point, I had close to a third of a TB of storage in it!
The plywood behemoth was a dual P3 866 (overclocked 650s, tape on… was it pin A13? I can’t actually find it now), 640MB of RAM, dual monitors, SCSI… the works. There was a Mac of some variety to the right of the monitors (running the precariously balanced monitor on top), a Powerbook 500 series, a Palm Pilot, and then the stack.
The dryer tubes on top? Those ran to a fan box over at the window, sucking in cool ambient outside air to chill my corner. There was a lot of heat to deal with over there…
Behind me, wedged in a corner, was The Stack. A pair of Mac SE/30s with network cards, and if I recall properly, some Compaq Proliants - one early Pentium, one 486, running Windows NT 4, doing… something or other. Getting me in academic trouble, mostly, as I recall (if you use an open Apple/Windows fileshare to transfer homework around, other people can find it too).
Those SE/30s, ancient though they were (16MHz 68030s, 8 or 12 MB of RAM, a couple hundred megabytes of hard drive), ran one of the BSDs (pretty sure it was NetBSD), were my IRC servers, and, of course, hosted a webserver.
[Images: “Bye-bye...” / “So I sez to him... The real way that it should be done is to...”]
On occasion, people would hotlink images I hosted on them, which of course was strongly discouraged by… well, um… you know how poorly forum software of the time handled huge images? People very quickly got the message about hotlinking.
We also did things like stuff random AMD boards into old SGI Indigo cases - I believe this one made some LAN party appearances. Don’t worry, it was spare parts - I made a working Indigo or two out of a bunch of them, sold those, and used the rest as art. I still have some dead boards from them in my office. This was before I’d made an Indigo 2 my primary desktop (my whole “use obsolete computers as desktops” thing is far from new).
And the internet? Well, it was always-on in the dorms - but it was wired. Wireless wasn’t a thing yet. Even laptops were “networkless while on the go” - the Powerbook 500 series had network, certainly, but only when I was at a desk and plugged in. When I was using it for notes or programming or such, it just ran as a standalone, disconnected box, and this was entirely normal.
We had the early versions of what would be considered “social media” - but they hadn’t yet really grown fangs. LiveJournal was the popular platform among my social group (chronologically ordered pages that ended with a “you have to explicitly load the next page” sort of thing, not an endless scroll), cell phones were fine for making calls, if rather more expensive than landlines (which every dorm room had), and the internet certainly hadn’t become the default intermediator of every aspect of life (as it more or less is today).
In the past 20 years, we’ve gained an awful lot of technical capability, wireless has gone from “almost not a thing” to “the default for most people,” computers have gone multi-core and massively so (my dual P3 was a true rarity back in 2001), storage is vastly larger, but… fundamentally, most of what I use computers for today, I used far lower performance computers for twenty years ago. Talking to people online (AIM and IRC were the big ones, AIM is dead, IRC is unchanged), coding, writing, blogging, hosting web content… we did all of this, just fine, on far slower computers. I mean, I ran an IRC network on a pair of 16MHz in-order processors (though, admittedly, you could do that today - the ircd options are pretty much unchanged).
Since then, a few companies have gotten really, really good at turning monopolized attention into insane torrents of cash (which, indeed, solve most known problems - except the ones caused by torrents of cash). We’ve made cellular data and tiny screens the main way in which people interact with content (being, of course, turned from good consumers of material goods to good consumers of content too - I played plenty of computer games but certainly hadn’t wrapped my identity around being a “gamer,” despite showing up at plenty of very sketchy LAN parties). We’ve mostly eliminated keyboards as the way of inputting text (voice is decent, though I’m entirely uncomfortable with remote servers doing the processing of my voice for text extraction and who knows what else extraction - voice analysis for stress and such being a reasonably refined science). And we’ve more or less ruined the concept of attention, focus, and various other useful traits in pursuit of yet more addiction.
And I’m tired of it. So I’m trying to figure out ways to go back about 20 years in terms of my interaction with the modern internet, with technology, etc. I’ll stay in the deep weeds for work, and for personal stuff… I’ve been doing an awful lot of evaluation, and there’s plenty more to come.
Oh. Yeah. And twenty years ago, the Taliban controlled Afghanistan. Today? Nothing has changed, except an awful lot of lives lost, and an awful lot of trillions of dollars flushed down the drain.