mission:log:2011:01:24:4k-partition-alignment-primer [2012-02-18 23:16] – chrono | mission:log:2011:01:24:4k-partition-alignment-primer [2013-06-30 12:50] (current) – [Conclusion] chrono | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== 4k HDD Partition Alignment Primer ====== | ||
+ | Although HDD storage densities have increased dramatically over the years, one of the most elemental aspects of hard disk drive design, the logical block format size known as a sector, has remained constant. Beginning in late 2009, accelerating in 2010 and hitting mainstream in 2011, all major manufacturers are migrating away from the legacy sector size of 512 bytes to a larger, more efficient sector size of 4096 bytes, generally referred to as 4k or AF (Advanced Format). | ||
+ | |||
+ | While researching the benefits and consequences of a 512->4k transition, many reports of " | ||
+ | |||
+ | |||
+ | ===== Test Setup ===== | ||
+ | |||
+ | The following test setup was used to verify the partition misalignment impact introduced by 4k and to evaluate the performance/ | ||
+ | |||
+ | * MS-Tech LP-06U USB 3.0 PCIe (1x) Controller (NEC Chip)[(As of Jan 2011 there is only one USB 3.0 Controller available, the NEC D720200.)] | ||
+ | * Sharkoon SATA QuickPort Duo USB 3.0 V2[(The Sharkoon USB 3.0 to SATA II bridge (JMicron chip) has two slots for two drives. So far it has been impossible to use both at the same time: as soon as two disks are inserted, Linux sees either a random one or none at all. For all tests only one drive was inserted at a time; the other slot was left empty.)] | ||
+ | * Western Digital Caviar Green 2TB Desktop (WD20EARS-00MVWB0)[(There are currently two versions of this model: this version with 3x 667GB platters and an older one with 4x 500GB platters. The 4-platter disk is supposed to be much slower. To identify it, look at the bottom of the drive: its casing is not indented, which leaves room for the 4th platter.)] | ||
+ | * Platters: 3x 667GB | ||
+ | * Cache: 64MB | ||
+ | * Speed: Dynamic 5400 - 7200 RPM (IntelliPower) | ||
+ | * Max Power Rating: 5V@0.7A / 12V@0.55A | ||
+ | * Weight: 0.730Kg | ||
+ | * Production Date: Sep 2010 | ||
+ | * Made in: Malaysia | ||
+ | * Samsung SpinPoint EcoGreen F4 2TB (HD204UI)[(Samsung's HD204UI default firmware suffers from a very nasty bug: if the drive receives an " | ||
+ | * Platters: 3x 667GB | ||
+ | * Cache: 32MB | ||
+ | * Speed: 5400 RPM | ||
+ | * Max Power Rating: 5V@0.85A / 12V@0.5A | ||
+ | * Weight: 0.650Kg | ||
+ | * Production Date: Dec 2010 | ||
+ | * Made in: China | ||
+ | * Gentoo-Kernel 2.6.37 (non-genkernel)[(USB 3.0 performs pretty well with 2.6.37, but it is a pain with anything older due to power management issues that can freeze the entire system when the USB device is woken up.)] | ||
+ | |||
+ | ~~REFNOTES~~ | ||
+ | |||
+ | |||
+ | |||
+ | ===== Partition Alignment ===== | ||
+ | |||
+ | So why is a 4k sector size such a bad thing that we suddenly have to take care of partition alignment on rotational disks too? In fact, it isn't. The culprit is the 512-byte emulation the industry was forced to implement so that some operating systems, such as Microsoft Windows, can handle the disks at all. | ||
+ | |||
+ | Both disk drives in this test have a physical sector size of 4k but present 512-byte sectors to the OS, so degraded performance will result when the drive' | ||
+ | |||
+ | The key to misalignment lies in the partition table, which occupies either 512 bytes (LBA 0) in the case of a legacy msdos-type MBR or LBA 0-33 for the primary GUID Partition Table (GPT). | ||
+ | |||
+ | <WRAP round important> | ||
+ | |||
+ | In order to align the 4k logical blocks with the physical 4k sectors on the platter, the sectors following the partition table have to be left empty until a sector number divisible by 8 is reached (sector 8 for msdos and sector 40 for GPT). | ||
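
The rule above is plain round-up arithmetic: one 4k block holds 8 sectors of 512 bytes, so the first usable start is the next multiple of 8 after the partition table. A minimal sketch, assuming 512-byte emulated sectors; the function name `next_aligned` is made up for illustration:

```shell
#!/bin/sh
# next_aligned SECTOR [SECTORS_PER_BLOCK]
# Round a 512-byte sector number up to the next multiple of
# SECTORS_PER_BLOCK (default 8, because 8 * 512 = 4096 bytes).
# The function name is hypothetical, not part of any tool.
next_aligned() {
    sector=$1
    per_block=${2:-8}
    echo $(( (sector + per_block - 1) / per_block * per_block ))
}

next_aligned 1    # msdos: the MBR occupies LBA 0, first free sector is 1
next_aligned 34   # GPT: the table occupies LBA 0-33, first free sector is 34
```

The two calls print 8 and 40, matching the start sectors named above.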
+ | |||
+ | Until the industry finally lets go of 512-byte compatibility in favor of native 4k, people need to be aware of this issue. Otherwise they may suffer a heavy performance penalty by partitioning disks the way they were used to, without proper alignment. | ||
+ | |||
+ | If you want to test your own hardware to verify these results, go ahead with the next section: | ||
+ | ==== Benchmark Code ==== | ||
+ | |||
+ | The following code was offered by no.op on gentoo-forums [[http:// | ||
+ | |||
+ | <sxh c> | ||
+ | #define _FILE_OFFSET_BITS 64 | ||
+ | |||
+ | #include < | ||
+ | #include < | ||
+ | #include < | ||
+ | #include < | ||
+ | #include < | ||
+ | |||
+ | unsigned char buffer[4096]; | ||
+ | |||
+ | int main(int argc, char **argv) | ||
+ | { | ||
+ | int fd; | ||
+ | int opt; | ||
+ | off_t off; | ||
+ | off_t base = 0; | ||
+ | off_t stride = sizeof buffer; | ||
+ | | ||
+ | const char *device = NULL; | ||
+ | int do_sync = 0; | ||
+ | |||
+ | while ((opt = getopt(argc, | ||
+ | switch (opt) { | ||
+ | case ' | ||
+ | | ||
+ | | ||
+ | case ' | ||
+ | base = atoll(optarg) * 512; | ||
+ | | ||
+ | case ' | ||
+ | | ||
+ | | ||
+ | case ' | ||
+ | size = atoll(optarg); | ||
+ | | ||
+ | case ' | ||
+ | | ||
+ | | ||
+ | default: | ||
+ | | ||
+ | "[-b <base sector>] [-i < | ||
+ | "[-s < | ||
+ | | ||
+ | } | ||
+ | } | ||
+ | if (device == NULL) { | ||
+ | fprintf(stderr, | ||
+ | return 1; | ||
+ | } | ||
+ | fd = open(device, | ||
+ | if (fd < 0) { | ||
+ | perror(" | ||
+ | return 1; | ||
+ | } | ||
+ | |||
+ | |||
+ | off = base; | ||
+ | | ||
+ | " | ||
+ | device, base / 512ll, (long long)stride, | ||
+ | do_sync ? " | ||
+ | while (size > 0) { | ||
+ | if (lseek(fd, off, SEEK_SET) == (off_t)-1) { | ||
+ | | ||
+ | | ||
+ | | ||
+ | } | ||
+ | if (write(fd, buffer, sizeof buffer) != sizeof buffer) { | ||
+ | | ||
+ | | ||
+ | | ||
+ | } | ||
+ | if (do_sync) | ||
+ | if (fdatasync(fd) < 0) { | ||
+ | perror(" | ||
+ | close(fd); | ||
+ | return 1; | ||
+ | } | ||
+ | off += stride; | ||
+ | size -= sizeof buffer; | ||
+ | } | ||
+ | | ||
+ | | ||
+ | } | ||
+ | </sxh> | ||
+ | |||
+ | ==== Compile ==== | ||
+ | |||
+ | <code bash> | ||
+ | CFLAGS=" | ||
+ | </code> | ||
+ | |||
+ | ==== Run benchmark ==== | ||
+ | |||
+ | <WRAP round alert> | ||
+ | The following tests **will erase the complete disk**. | ||
+ | Only continue if you are confident that you know what you are doing. Take special care to check that your HDD really is /dev/sda, or change the -d option according to your own setup! | ||
+ | </WRAP> | ||
+ | |||
+ | <code bash> | ||
+ | $ time ./ | ||
+ | </code> | ||
+ | |||
+ | <code bash> | ||
+ | part-align-bench: | ||
+ | |||
+ | real 1m20.195s | ||
+ | user 0m0.172s | ||
+ | sys 0m6.973s | ||
+ | </code> | ||
+ | |||
+ | <code bash> | ||
+ | $ time ./ | ||
+ | </code> | ||
+ | |||
+ | <code bash> | ||
+ | part-align-bench: | ||
+ | |||
+ | real 7m51.229s | ||
+ | user 0m0.660s | ||
+ | sys 0m34.763s | ||
+ | </code> | ||
+ | ==== Results ==== | ||
+ | |||
+ | ^Device^Sector 0^Sector 8^Sector 34^Sector 40^Sector 42^ | ||
+ | |WD |12.545s|12.436s|65.792s|12.211s|66.341s| | ||
+ | |HD204UI|10.141s|10.153s|59.064s|10.126s|59.010s| | ||
+ | |||
+ | misaligned cp: | ||
+ | |||
+ | morpheus / # time cp -a /usr /mnt/usb/ | ||
+ | |||
+ | real 19m13.331s | ||
+ | user 0m5.086s | ||
+ | sys | ||
+ | morpheus / # du -sch /usr | ||
+ | 13G /usr | ||
+ | 13G total | ||
+ | |||
+ | ==== Conclusion ==== | ||
+ | |||
+ | Looking at the results, the impact of misaligned partitions is clearly visible: | ||
+ | |||
+ | **WD:** | ||
+ | |||
+ | * aligned: 83MB/s | ||
+ | * misaligned: 15.5MB/s | ||
+ | |||
+ | **Samsung:** | ||
+ | |||
+ | * aligned: 102MB/s | ||
+ | * misaligned: 17MB/s | ||
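
The slowdown factor behind these numbers can be recomputed from the timings in the Results table. A small sketch using `awk` for the floating-point division; the helper name `slowdown` is made up for illustration, and the inputs are the sector 8 and sector 34 columns of the table:

```shell
#!/bin/sh
# slowdown ALIGNED_SECONDS MISALIGNED_SECONDS
# Print how many times longer the misaligned run took, one decimal place.
# The helper name is hypothetical.
slowdown() {
    awk -v a="$1" -v m="$2" 'BEGIN { printf "%.1f\n", m / a }'
}

slowdown 12.436 65.792   # WD, sector 8 vs. sector 34
slowdown 10.153 59.064   # HD204UI, sector 8 vs. sector 34
```

Both drives come out at a factor of more than five.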
+ | |||
+ | <WRAP round important> | ||
+ | |||
+ | Alignment: | ||
+ | |||
+ | * No partition table: write the filesystem directly to /dev/sda, starting at sector 0. | ||
+ | * GPT partition table, one partition: start it at sector 40. | ||
+ | * GPT partition table, multiple partitions: start the first at sector 40; all following partitions must start at sectors divisible by 8. | ||
+ | |||
+ | The msdos partition table/MBR is 512 bytes long, so theoretically sector 8 would be the start sector for the first partition. | ||
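
These rules reduce to a single modulo test on the start sector. The sketch below applies it to the sectors from the Results table; `is_aligned` is a hypothetical helper and 512-byte emulated sectors are assumed:

```shell
#!/bin/sh
# is_aligned START_SECTOR
# Succeed if a start sector (counted in 512-byte units) sits on a
# 4096-byte boundary, i.e. is divisible by 8. Hypothetical helper.
is_aligned() {
    [ $(( $1 % 8 )) -eq 0 ]
}

for start in 0 8 34 40 42; do
    if is_aligned "$start"; then
        echo "sector $start: aligned"
    else
        echo "sector $start: misaligned"
    fi
done
```

Sectors 0, 8 and 40 pass and 34 and 42 fail, which lines up with the fast and slow columns of the benchmark. On Linux the start sector of an existing partition can usually be read from sysfs, for example `/sys/block/sda/sda1/start`, assuming the disk is sda.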
+ | |||
+ | |||
+ | ==== Create Partition ==== | ||
+ | |||
+ | Create a GPT partition table: | ||
+ | |||
+ | <code bash> | ||
+ | $ parted --script /dev/sda mklabel gpt | ||
+ | </code> | ||
+ | |||
+ | Align the primary partition to sector 40 for best performance: | ||
+ | |||
+ | <code bash> | ||
+ | $ time parted --align=min --script /dev/sda mkpart primary 40s 100% | ||
+ | |||
+ | real 0m0.021s | ||
+ | user 0m0.002s | ||
+ | sys 0m0.001s | ||
+ | |||
+ | $ parted --script /dev/sda unit s print | ||
+ | Model: WDC WD20 EARS-00MVWB0 (scsi) | ||
+ | Disk /dev/sda: 3907029168s | ||
+ | Sector size (logical/ | ||
+ | Partition Table: gpt | ||
+ | |||
+ | Number | ||
+ | | ||
+ | </code> | ||
+ | |||
+ | [[http:// | ||
+ | |||
+ | ===== File system Alignment ===== | ||
+ | |||
+ | <code bash> | ||
+ | time mkfs.ext4 -v -b 4096 -E stride=128, | ||
+ | |||
+ | |||
+ | mke2fs 1.41.12 (17-May-2010) | ||
+ | fs_types for mke2fs.conf resolution: ' | ||
+ | Calling BLKDISCARD from 0 to 2000398893056 failed. | ||
+ | Filesystem label= | ||
+ | OS type: Linux | ||
+ | Block size=4096 (log=2) | ||
+ | Fragment size=4096 (log=2) | ||
+ | Stride=128 blocks, Stripe width=128 blocks | ||
+ | 122101760 inodes, 488378636 blocks | ||
+ | 24418931 blocks (5.00%) reserved for the super user | ||
+ | First data block=0 | ||
+ | Maximum filesystem blocks=4294967296 | ||
+ | 14905 block groups | ||
+ | 32768 blocks per group, 32768 fragments per group | ||
+ | 8192 inodes per group | ||
+ | Superblock backups stored on blocks: | ||
+ | 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, | ||
+ | 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, | ||
+ | 102400000, 214990848 | ||
+ | |||
+ | Writing inode tables: done | ||
+ | Writing superblocks and filesystem accounting information: | ||
+ | |||
+ | This filesystem will be automatically checked every 39 mounts or | ||
+ | 180 days, whichever comes first. | ||
+ | |||
+ | real 9m27.444s | ||
+ | user 0m1.696s | ||
+ | sys 0m40.024s | ||
+ | |||
+ | </code> | ||
+ | |||
+ | |||
+ | <code bash> | ||
+ | |||
+ | bonnie++ -d /mnt/usb/ -u root | ||
+ | |||
+ | Version | ||
+ | Concurrency | ||
+ | Machine | ||
+ | morpheus | ||
+ | Latency | ||
+ | Version | ||
+ | morpheus | ||
+ | files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP | ||
+ | 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ | ||
+ | Latency | ||
+ | </code> | ||
+ | |||
+ | tuned mount options: | ||
+ | |||
+ | < | ||
+ | |||
+ | </ | ||
+ | |||
+ | <code bash> | ||
+ | Version | ||
+ | Concurrency | ||
+ | Machine | ||
+ | morpheus | ||
+ | Latency | ||
+ | Version | ||
+ | morpheus | ||
+ | files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP | ||
+ | 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ | ||
+ | Latency | ||
+ | 1.96, | ||
+ | s, | ||
+ | </code> | ||
+ | |||
+ | Default mkfs.ext4 and mount with no options: | ||
+ | |||
+ | <code bash> | ||
+ | Version | ||
+ | Concurrency | ||
+ | Machine | ||
+ | morpheus | ||
+ | Latency | ||
+ | Version | ||
+ | morpheus | ||
+ | files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP | ||
+ | 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ | ||
+ | Latency | ||
+ | 1.96, | ||
+ | </code> | ||
+ | |||
+ | mkfs.ext4 -v -b 4096 -m 0 -E stride=16, | ||
+ | |||
+ | <code bash> | ||
+ | Version | ||
+ | Concurrency | ||
+ | Machine | ||
+ | morpheus | ||
+ | Latency | ||
+ | Version | ||
+ | morpheus | ||
+ | files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP | ||
+ | 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ | ||
+ | Latency | ||
+ | 1.96, | ||
+ | |||
+ | </code> | ||
+ | |||
+ | |||
+ | Largefile support | ||
+ | |||
+ | With the following options mkfs is very fast (about 19 seconds instead of about 9.5 minutes): | ||
+ | |||
+ | |||
+ | < | ||
+ | morpheus / # time mkfs.ext4 -v -m 0 -T largefile4 -O ^has_journal, | ||
+ | mke2fs 1.41.12 (17-May-2010) | ||
+ | fs_types for mke2fs.conf resolution: ' | ||
+ | Calling BLKDISCARD from 0 to 2000398893056 failed. | ||
+ | Filesystem label= | ||
+ | OS type: Linux | ||
+ | Block size=4096 (log=2) | ||
+ | Fragment size=4096 (log=2) | ||
+ | Stride=0 blocks, Stripe width=0 blocks | ||
+ | 476960 inodes, 488378636 blocks | ||
+ | 0 blocks (0.00%) reserved for the super user | ||
+ | First data block=0 | ||
+ | Maximum filesystem blocks=4294967296 | ||
+ | 14905 block groups | ||
+ | 32768 blocks per group, 32768 fragments per group | ||
+ | 32 inodes per group | ||
+ | Superblock backups stored on blocks: | ||
+ | 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, | ||
+ | 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, | ||
+ | 102400000, 214990848 | ||
+ | |||
+ | Writing inode tables: done | ||
+ | Writing superblocks and filesystem accounting information: | ||
+ | |||
+ | This filesystem will be automatically checked every 33 mounts or | ||
+ | 180 days, whichever comes first. | ||
+ | |||
+ | real 0m19.368s | ||
+ | user 0m0.089s | ||
+ | sys | ||
+ | </ | ||
+ | |||
+ | mount -o noatime, | ||
+ | |||
+ | wd: | ||
+ | |||
+ | < | ||
+ | Version | ||
+ | Concurrency | ||
+ | Machine | ||
+ | morpheus | ||
+ | Latency | ||
+ | Version | ||
+ | morpheus | ||
+ | files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP | ||
+ | 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ | ||
+ | Latency | ||
+ | 1.96, | ||
+ | </ | ||
+ | |||
+ | samsung: | ||
+ | |||
+ | < | ||
+ | Version | ||
+ | Concurrency | ||
+ | Machine | ||
+ | morpheus | ||
+ | Latency | ||
+ | Version | ||
+ | morpheus | ||
+ | files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP | ||
+ | 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ | ||
+ | Latency | ||
+ | 1.96, | ||
+ | </ | ||
+ | |||
+ | ==== USB Mass Storage Tuning ==== | ||
+ | |||
+ | Linux 2.6 gives you the ability to see and to change the max_sectors value for each USB storage device, independently. Assuming you have a sysfs filesystem mounted on /sys and assuming /dev/sda is a USB drive, you can see the max_sectors value for /dev/sda simply by running: | ||
+ | |||
+ | <code bash> | ||
+ | $ cat / | ||
+ | </code> | ||
+ | |||
+ | and you can set max_sectors to 2048 by running (as root): | ||
+ | |||
+ | <code bash> | ||
+ | $ echo 2048 > / | ||
+ | </code> | ||
+ | |||
+ | Values should be positive multiples of 8 (16 on the Alpha and other 64-bit platforms). There is no upper limit, but you probably shouldn' | ||
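
Before writing a new value it can help to check it against the multiple-of-8 rule and to see what it means in transfer size. A minimal sketch, assuming max_sectors counts 512-byte sectors; the function name `check_max_sectors` is made up, and the 16-sector rule mentioned for some platforms is not covered:

```shell
#!/bin/sh
# check_max_sectors VALUE
# Print the per-command transfer size in KiB for a max_sectors value,
# or fail if the value is not a positive multiple of 8.
# Assumption: max_sectors counts 512-byte sectors. Hypothetical helper.
check_max_sectors() {
    value=$1
    if [ "$value" -le 0 ] || [ $(( value % 8 )) -ne 0 ]; then
        echo "max_sectors=$value: not a positive multiple of 8" >&2
        return 1
    fi
    echo "max_sectors=$value -> $(( value * 512 / 1024 )) KiB per transfer"
}

check_max_sectors 2048
check_max_sectors 64
```

Under that assumption the value 2048 used above corresponds to 1024 KiB per transfer.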
+ | |||
+ | |||
+ | wd: | ||
+ | |||
+ | < | ||
+ | Version | ||
+ | Concurrency | ||
+ | Machine | ||
+ | morpheus | ||
+ | Latency | ||
+ | Version | ||
+ | morpheus | ||
+ | files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP | ||
+ | 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ | ||
+ | Latency | ||
+ | 1.96, | ||
+ | </ | ||
+ | |||
+ | samsung: | ||
+ | |||
+ | < | ||
+ | |||
+ | Version | ||
+ | Concurrency | ||
+ | Machine | ||
+ | morpheus | ||
+ | Latency | ||
+ | Version | ||
+ | morpheus | ||
+ | files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP | ||
+ | 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ | ||
+ | Latency | ||
+ | 1.96, | ||
+ | </ | ||
+ | |||
+ | |||
+ | ===== USB 3.0 Performance ===== | ||
+ | |||
+ | samsung: | ||
+ | |||
+ | write: | ||
+ | |||
+ | <code bash> | ||
+ | $ time dd if=/ | ||
+ | 10000000+0 records in | ||
+ | 10000000+0 records out | ||
+ | 40960000000 bytes (41 GB) copied, 369.997 s, 111 MB/s | ||
+ | |||
+ | real 6m10.016s | ||
+ | user 0m1.317s | ||
+ | </code> | ||
+ | |||
+ | read: | ||
+ | |||
+ | <code bash> | ||
+ | $ time dd if=/ | ||
+ | 625000+0 records in | ||
+ | 625000+0 records out | ||
+ | 40960000000 bytes (41 GB) copied, 312.399 s, 131 MB/s | ||
+ | |||
+ | real 5m12.446s | ||
+ | user 0m0.252s | ||
+ | sys 0m34.811s | ||
+ | </code> | ||
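
The rates dd prints can be cross-checked from the byte count and the elapsed time. The sketch below assumes dd's decimal convention (1 MB = 1000000 bytes), which the "41 GB" figure for 40960000000 bytes suggests; `throughput_mb` is a hypothetical helper name:

```shell
#!/bin/sh
# throughput_mb BYTES SECONDS
# Print the transfer rate in decimal MB/s, rounded to a whole number.
# Hypothetical helper; uses awk for the floating-point division.
throughput_mb() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1000000 }'
}

throughput_mb 40960000000 369.997   # write run
throughput_mb 40960000000 312.399   # read run

# Block size implied by the record counts of the two runs
echo $(( 40960000000 / 10000000 ))  # write: bytes per record
echo $(( 40960000000 / 625000 ))    # read: bytes per record
```

The first two calls reproduce the 111 MB/s and 131 MB/s that dd reported; the last two show that the write run used 4096-byte records and the read run 65536-byte records.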
+ | |||
+ | |||
+ | {{tag> | ||
+ | |||
+ | |||
+ | ~~DISCUSSION~~ |