How to Use a Disk Throughput Tester to Diagnose Slow Storage

Disk Throughput Tester: Tools, Methodology, and Best Practices

Measuring disk throughput accurately is essential for diagnosing storage bottlenecks, validating system performance, and sizing infrastructure. This article covers the tools to use, a step-by-step methodology for reliable results, and practical best practices to make your measurements meaningful and repeatable.

Key Concepts

  • Throughput: The volume of data transferred per second (typically MB/s or GB/s).
  • IOPS: Input/output operations per second; important for small-random workloads.
  • Sequential vs Random: Sequential reads/writes move contiguous blocks and show peak bandwidth; random patterns stress latency and IOPS.
  • Block Size (I/O size): Larger blocks generally yield higher throughput; smaller blocks drive up IOPS demand.
  • Queue Depth: The number of outstanding I/O requests; higher depths can improve throughput on devices that support concurrency.
  • Read vs Write: Some storage performs differently for reads and writes; test both.
  • Warm vs Cold Cache: Cached hits inflate numbers; ensure you measure both cached and uncached conditions.
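These quantities are related: throughput is roughly IOPS multiplied by block size. A minimal shell sketch of the arithmetic (the IOPS and block-size figures are purely illustrative):

```shell
# Throughput (MB/s) ≈ IOPS × block size (bytes) / 1,000,000.
# Illustrative numbers: 20,000 IOPS at 4 KiB blocks.
iops=20000
bs=4096
echo "$(( iops * bs / 1000000 )) MB/s"   # → 81 MB/s
```

This is why a device can post impressive MB/s at 1M blocks yet modest MB/s at 4k blocks: the bottleneck shifts from bandwidth to per-operation overhead.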

Recommended Tools

  • fio — Flexible I/O tester: supports many workloads, scripting, and output formats.
  • dd — Simple sequential read/write checks (useful for quick sanity checks).
  • iozone — Filesystem and file I/O benchmark with varied test types.
  • bonnie++ — Filesystem benchmark focusing on large-file operations.
  • CrystalDiskMark — GUI for Windows, easy sequential/random tests.
  • perf or blktrace (Linux) — For low-level tracing and deeper analysis.
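For the dd-style quick sanity check mentioned above, a hedged sketch (the path /tmp/ddtest is an assumption; point it at a disposable file on the storage under test):

```shell
# Rough sequential write: 1 GiB of zeros, bypassing the page cache
# (oflag=direct) and forcing a final flush (conv=fdatasync).
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct conv=fdatasync
# Rough sequential read of the same file, again bypassing the cache.
dd if=/tmp/ddtest of=/dev/null bs=1M iflag=direct
rm /tmp/ddtest
```

dd prints its throughput estimate to stderr. Note that O_DIRECT is not supported on every filesystem (tmpfs, for example), so run this against real storage.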

Test Environment Preparation

  1. Isolate the device: Run tests on an unmounted raw block device where possible to avoid filesystem effects, unless filesystem performance is what you intend to measure.
  2. Ensure reproducible state: Reboot or flush caches between test sets when needed.
  3. Disable background jobs: Stop backups, indexing, antivirus scans, and other I/O-heavy services.
  4. Record system specs: CPU, RAM, OS, kernel version, storage controller, device model, firmware, RAID config.
  5. Measure baseline idle: Capture baseline I/O and CPU while idle (iostat, vmstat, top).
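Steps 2 and 5 can be scripted on Linux; a sketch, assuming root access for the cache drop and sysstat's iostat installed:

```shell
# Flush dirty pages, then drop page cache, dentries, and inodes (root required).
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
# Idle baseline: extended per-device stats, 1 s interval, 5 samples.
iostat -x 1 5 > baseline-iostat.txt
vmstat 1 5 > baseline-vmstat.txt
```

Keep the baseline files alongside your results so later anomalies can be compared against the idle state.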

Methodology — Step-by-Step with fio (recommended)

Assumption: Linux environment, device at /dev/sdx. Adjust paths and sizes for your setup.

  1. Prepare a test file or raw device

    • For raw device: ensure it’s not mounted and you have backups.
    • For file tests: create a file of appropriate size (≥ 2× RAM) to avoid caching.
  2. Test sequential read

    • fio job example:

      Code

      [seq-read]
      rw=read
      bs=1M
      ioengine=libaio
      direct=1
      size=10G
      runtime=60
      numjobs=1
      group_reporting
    • Run multiple times, increasing numjobs and queue depth to see scaling.
  3. Test sequential write

    • Same as read but rw=write. For safety, use a disposable device/file.
  4. Test random read/write (small block)

    • Typical settings:

      Code

      [rand-read]
      rw=randread
      bs=4k
      iodepth=32
      size=10G
      runtime=60
      numjobs=4
      direct=1
    • Repeat for randwrite and mixed (rw=randrw with rwmixread=70).
  5. Vary parameters systematically

    • Block sizes: 4k, 16k, 64k, 256k, 1M.
    • Queue depths: 1, 4, 8, 16, 32, 64.
    • Number of jobs: 1, 2, 4, 8.
  6. Record metrics

    • Throughput (MB/s), IOPS, average/median/max latency, 99th/99.9th percentile latencies, CPU utilization.
  7. Post-test validation

    • Verify no residual caching effects, check device SMART data, and compare results to vendor specs.
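The parameter sweep in step 5 lends itself to a loop. A sketch, assuming a disposable test file at the hypothetical path /mnt/test/fio.dat and JSON output for later parsing:

```shell
# Sweep block sizes and queue depths for random reads; one JSON file per run.
# /mnt/test/fio.dat is a placeholder; size it >= 2x RAM for file-based tests.
TESTFILE=/mnt/test/fio.dat
for bs in 4k 64k 1M; do
  for qd in 1 8 32; do
    fio --name="rr-${bs}-qd${qd}" --filename="$TESTFILE" \
        --rw=randread --bs="$bs" --iodepth="$qd" --direct=1 \
        --ioengine=libaio --size=10G --runtime=60 --time_based \
        --output-format=json --output="rr-${bs}-qd${qd}.json"
  done
done
```

The JSON files can then be collated into the test-matrix table recommended in the report structure below; extend the two loop lists to cover the full matrix.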

Interpreting Results

  • High sequential MB/s close to the device's rated spec indicates the available bandwidth is saturated.
  • High IOPS with low latencies for small-block random tests indicates good transactional performance.
  • Rising tail latencies (p99/p999) point to congestion or firmware issues even if average latency looks fine.
  • If throughput doesn’t scale with increased queue depth or jobs, controller, driver, or device limits may be present.

Common Pitfalls

  • Testing with cached I/O (not using direct I/O) — inflates numbers.
  • Using test file smaller than RAM — measures cache, not disk.
  • Running on mounted filesystem without accounting for filesystem effects.
  • Single-run conclusions — variability requires multiple runs.
  • Ignoring mixed-workload patterns that reflect real usage.

Best Practices

  • Use direct I/O (direct=1 in fio) to bypass page cache when measuring raw device performance.
  • Make test file size ≥ 2× RAM for file-based tests.
  • Run each test multiple times and report median plus variance.
  • Include latency percentiles (p95, p99, p999) alongside throughput.
  • Test real-world workload profiles (mixtures of read/write, burstiness, and think time).
  • Automate and script tests for consistency (bash, Ansible, or CI pipelines).
  • Compare with vendor specs and document firmware/driver versions.
  • Use monitoring (iostat, blktrace) concurrently to spot bottlenecks outside the disk (CPU, network, controller).
  • For cloud disks, test across instance types and AZs, and expect noisy neighbors—report ranges.
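Reporting the median across repeated runs, as recommended above, needs only standard tools. A sketch, assuming one MB/s figure per line in a hypothetical results.txt:

```shell
# Median of per-run throughput figures (MB/s), one number per line.
sort -n results.txt | awk '{ a[NR] = $1 }
  END { if (NR % 2) print a[(NR + 1) / 2];
        else printf "%.1f\n", (a[NR/2] + a[NR/2 + 1]) / 2 }'
```

The same pattern extends to min/max or a spread figure; the point is to script the summary so every test set is reduced identically.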

Example Report Structure

  • Test objective and environment specs
  • Tool and exact command lines used
  • Test matrix (block sizes, qdepths, jobs) in a table
  • Results: throughput, IOPS, latency percentiles per test (table or CSV)
  • Analysis: bottlenecks and actionable recommendations
  • Reproducibility notes and next steps

Quick Reference fio Command Examples

  • Sequential read:

    Code

    fio --name=seq-read --rw=read --bs=1M --size=10G --direct=1 --ioengine=libaio --runtime=60 --numjobs=1 --group_reporting
  • Random 4k mixed:

    Code

    fio --name=randmix --rw=randrw --bs=4k --rwmixread=70 --size=10G --iodepth=32 --numjobs=4 --direct=1 --runtime=60 --group_reporting

Conclusion

Accurate disk throughput measurement combines the right tools, a controlled methodology, and disciplined reporting. Use fio for flexible, scriptable tests, vary block sizes and queue depths to reveal different bottlenecks, record latency percentiles, and repeat tests to ensure reliability. Document environment and commands so results are reproducible and actionable.
