ZFS Pool Defragmentation: A Comprehensive Guide

ZFS is a powerful filesystem that offers advanced features like data integrity verification, snapshots, and efficient storage management. However, over time, ZFS pools can become fragmented, which may impact performance. This guide explains what ZFS fragmentation is and provides a systematic approach to defragmentation.

Understanding ZFS Fragmentation

What is ZFS Fragmentation?

In ZFS, fragmentation occurs when free space in a pool becomes broken up into smaller, non-contiguous chunks. This typically happens over time as files are written, deleted, and modified.

The FRAG column in zpool list output shows the fragmentation of the pool's free space (not of individual files):

NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank  5.20T  2.51T  2.70T        -         -    32%    48%  1.00x    ONLINE  -
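
The FRAG number is an aggregate across the pool's metaslabs. If you want to see where free space is actually fragmented, zdb can dump per-metaslab statistics. This is a hedged sketch: zdb is read-only and safe to run, but its output format varies between OpenZFS versions, and the pool name tank is just the example above:

# Show the offset, spacemap, and free space of each metaslab (requires root)
zdb -m tank | less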

Why Fragmentation Matters

High fragmentation can impact performance in several ways:

  1. Write Performance: ZFS attempts to allocate contiguous blocks for new writes. With high fragmentation, finding large enough contiguous regions becomes harder.
  2. Sequential Read Performance: Fragmented files may require more disk seeks, particularly impacting spinning disk (HDD) performance.
  3. Space Allocation: Severe fragmentation can sometimes make it difficult to allocate space for large files even when the pool shows adequate free space.

When to Defragment

Consider defragmentation when:

  • Fragmentation exceeds 20-30% (a quick check follows this list)
  • You notice degraded write performance
  • Large file allocations fail despite having sufficient free space
  • The pool has gone through cycles of being nearly full then partly emptied
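
A quick way to check the pool-level numbers for every imported pool at once (the script in the prevention section later automates this):

# Name, free-space fragmentation, and capacity for each pool
zpool list -H -o name,frag,cap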

Defragmentation Process

Since ZFS lacks a native "defrag" command, the most effective method is to rewrite the data: migrate it to temporary storage, recreate the pool, and restore. Here's a systematic approach:

Prerequisites

  1. Additional Storage: You'll need temporary storage with capacity equal to or greater than your ZFS pool's used space (measured below).
  2. Backup: Always have backups before major storage operations.
  3. System Downtime: This process requires pool destruction and recreation.
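
To size the temporary storage, measure how much space the pool actually uses, including snapshots and child datasets:

# Total used space on the pool (what the temporary storage must hold)
zfs get -H -o value used original_pool

# Breakdown including snapshot and child-dataset usage
zfs list -o space original_pool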

Step-by-Step Guide

1. Preparation

# Check current fragmentation level
zpool list -o name,size,alloc,free,frag,cap

# Create temporary storage (using an external drive, another pool, etc.)
zpool create temp_pool /dev/sdX
zfs set compression=lz4 temp_pool  # Match original pool settings
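
If the original pool has more locally set properties than just compression, you can copy them over wholesale. This is a sketch that assumes property values contain no whitespace:

# Copy every locally set property from the source pool's root dataset
zfs get -H -s local -o property,value all original_pool |
while read -r prop value; do
  zfs set "$prop=$value" temp_pool
done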

2. Data Migration to Temporary Storage

First, install the pv tool for progress monitoring:

apt-get update && apt-get install -y pv   # Debian/Ubuntu; use your distro's package manager otherwise

Then create your dataset structure and transfer data:

# Pre-create the base dataset structure (the receive commands below use -F,
# so pre-existing empty datasets will be overwritten by the incoming streams)
zfs create temp_pool/containers
zfs create temp_pool/ct
# Create other top-level datasets as needed
zfs create temp_pool/media_data
zfs create temp_pool/transcodes
zfs create temp_pool/vms
# etc.

# Create subdataset structures as needed
for app in jellyfin lidarr nginx portainer prowlarr qbittorrent radarr sonarr; do
  zfs create -p temp_pool/containers/mediaserver/appdata/$app
done

# Transfer top-level datasets first
for ds in $(zfs list -d 1 -o name -H original_pool | grep -v "^original_pool$"); do
  echo "Starting transfer of $ds"
  zfs snapshot $ds@migrate
  zfs send $ds@migrate | pv | zfs receive -F ${ds/original_pool/temp_pool}
  echo "Completed transfer of $ds"
done

# Then transfer child datasets separately
for ds in $(zfs list -r original_pool/containers -o name -H | grep -v "^original_pool/containers$"); do
  child_ds=${ds#original_pool/}
  echo "Starting transfer of $child_ds"
  zfs snapshot $ds@migrate
  zfs send $ds@migrate | pv | zfs receive -F temp_pool/$child_ds
  echo "Completed transfer of $child_ds"
done

# Repeat for other complex hierarchies
for ds in $(zfs list -r original_pool/ct -o name -H | grep -v "^original_pool/ct$"); do
  child_ds=${ds#original_pool/}
  echo "Starting transfer of $child_ds"
  zfs snapshot $ds@migrate
  zfs send $ds@migrate | pv | zfs receive -F temp_pool/$child_ds
  echo "Completed transfer of $child_ds"
done
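
Alternatively, OpenZFS can replicate an entire dataset hierarchy in a single recursive stream. The sketch below replaces all of the per-dataset loops above; it assumes temp_pool's root dataset may be overwritten (-F):

# Snapshot the whole pool recursively, then send one replication stream.
# -R includes all descendant datasets, their properties, and snapshots.
zfs snapshot -r original_pool@migrate
zfs send -R original_pool@migrate | pv | zfs receive -F temp_pool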

3. Verification

# Compare dataset sizes
for ds in $(zfs list -r original_pool -o name -H); do
  orig_size=$(zfs get -H -o value used $ds)
  temp_ds=${ds/original_pool/temp_pool}
  temp_size=$(zfs get -H -o value used $temp_ds 2>/dev/null || echo "N/A")
  printf "%-50s %-15s %-15s\n" "$ds" "$orig_size" "$temp_size"
done

# Check total space usage
orig_used=$(zfs get -H -o value used original_pool)
temp_used=$(zfs get -H -o value used temp_pool)
echo "Original pool used: $orig_used"
echo "Temp pool used: $temp_used"

4. Destroy and Recreate Original Pool

# Export pools for safety
zpool export original_pool
zpool export temp_pool

# Import temp_pool as read-only for safety
zpool import -o readonly=on temp_pool

# Import and destroy original pool
zpool import original_pool
zpool destroy -f original_pool

# Recreate with same layout (adjust to match your vdev configuration)
# Use -f flag if you have different-sized drives in a mirror/vdev
zpool create -f original_pool \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd

# Set original pool properties
zfs set compression=lz4 original_pool
zfs set atime=off original_pool  # Add any other properties
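
Two optional hardening tweaks when recreating the pool: refer to disks by stable IDs so the pool survives /dev/sdX renumbering, and set ashift explicitly if you know your drives' sector size. The device IDs below are placeholders:

# List stable device identifiers
ls -l /dev/disk/by-id/ | grep -v -- -part

# Placeholder IDs; ashift=12 assumes 4K-sector drives
zpool create -f -o ashift=12 original_pool \
  mirror /dev/disk/by-id/ata-DISK_SERIAL_A /dev/disk/by-id/ata-DISK_SERIAL_B \
  mirror /dev/disk/by-id/ata-DISK_SERIAL_C /dev/disk/by-id/ata-DISK_SERIAL_D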

If you encounter issues with pool exports, you might need more aggressive measures:

# Force export if needed
zpool export -f original_pool

# If still stuck, identify processes holding the pool's mountpoint
lsof | grep /original_pool   # adjust the path to your pool's mountpoint

5. Restore Data to Defragmented Pool

# Sending from temp_pool works while it is imported read-only; re-import
# it writable only if you need to modify it first
zpool export temp_pool
zpool import temp_pool

# Create base structure
zfs create original_pool/containers
zfs create original_pool/ct
zfs create original_pool/media_data
# Create other top-level datasets as needed

# Transfer top-level datasets back (adjust the list to match your datasets)
for ds in transcodes retrogames vms; do
  echo "Restoring $ds"
  zfs send temp_pool/$ds@migrate | pv | zfs receive -F original_pool/$ds
done

# Transfer container data back
echo "Restoring containers"
zfs send temp_pool/containers@migrate | pv | zfs receive -F original_pool/containers

# Handle subdatasets separately for each major section
for ds in $(zfs list -r temp_pool/containers -o name -H | grep -v "^temp_pool/containers$"); do
  target_ds=${ds#temp_pool/}
  echo "Restoring $target_ds"
  zfs send $ds@migrate | pv | zfs receive -F original_pool/$target_ds
done

# Repeat for other complex hierarchies
for ds in $(zfs list -r temp_pool/ct -o name -H | grep -v "^temp_pool/ct$"); do
  target_ds=${ds#temp_pool/}
  echo "Restoring $target_ds"
  zfs send $ds@migrate | pv | zfs receive -F original_pool/$target_ds
done

# Transfer largest datasets last
echo "Restoring largest datasets"
zfs send temp_pool/media_data@migrate | pv | zfs receive -F original_pool/media_data
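
Before starting services, confirm that locally set dataset properties (compression, recordsize, mountpoints) survived the round trip: zfs send includes properties only when -p or -R is used, so plain sends like the ones above may drop them. Check explicitly and re-apply anything missing:

# Show every property that was set locally (rather than inherited/default)
zfs get -r -s local -o name,property,value all original_pool

# Re-apply anything that is missing, for example:
# zfs set recordsize=1M original_pool/media_data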

6. Cleaning Up

After verifying everything is working properly:

# Attempt to export the temporary pool
zpool export temp_pool

# If export fails due to busy datasets (canmount applies to filesystems only)
zfs list -r -t filesystem temp_pool -o name -H | xargs -I{} zfs set canmount=noauto {}
zpool export -f temp_pool

# If still having issues, a system reboot might be required
# After reboot, import and destroy the temporary pool
zpool import temp_pool
zpool destroy temp_pool

# Start your services, VMs, containers, etc.
# For example, in a Proxmox environment:
pct list | awk 'NR>1 {print $1}' | xargs -I{} pct start {}
qm list | awk 'NR>1 {print $1}' | xargs -I{} qm start {}

Comprehensive Verification Script

Here's a verification script that checks all aspects of the migration and stops on any errors:

#!/bin/bash

# Define log file
LOG_FILE="/root/defrag_verification.log"

# Function to log and exit on error
log_and_exit_on_error() {
  echo "ERROR: $1" | tee -a "$LOG_FILE"
  echo "Script stopped due to error at $(date)" >> "$LOG_FILE"
  exit 1
}

# Start verification
echo "=== Starting verification at $(date) ===" > "$LOG_FILE"

# 1. Check if both pools exist
echo "Checking pool existence..." | tee -a "$LOG_FILE"
zpool list original_pool >/dev/null 2>&1 || log_and_exit_on_error "Original pool not found"
zpool list temp_pool >/dev/null 2>&1 || log_and_exit_on_error "Temp pool not found"

# 2. Compare overall pool usage
echo "====== POOL SIZE COMPARISON ======" | tee -a "$LOG_FILE"
zpool list original_pool >> "$LOG_FILE" || log_and_exit_on_error "Failed to get original pool info"
zpool list temp_pool >> "$LOG_FILE" || log_and_exit_on_error "Failed to get temp pool info"
echo "" >> "$LOG_FILE"

# 3. Compare dataset sizes
echo "====== DATASET SIZE COMPARISON ======" | tee -a "$LOG_FILE"
printf "%-50s %-15s %-15s\n" "DATASET" "ORIGINAL SIZE" "NEW SIZE" | tee -a "$LOG_FILE"

# Get list of datasets from temp_pool
TEMP_DATASETS=$(zfs list -r temp_pool -o name -H) || log_and_exit_on_error "Failed to list temp_pool datasets"

for ds in $TEMP_DATASETS; do
  orig_size=$(zfs get -H -o value used "$ds") || log_and_exit_on_error "Failed to get size for $ds"
  new_ds=${ds/temp_pool/original_pool}
  new_size=$(zfs get -H -o value used "$new_ds" 2>/dev/null || echo "N/A")
  
  # Check if dataset exists in original_pool
  if [[ "$new_size" == "N/A" ]]; then
    log_and_exit_on_error "Dataset $new_ds doesn't exist in original pool"
  fi
  
  printf "%-50s %-15s %-15s\n" "${ds/temp_pool/original_pool}" "$orig_size" "$new_size" | tee -a "$LOG_FILE"
done

# 4. Check dataset counts
echo "" >> "$LOG_FILE"
echo "====== DATASET COUNT COMPARISON ======" | tee -a "$LOG_FILE"
TEMP_COUNT=$(zfs list -r temp_pool -H | wc -l) || log_and_exit_on_error "Failed to count temp_pool datasets"
ORIG_COUNT=$(zfs list -r original_pool -H | wc -l) || log_and_exit_on_error "Failed to count original_pool datasets"

echo "Original temp_pool datasets: $TEMP_COUNT" | tee -a "$LOG_FILE"
echo "New original_pool datasets: $ORIG_COUNT" | tee -a "$LOG_FILE"

# Check if counts match
if [[ "$TEMP_COUNT" -ne "$ORIG_COUNT" ]]; then
  log_and_exit_on_error "Dataset counts don't match. Some datasets may be missing."
fi

# 5. Check pool properties
echo "" >> "$LOG_FILE"
echo "====== POOL PROPERTIES ======" | tee -a "$LOG_FILE"
zpool get all original_pool | grep -E 'frag|capacity|health|size|allocated|free' >> "$LOG_FILE" || log_and_exit_on_error "Failed to get pool properties"

# 6. Verify fragmentation reduction
FRAG_LEVEL=$(zpool list -H -o frag original_pool)
echo "Current fragmentation level: $FRAG_LEVEL" | tee -a "$LOG_FILE"

# Check if fragmentation is reduced (FRAG reads "-" on pools without the feature)
if [[ "$FRAG_LEVEL" != "-" && "${FRAG_LEVEL%\%}" -gt 10 ]]; then
  echo "WARNING: Fragmentation is still high at $FRAG_LEVEL" | tee -a "$LOG_FILE"
fi

# 7. Start a scrub to verify data integrity
echo "" >> "$LOG_FILE"
echo "====== STARTING DATA INTEGRITY VERIFICATION ======" | tee -a "$LOG_FILE"
zpool scrub original_pool || log_and_exit_on_error "Failed to start scrub on original_pool"
echo "Scrub started at $(date)" | tee -a "$LOG_FILE"

# 8. Final verification summary
echo "" >> "$LOG_FILE"
echo "====== VERIFICATION SUMMARY ======" | tee -a "$LOG_FILE"
echo "Datasets verified: $ORIG_COUNT" | tee -a "$LOG_FILE"
echo "Fragmentation level: $FRAG_LEVEL" | tee -a "$LOG_FILE"
echo "All verification checks passed successfully!" | tee -a "$LOG_FILE"
echo "Verification completed at $(date)" | tee -a "$LOG_FILE"

echo "Verification successful! Check $LOG_FILE for details."

Preventing Future Fragmentation

To minimize fragmentation in the future:

  1. Maintain Free Space: Keep pools below 80% utilization when possible.
  2. Manage Snapshots: Regularly clean up unnecessary snapshots.

  3. Monitor Fragmentation: Create a simple monitoring script:

#!/bin/bash
# Warn when any pool's free-space fragmentation exceeds 20%
zpool list -H -o name,frag | while read -r pool frag; do
  pct=${frag%\%}
  if [ "$frag" != "-" ] && [ "$pct" -gt 20 ]; then
    echo "WARNING: Pool $pool has high fragmentation: $frag"
  fi
done
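
To run this automatically, you could install it as a daily cron job. The script path here is an assumption; adjust it to wherever you saved the script:

# Hypothetical path /usr/local/bin/zfs-frag-check.sh; adjust as needed
cat > /etc/cron.d/zfs-frag-check <<'EOF'
0 8 * * * root /usr/local/bin/zfs-frag-check.sh
EOF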

  4. Consistent Recordsize: Use an appropriate recordsize for your workload:

# For general purpose:
zfs set recordsize=128K pool/dataset

# For large files:
zfs set recordsize=1M pool/dataset
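
Keep in mind that recordsize applies only to blocks written after the change, so check what a dataset currently uses, and where that value comes from, before and after tuning:

# VALUE shows the current recordsize; SOURCE shows whether it was
# inherited, set locally, or left at the default
zfs get -o name,value,source recordsize pool/dataset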

Performance Impact

After successful defragmentation, you can expect:

  • Improved write performance, especially for large files
  • More efficient space utilization
  • Potentially improved read performance, particularly on HDDs
  • More predictable storage behavior

Conclusion

ZFS fragmentation is a natural consequence of normal storage operations. While the defragmentation process requires significant effort and temporary downtime, the performance benefits can be substantial for heavily fragmented pools. Regular monitoring and proactive management will help maintain optimal ZFS performance over time.

Remember that the best approach to fragmentation is preventative - maintaining adequate free space and using appropriate ZFS properties for your workload will minimize the need for full defragmentation procedures.
