DWH Performance Tuning on AWS

Moving large-scale financial systems to the public cloud is a high-stakes game. In regulated industries, you often cannot perform casual test-in-production experiments. You have to design correctly and then verify quickly.

When migrating a massive data warehouse (DWH) from on-premises hardware to AWS EC2, the natural instinct is to match specifications: “If I had 16 cores and 128 GB of RAM on-prem, I need an X1e instance with the same numbers on AWS.”

However, desk calculations often fail to predict real-world behavior. This article explores performance pitfalls encountered during a real-world DWH migration — specifically related to SQL execution speeds, EBS burst balances, and Oracle ASM device mapping — and how to solve them without blowing up your budget.

The Problem: The Desk Calculation Trap

In this migration, the engineering team initially selected the EC2 X1e family (memory-optimized) because its specifications closely mirrored the existing on-premises hardware.

However, during benchmark testing using actual SQL queries, the results were counterintuitive:

  • X1e instance: Slower execution, higher latency
  • R5 instance: Faster execution, despite lower headline specs on paper

The Lesson:
SQL workloads often rely heavily on single-thread performance. The newer-generation chipsets in the R5 family provided better per-core performance than the older X1e generation, even if the total memory throughput appeared lower on paper. Always benchmark your actual queries; never rely solely on spec sheets.

The Bottleneck: The Silent I/O Killer

Once the instance type was finalized, the team encountered a familiar pattern during batch processing:

  • Start: The batch job runs at high speed
  • Middle: Performance degrades sharply after ~15 minutes
  • Metrics: CPU utilization drops while I/O wait spikes to nearly 100%
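That pattern is easy to confirm from inside the guest while the batch runs. A quick sketch (iostat comes from the standard sysstat package):

Bash code snippet:

# Print extended device statistics every 5 seconds. Early in the run,
# %user/%system dominate; once throttling kicks in, %iowait climbs and
# device utilization (%util) sits near 100 while throughput drops.
iostat -xm 5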

Diagnosing EBS Burst Balance

The culprit was the EBS burst balance. EC2 instances have a dedicated but finite bandwidth allocation for EBS (Elastic Block Store) traffic, and on many instance sizes the advertised maximum is a burst figure. A heavy batch job draws those burst credits down; once they are exhausted, throughput is throttled back to a much lower baseline.

In a database environment, this is catastrophic: the CPU sits idle while waiting for disk I/O.
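CloudWatch makes the credit drain visible. A sketch of one way to check it, assuming gp2/st1/sc1 volumes (the volume ID and time window are placeholders); on Nitro instances, the instance-level EBSByteBalance% and EBSIOBalance% metrics in the AWS/EC2 namespace tell the same story for the instance’s own EBS bandwidth:

Bash code snippet:

# Pull the per-volume BurstBalance metric for the window in which the batch
# job slowed down; a balance trending toward 0% confirms credit exhaustion.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2021-06-01T00:00:00Z \
  --end-time 2021-06-01T03:00:00Z \
  --period 300 \
  --statistics Average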

The Solution: The “b” Variant

The traditional fix is to upscale the instance (for example, moving from xlarge to 2xlarge). However, this introduces a major cost issue for commercial databases like Oracle, which are licensed by CPU core count. Doubling the instance size doubles the license cost — just to gain more I/O bandwidth.

The Fix: Switch to R5b instances.

The “b” suffix in an instance family name (e.g., r5b) denotes block storage optimization. These instances provide up to 3× the EBS bandwidth of their non-b counterparts without increasing the CPU count:

  • R5.xlarge: up to ~4,750 Mbps EBS throughput (the sustained baseline is lower)
  • R5b.xlarge: up to ~10,000 Mbps EBS throughput
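These figures can be pulled straight from the EC2 API instead of a spec sheet. A minimal sketch using the AWS CLI (the field names are those exposed by describe-instance-types):

Bash code snippet:

# Compare the baseline and maximum EBS bandwidth (Mbps) of the two candidates
aws ec2 describe-instance-types \
  --instance-types r5.xlarge r5b.xlarge \
  --query 'InstanceTypes[].{Type:InstanceType,BaselineMbps:EbsInfo.EbsOptimizedInfo.BaselineBandwidthInMbps,MaxMbps:EbsInfo.EbsOptimizedInfo.MaximumBandwidthInMbps}' \
  --output table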

By switching to R5b, batch processing times stabilized and I/O wait dropped to negligible levels — without increasing Oracle licensing fees.

Technical Deep Dive: The “Nitro” Boot Issue

Switching instance types isn’t always a simple stop/start operation, especially when moving from older Xen-based instances to modern Nitro-based instances (such as R5 and R5b).
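The API side of the swap itself is straightforward; the care goes into preparing the OS first, as described below. A sketch of the sequence (the instance ID and target type are placeholders):

Bash code snippet:

# Stop the instance, change its type, and start it again.
# Do NOT run this until the initramfs and udev/ASM changes below are in place.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --instance-type Value=r5b.xlarge
aws ec2 start-instances --instance-ids i-0123456789abcdef0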

1. initramfs and NVMe Drivers

Modern AWS instances expose EBS volumes as NVMe devices. If your Linux OS was installed on an older instance, the dracut initramfs generator may not have included the nvme kernel module.

Symptom:

After changing the instance type, the instance fails to boot (status check failure).

The Fix:

Before changing the instance type, regenerate the initramfs to include NVMe drivers.

Bash code snippet:

# Force the inclusion of NVMe drivers
# (This command varies by distro, example for RHEL/CentOS)
dracut -f -v --add-drivers "nvme"

Note: On AWS, if you don’t use the -H (host-only) option, dracut usually includes common drivers. However, verifying that the NVMe module is present in the boot image is critical before migration.
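A quick way to do that verification, assuming a RHEL/CentOS-style initramfs path:

Bash code snippet:

# Confirm the nvme module is packed into the image the instance will actually
# boot from, before changing the instance type.
lsinitrd /boot/initramfs-$(uname -r).img | grep -i nvme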

2. Fixing Oracle ASM Device Mapping

On older instances, devices appeared as /dev/xvd*. On Nitro instances, they appear as /dev/nvme*n*. This breaks Oracle ASM (Automatic Storage Management) if it relies on hard-coded paths or udev rules pointing to the old names.

The Fix: Update udev rules to dynamically recognize NVMe paths.

Old rule (Xen):

KERNEL=="xvdc1", OWNER="grid", GROUP="asmadmin", MODE="0660"

New rule (Nitro / NVMe):

# Use a wildcard for the NVMe namespace/partition
KERNEL=="nvme*p1", OWNER="grid", GROUP="asmadmin", MODE="0660"

You must also update the Oracle ASM disk string to match the new device pattern.

SQLPlus Command:

-- Point ASM to the new NVMe devices
ALTER SYSTEM SET asm_diskstring='/dev/nvme*';
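Afterwards, a quick sanity check from the ASM instance confirms that the disks are discovered at their new paths (V$ASM_DISK is a standard ASM view, not specific to this migration):

SQLPlus Command:

-- List discovered disks; the paths should now be /dev/nvme* devices
SELECT path, header_status, state FROM v$asm_disk;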

Summary: Optimization Checklist

If you are lifting a database-heavy workload to AWS, follow this checklist to avoid the “paper spec” trap:

  • Do not trust spec sheets blindly: Newer-generation instances with lower headline specs often outperform older generations thanks to IPC (instructions per clock) improvements.
  • Watch the burst balance: If performance degrades after 15–30 minutes, you are likely hitting EBS limits. Monitor I/O wait closely.
  • Optimize for licensing: Use “b” instances (e.g., r5b) to increase I/O throughput without increasing vCPU count — and licensing costs.
  • Prepare for NVMe: When migrating to Nitro instances, ensure your OS ramdisk includes NVMe drivers and update database storage mappings (udev/ASM).

By proactively addressing these architectural nuances, you can deliver a high-performance cloud environment that satisfies both engineers and budget owners alike.
