HPC in the Cloud

Building an AWS environment for quantum-device simulation

Author

Cameron Rutherford

Published

June 15, 2026

Who I Am

I worked at Amazon supporting the Center for Quantum Computing (CQC).

Amazon CQC

Designing Superconducting Qubit Chips

Palace

Palace, for PArallel LArge-scale Computational Electromagnetics, is an open-source, parallel finite element code for full-wave 3D electromagnetic simulations in the frequency or time domain, using the MFEM finite element discretization library and libCEED library for efficient exascale discretizations.

Palace (Meshing)

Palace (Simulation)

Intro

How do we design an HPC environment on AWS for quantum-device simulation?

Requirements:

On-demand compute for bursty workloads

HPC Scheduler / Queue for Palace simulations and other workflows

Interactive desktops for analysis and visualization

Secure and reliable environment for real research

Bursty Workloads

A key constraint was that workloads were bursty.

Researchers often needed:

Large instances on short notice (100+ c7gn nodes per simulation)
Interactive desktops for analysis and visualization
GPU capacity for visualization and related work
Shared software environments for HPC and desktop work

Always-on nodes would have been too expensive, and demand was not predictable enough to rely on reserved capacity.

SOCA Architecture

SOCA Improvements

Generally, five categories of improvements were needed:

Better AMIs (Amazon Machine Images) and build process
Architectural improvements through updating the codebase
Security and authentication improvements
AI / workflow improvements
Palace / Spack improvements

AMI Improvements

Improve the process for building AMIs

AMI builder scripts didn’t have set -x, so I added it for easier debugging
Some gnuutils libraries are blocked in us-east-1, so we had to move to building AMIs in us-east-2 and copying them to other regions

Add new features to the AMIs

GPU support was added, which required navigating NVIDIA drivers
Updated gcc and other packages to support newer instance types

Improve load times and UX for AMIs

x86/arm64, GRID / TESLA, desktop / compute nodes (5 total AMIs)
To speed up load times, we modified SOCA code to prevent runtime installation of some packages and to support custom AMIs

Architectural Improvements

Deprecate our old AL2 deployment

Moving from AL2 to Ubuntu 22.04 for better VSCode support (glibc et al.)
Port OpenLDAP from one scheduler to another, which was non-trivial

Upgrade from SOCA-lite to newest version of code (3+ year gap)

Moving to EC2 Fleet from EC2 ASG for better spot capacity management
Windows desktop support for users who need Windows-specific software
ODCR for improved UX when Insufficient Capacity Error (ICE) occurs

Additional improvements to the infrastructure

Queue default limits to reduce impact of problem users, improve reliability
Increased capacity + OSTs to reduce IOPS contention, plus capacity alerts
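As a rough illustration of what default queue limits guard against, the Python sketch below rejects a submission that exceeds a per-job node cap or a per-user running-job cap. The class, field names, and numbers are hypothetical and are not taken from SOCA or from any scheduler's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueLimits:
    # Hypothetical default limits for one queue.
    max_running_jobs_per_user: int
    max_nodes_per_job: int


def admit(limits, running_jobs_for_user, requested_nodes):
    """Return (ok, reason) for a new submission under the given limits."""
    if requested_nodes > limits.max_nodes_per_job:
        return False, f"requested {requested_nodes} nodes, limit is {limits.max_nodes_per_job}"
    if running_jobs_for_user >= limits.max_running_jobs_per_user:
        return False, f"user already has {running_jobs_for_user} running jobs"
    return True, "ok"


if __name__ == "__main__":
    limits = QueueLimits(max_running_jobs_per_user=2, max_nodes_per_job=128)
    print(admit(limits, running_jobs_for_user=0, requested_nodes=100))
    print(admit(limits, running_jobs_for_user=2, requested_nodes=1))
```

Returning a reason alongside the decision means a rejected user sees which limit was hit rather than a silent failure.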

Security and Auth Improvements

Move from OpenLDAP to AWS Managed Active Directory (MAD), adding SSO

MAD removed single point of failure, easing transitions between deployments
SSO removes passwords for auth. Implemented ALB in TypeScript + patches

Improvements around SOCA and AWS MAD integration

Clearing desktops from AD when terminated to prevent duplicate objects
JSON-based AD information to remove the need for compute nodes to join AD

More Linux-related security improvements

Adding rootless container runtimes / profiling to SOCA, allowing for the principle of least privilege (PoLP)
Improving filesystem permissions to use Linux groups, not r-x everywhere
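The group-based permission idea can be sketched in Python as follows: strip every "other" permission bit from a path and, optionally, hand the path to a specific group. The function name and the choice to clear all three "other" bits are assumptions for illustration; the actual permission scheme used in the deployment is not described here.

```python
import os
import stat


def drop_other_access(path, gid=None):
    """Remove all 'other' permission bits; optionally assign a group.

    Returns the resulting permission bits so callers can log or verify them.
    """
    if gid is not None:
        os.chown(path, -1, gid)  # -1 leaves the owning user unchanged
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~0o007)  # keep user/group bits, clear rwx for others
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    import tempfile

    shared = tempfile.mkdtemp()
    os.chmod(shared, 0o755)  # world-readable starting point
    print(oct(drop_other_access(shared)))
```

With this approach access is granted by adding a user to the owning group, not by leaving directories readable to everyone.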

AI / Workflow Improvements

Virtual Desktop UX improvement / VSCode + Tmux

Gnome desktop support, as well as network improvements to reduce latency
Tmux + VSCode for persistent terminals when doing agentic / Julia work

AI related workflow improvements

SOCA AI plugin to reduce user pain points, giving context to documentation
CCMT MCP, and reverse engineering for programmatic capacity queries

Traditional HPC workflow improvements

For loops within jobs (Flux), instead of for loops submitting jobs
/scratch for Palace and other workflows (Python/Julia slow on Lustre)
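The "loop inside the job" pattern can be sketched in plain Python: a single job fans a parameter sweep out over a worker pool, so the scheduler sees one submission instead of one per parameter. This is a stand-in for the pattern only and does not use Flux; `run_sweep` and `simulate` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sweep(task, params, max_workers=4):
    """Run task(p) for every p in params inside one job; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, params))


def simulate(frequency_ghz):
    # Placeholder for launching one simulation (e.g. via subprocess) per parameter.
    return frequency_ghz * 2


if __name__ == "__main__":
    print(run_sweep(simulate, [1, 2, 3]))
```

Threads are enough here if each task mostly waits on an external process; a CPU-bound Python task would need a process pool.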

Spack / Palace Improvements

Spackifying Palace

Transition from CMake super-build to a modular, Spack-based build
GHCR cache in CI to reduce builds from 1hr 15min to 15min!
ECR as OCI storage prototyping with Spack developers

Palace on EKS? Viable alternative but not spot-compute…

Spack OCI builds allowed container generation in GitHub CI
Paladin for cross-region workflows using REST API + EKS backends

Profiling, Performance, and Memory

OOM tripwire for better UX when running simulations that are “too large”
Transition from shared to dedicated instances due to “noisy neighbour” issue
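One way an OOM tripwire could work is sketched below in Python: parse Linux `/proc/meminfo` text and trip when the available fraction of memory falls below a threshold, so the job can be stopped with a clear message before the kernel OOM killer acts. The slides do not describe the real mechanism; the function names and the 5% threshold are assumptions.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sized fields)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def should_trip(meminfo_text, min_available_fraction=0.05):
    """Return True when MemAvailable / MemTotal is below the threshold."""
    info = parse_meminfo(meminfo_text)
    return info["MemAvailable"] / info["MemTotal"] < min_available_fraction


if __name__ == "__main__":
    # Linux only: read the live values from the kernel.
    with open("/proc/meminfo") as f:
        print(should_trip(f.read()))
```

A monitor would call `should_trip` periodically alongside the simulation and report which job was stopped and why.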

Lessons Learned

The biggest lesson was that infrastructure is not just about machines. It is about enabling a scientific workflow and making users’ lives easier.

That means paying attention to:

What the current limitations are, and how they impact users
Security and reliability concerns
The overall research workflow, and how infrastructure can support it

Questions?