HPC in the Cloud
Building an AWS environment for quantum-device simulation
June 15, 2026
Who I Am
I worked at Amazon supporting the Center for Quantum Computing (CQC).
![]()
Split Keyboard == True Programmer
Amazon CQC
![]()
Ocelot Chip
Designing Superconducting Qubit Chips
Palace
![]()
Palace, for PArallel LArge-scale Computational Electromagnetics, is an open-source, parallel finite element code for full-wave 3D electromagnetic simulations in the frequency or time domain, using the MFEM finite element discretization library and libCEED library for efficient exascale discretizations.
Palace (Meshing)
![]()
Single Transmon Mesh
Palace (Simulation)
![]()
Single Transmon Simulation
Intro
How do we design an HPC environment on AWS for quantum-device simulation?
Requirements:
- On-demand compute for bursty workloads
- HPC Scheduler / Queue for Palace simulations and other workflows
- Interactive desktops for analysis and visualization
- Secure and reliable environment for real research
Bursty Workloads
A key constraint was that workloads were bursty.
Researchers often needed:
- Large instances on short notice (100+ c7gn nodes per simulation)
- Interactive desktops for analysis and visualization
- GPU capacity for visualization and related work
- Shared software environments for HPC and desktop work
Always-on nodes would have been too expensive, and demand was not predictable enough to rely on reserved capacity.
SOCA Architecture
![]()
Outdated / Slightly Inaccurate
SOCA Improvements
Five broad categories of improvement were needed:
- Better AMIs (Amazon Machine Images) and a better build process
- Architectural improvements through updating the codebase
- Security and authentication improvements
- AI / workflow improvements
- Palace / Spack improvements
AMI Improvements
- Improve the process for building AMIs
- AMI builder scripts didn't have set -x, so I added it for better debugging
- Some gnuutils libraries are blocked in us-east-1, so we had to move to building AMIs in us-east-2 and copying them to other regions
- Add new features to the AMIs
- GPU support was added, requiring navigation of NVIDIA drivers
- Updating things like gcc and other packages to support newer instance types
- Improve load times and UX for AMIs
- x86/arm64, GRID / TESLA, desktop / compute nodes (5 total AMIs)
- To speed up load times, needed to modify SOCA code to prevent runtime installation of some packages and to support custom AMIs
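The build-in-one-region, copy-to-the-rest step above could be scripted roughly as follows. This is a minimal sketch, assuming boto3 is available; the slides do not say which tool was actually used. The helper names plan_ami_copies and copy_amis are hypothetical, and any AMI ID, image name, or region list passed in is a placeholder.

```python
from typing import Dict, List


def plan_ami_copies(source_image_id: str, name: str, source_region: str,
                    target_regions: List[str]) -> Dict[str, dict]:
    """Build one EC2 CopyImage request per target region (no AWS calls)."""
    plans = {}
    for region in target_regions:
        if region == source_region:
            continue  # the AMI already exists in the build region
        plans[region] = {
            "Name": name,
            "SourceImageId": source_image_id,
            "SourceRegion": source_region,
        }
    return plans


def copy_amis(plans: Dict[str, dict]) -> Dict[str, str]:
    """Submit the planned copies; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so planning works without boto3

    new_ids = {}
    for region, request in plans.items():
        # CopyImage is issued against the destination region.
        ec2 = boto3.client("ec2", region_name=region)
        new_ids[region] = ec2.copy_image(**request)["ImageId"]
    return new_ids
```

Separating the plan from the API call keeps the region logic checkable without credentials; copy_amis would only be run from a build host that has them.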
Architectural Improvements
- Deprecate our old AL2 deployment
- Moving from AL2 to Ubuntu 22.04 for better VSCode support (glibc et al.)
- Port OpenLDAP from one scheduler to another, which was non-trivial
- Upgrade from SOCA-lite to newest version of code (3+ year gap)
- Moving to EC2 Fleet from EC2 ASG for better spot capacity management
- Windows desktop support for users who need Windows-specific software
- On-Demand Capacity Reservations (ODCRs) for better UX when an Insufficient Capacity Error (ICE) occurs
- Additional improvements to the infrastructure
- Queue default limits to reduce impact of problem users, improve reliability
- Increased capacity + OSTs to reduce IOPS contention, capacity alerts
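To make the EC2 Fleet point concrete, here is a minimal sketch of the kind of request a scheduler could build for a Spot-backed burst. It only assembles keyword arguments for the EC2 CreateFleet API. The function name, launch template ID, subnets, and instance types are illustrative assumptions, not the SOCA implementation.

```python
from typing import List


def build_fleet_request(launch_template_id: str, instance_types: List[str],
                        subnet_ids: List[str], node_count: int) -> dict:
    """Build keyword arguments for the EC2 CreateFleet API (no AWS calls)."""
    # One override per (instance type, subnet) pair gives the fleet several
    # capacity pools to draw from, which is the main lever against ICE.
    overrides = [
        {"InstanceType": itype, "SubnetId": subnet}
        for itype in instance_types
        for subnet in subnet_ids
    ]
    return {
        "Type": "instant",  # synchronous, one-shot request
        "LaunchTemplateConfigs": [{
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": launch_template_id,
                "Version": "$Latest",
            },
            "Overrides": overrides,
        }],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": node_count,
            "DefaultTargetCapacityType": "spot",
        },
        "SpotOptions": {"AllocationStrategy": "price-capacity-optimized"},
    }
```

With credentials in place, the result could be passed as boto3.client("ec2").create_fleet(**request). Listing several instance types and subnets is a design choice here, not something the slides specify.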
Security and Auth Improvements
- Move from OpenLDAP to AWS Managed Active Directory (MAD), adding SSO
- MAD removed single point of failure, easing transitions between deployments
- SSO removes passwords for auth. Implemented ALB in TypeScript + patches
- Improvements around SOCA and AWS MAD integration
- Clearing desktops from AD when terminated to prevent duplicate objects
- JSON-based AD information to remove the need for compute nodes to AD join
- More Linux-related security improvements
- Adding rootless container runtimes / profiling to SOCA, allowing for the principle of least privilege (POLP)
- Improving filesystem permissions to use Linux groups, not r-x everywhere
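The filesystem-permission change can be illustrated with a small sketch: access is granted through group bits, and the world-readable r-x bits are removed. The helper restrict_to_group is hypothetical. It assumes the path's group has already been set to the right project group (for example with chgrp), which this snippet does not do.

```python
import os
import stat


def restrict_to_group(path: str) -> int:
    """Remove all 'other' permission bits from path and return the new mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode & ~stat.S_IRWXO  # drop r, w and x for 'other'
    os.chmod(path, new_mode)
    return new_mode
```

Applied to a directory with mode 755, this leaves 750: the owner and the group keep access, everyone else loses it.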
AI / Workflow Improvements
- Virtual Desktop UX improvement / VSCode + Tmux
- GNOME desktop support, as well as network improvements to reduce latency
- Tmux + VSCode for persistent terminals when doing agentic / Julia work
- AI related workflow improvements
- SOCA AI plugin to reduce user pain points, giving context to documentation
- CCMT MCP, and reverse engineering for programmatic capacity queries
- Traditional HPC workflow improvements
- For loops within jobs (flux), instead of for loops submitting jobs
- /scratch for Palace and other workflows (Python/Julia slow on Lustre)
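The "for loops within jobs" idea can be sketched without any scheduler: one job receives the whole sweep and runs the cases inside its own allocation, instead of a shell loop submitting one job per case. The sketch below uses Python's standard library as a stand-in; it is not Flux's API, and run_sweep is a hypothetical helper.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_sweep(commands: List[List[str]], max_parallel: int) -> List[int]:
    """Run every command inside the current job; return exit codes in order."""
    def run_one(cmd: List[str]) -> int:
        # check=False so one failing case does not abort the whole sweep
        return subprocess.run(cmd, check=False).returncode

    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))
```

The scheduler then sees a single job rather than hundreds of small ones, which keeps queue load and per-job startup cost down.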
Spack / Palace Improvements
- Spackifying Palace
- Transition from CMake super-build to a modular, Spack-based build
- GHCR cache in CI to reduce builds from 1hr 15min to 15min!
- ECR as OCI storage prototyping with Spack developers
- Palace on EKS? Viable alternative but not spot-compute…
- Spack OCI builds allowed container generation in GitHub CI
- Paladin for cross-region workflows using REST API + EKS backends
- Profiling, Performance, and Memory
- OOM tripwire for better UX when running simulations that are “too large”
- Transition from shared to dedicated instances due to “noisy neighbour” issue
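One way an OOM tripwire could work is a pre-flight check that compares an estimated memory footprint against what the node reports as available, and fails with a readable message instead of letting the kernel kill the job. This is a sketch under stated assumptions, not the actual Palace or SOCA mechanism: the 0.9 headroom factor, the function names, and the idea of a caller-supplied estimate are all hypothetical.

```python
class SimulationTooLarge(RuntimeError):
    """Raised before launch when the estimate exceeds the memory budget."""


def read_mem_available_bytes(meminfo_path: str = "/proc/meminfo") -> int:
    """Return MemAvailable from a Linux meminfo file, in bytes."""
    with open(meminfo_path) as fh:
        for line in fh:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024  # value is reported in kB
    raise RuntimeError("MemAvailable not found in " + meminfo_path)


def check_memory_budget(estimated_bytes: int, available_bytes: int,
                        headroom: float = 0.9) -> None:
    """Fail fast if the estimate exceeds headroom * available memory."""
    budget = int(available_bytes * headroom)
    if estimated_bytes > budget:
        raise SimulationTooLarge(
            f"estimated {estimated_bytes / 2**30:.1f} GiB exceeds "
            f"budget of {budget / 2**30:.1f} GiB per node"
        )
```

A job wrapper could call check_memory_budget(estimate, read_mem_available_bytes()) before starting the solver, so users get a clear "too large" message rather than a silent kill.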
Lessons Learned
The biggest lesson was that infrastructure is not just about machines. It is about enabling a scientific workflow and making users' lives easier.
That means paying attention to:
- What the current limitations are, and how they impact users
- Security and reliability concerns
- The overall research workflow, and how infrastructure can support it
Questions?
![]()
Thank you for listening!