Deploying LLMs Locally and on AWS: A Step-by-Step Walkthrough with Atelier

This post picks up from Vibe Coding A Coding Harness, in which I vibe-coded a harness, dubbed Atelier, with Claude Code to learn how harnesses are built and to get a feel for producing large codebases. The GitHub repo is fully documented, but I wanted something livelier than a README: a walk-through of standing up large language models, locally and on AWS, and then using the harness.

Atelier: what the harness gives you

Atelier is a bring-your-own-model coding harness — three frontends (Tauri GUI, ratatui TUI, headless CLI) over one engine. You wire up a model, point it at a repo, and the harness handles everything between prompt and diff.

Models are swappable mid-session: Anthropic's Messages API, anything OpenAI-compatible (vLLM, Ollama, LM Studio, llama-server, sglang, OpenAI itself), or a Mock for tests. First contact with a new model triggers a capability probe cached under ~/.atelier/model_profiles/; the GUI's model-fit badge flags any mismatch between what a model claims and what it actually does.

Every turn carries a structured envelope (claimed changes, plan, uncertainty, grounding) alongside the prose. A verification gate diffs the claimed changes against the staged diffs before calling anything "done." Tools run sandboxed; sessions persist with atomic writes and resume after a kill -9; context can be compacted into pinned memory cards and expanded back. Cost and tokens land on a footer ledger.
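
To make the verification gate idea concrete, here is a minimal sketch that compares the paths a turn claims to have changed with the paths that actually appear in the staged diff. This is not Atelier's implementation: the function name verify_claims and the shape of its inputs are hypothetical, and a real gate would presumably compare hunks rather than just file paths.

```python
def verify_claims(claimed_paths, staged_paths):
    """Compare claimed changes with staged changes (hypothetical helper).

    Returns the paths that were claimed but not staged, the paths that were
    staged but never claimed, and an overall pass/fail flag.
    """
    claimed = set(claimed_paths)
    staged = set(staged_paths)
    return {
        "missing": sorted(claimed - staged),    # claimed, but no staged diff
        "unclaimed": sorted(staged - claimed),  # staged, but not in the envelope
        "ok": claimed == staged,
    }


result = verify_claims(
    claimed_paths=["src/auth.rs", "src/lib.rs"],
    staged_paths=["src/auth.rs"],
)
print(result)
```

In practice the staged list could come from something like git diff --cached --name-only; the gate would only report "done" when the ok flag is true.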

What it doesn't do: no Bedrock or Vertex adapter yet and no autonomous mode.

What We Will Cover

This post is divided into four parts:

Prerequisites - hardware and software
Local setup - getting the harness running against a local LLM
AWS setup - deploying a hosted LLM and wiring it into the harness
A tour - the harness's features

Prerequisites

Supported Hardware

Platform	Supported
Apple Silicon (M1 / M2 / M3 / M4)	✅ Yes — `aarch64.dmg` release asset
Intel macOS	❌ No — intentionally not shipped

Reference / Development Machine

The harness was developed on a MacBook with the following spec:

Field	Value
CPU	Apple M1 Pro — 10 cores (8 Performance + 2 Efficiency)
RAM	32 GB
Disk	≥250 GB free required during benchmark runs
OS	macOS 26.4.1

Build Prerequisites

These requirements apply only if you intend to build the harness from source; if you simply want to run the harness, you can ignore them.

Tool	Notes
Rust 1.85.0 via `rustup`	Pinned by `rust-toolchain.toml`; fetched automatically on first `cargo` invocation
Python 3	Required for `make check` and the calibration rig; version 3.14.4 was used for building the harness
Node + npm	Required for the Tauri GUI frontend (`npm --prefix crates/atelier-gui/ui install`); Node v25.8.2 was used for building the harness
`cargo tauri` CLI v2.x	GUI development only (`cargo tauri dev`)

Using the Atelier GUI with a Local LLM

1. Install the Atelier CLI

brew install ChrisAdkin8/atelier/atelier

Verify the installation:

atelier --version

2. Install the Atelier GUI

Download Atelier_0.1.1_aarch64.dmg from the v0.1.1 GitHub Release:

 gh release download v0.1.1 \
   --repo ChrisAdkin8/atelier \
   --pattern "Atelier_0.1.1_aarch64.dmg" \
   --dir ~/Downloads

Then install:

open ~/Downloads/Atelier_0.1.1_aarch64.dmg

Drag Atelier into your Applications folder.

Clear Gatekeeper quarantine before launching for the first time:

xattr -dr com.apple.quarantine /Applications/Atelier.app

3. Install Ollama and pull the model

brew install ollama
ollama pull qwen2.5-coder:7b

4. Start the Ollama server

Leave this running in a terminal tab. Ollama listens on http://localhost:11434 by default.

ollama serve

5. Create a providers.toml file

Create a providers.toml file in ~/.atelier with the following contents:

default = "local"
  
[providers.local]
provider = "openai-compat"
base_url = "http://localhost:11434/v1"
model    = "local:qwen2.5-coder:7b"
  
[runner]
max_turns = 32
  
[probe]
policy = "auto"

Rules to keep in mind, all enforced at config load (malformed or insecure files fail loudly rather than silently falling back):

default must name a profile you defined.
base_url for OpenAI-compatible servers must end in /v1.
Never put a literal api_key = "sk-..." — use keyring:SERVICE/USER (set via atelier providers auth ) or env:NAME.

6. Initialise your repo

From the repository you want Atelier to work in:

atelier init

7. Open the GUI

open /Applications/Atelier.app

Click Browse... in the header and select your repo. The local profile will be active by default. The model-fit badge in the footer (an indicator showing whether the model meets the harness's context and capability requirements) confirms the harness has recognised qwen2.5-coder:7b from its static capability table.

Note: Start ollama serve before opening the GUI. If Ollama is not running when you submit your first Agent prompt, the harness will fail the run and auto-draft a memory card (a structured note the harness writes to itself about run failures and environmental quirks) noting the provider is unreachable.

Using the Atelier GUI with an AWS-Hosted LLM

A local LLM works, but on a 32 GB MacBook responses can be sluggish and subagents (parallel worker LLMs the main agent can delegate subtasks to) are barely usable. For real headroom you want a hosted model on bigger hardware - either a runtime like vLLM on a GPU instance, or a managed platform like AWS Bedrock, GCP Vertex AI, or Azure Foundry. The accompanying infrastructure repo supports both Bedrock (through a LiteLLM proxy) and vLLM-on-EC2; we'll take the vLLM path.

Prerequisites

AWS account with EC2 GPU quota for the selected instance type (default: p4d.24xlarge). Note that, at the time of writing, this instance type costs roughly $32 an hour.
Terraform ≥ 1.6
AWS CLI configured with credentials that can create EC2, ALB, IAM, Secrets Manager, and CloudWatch resources
Hugging Face account and access token — required for gated models such as Meta Llama
atelier harness installed locally

Steps

1. Clone the repository

git clone https://github.com/ChrisAdkin8/atelier-bedrock-infra.git
cd atelier-bedrock-infra

2. Store the Hugging Face token in Secrets Manager

Gated models (Meta Llama and others) require a Hugging Face access token. Use the helper script, setting AWS_REGION to the same region you will use for aws_region in terraform.tfvars (us-west-2 in this walkthrough):

HF_TOKEN=hf_YOUR_TOKEN_HERE \
AWS_REGION=us-west-2 \
bash scripts/create-huggingface-secret.sh

The script prints the secret ARN. Copy it for use in the next step. If the secret already exists, update it instead:

aws secretsmanager put-secret-value \
  --secret-id huggingface-token \
  --secret-string "hf_YOUR_TOKEN_HERE" \
  --region us-west-2

Whilst testing this, I at one point managed to create the Hugging Face token secret and the LLM resources in different regions, so double-check that the secret lives in the region Terraform deploys to. Debugging this provided some interesting insights into the stages a model goes through whilst being deployed, which I will cover in a separate post.

3. Configure terraform.tfvars

cp terraform/terraform.tfvars.example terraform/terraform.tfvars

Edit terraform/terraform.tfvars. The minimum required changes are highlighted below.

Recommended default - Qwen2.5 72B AWQ

For brevity, the terraform.tfvars shown here omits some of the comments present in the version found in the repo:

aws_region                        = "us-west-2"
enable_gpu_vllm                   = true
gpu_vllm_hugging_face_secret_name = "huggingface-token"
gpu_vllm_model_id                 = "Qwen/Qwen2.5-72B-Instruct-AWQ"
gpu_vllm_served_model_name        = "qwen2.5-72b-instruct-awq"
gpu_vllm_instance_type            = "p4d.24xlarge"
gpu_vllm_tensor_parallel_size     = 8 # Qwen2.5-72B has 8 KV heads -> one per GPU
gpu_vllm_max_model_len            = 32768
gpu_vllm_extra_args               = "--quantization awq_marlin --kv-cache-dtype fp8_e5m2 --enable-prefix-caching --enable-auto-tool-choice --tool-call-parser hermes"
gpu_vllm_use_capacity_reservation = false

p4d.24xlarge
8 × A100 40GB GPUs (320 GB total HBM2, 600 GB/s NVLink, ~$32/hour). I'd hoped to use the beefier p4de.24xlarge, but kept hitting AWS capacity shortages in-region and fell back to the p4d.24xlarge.

AWQ INT4 shrinks the 72B model to ~39 GiB - it'd fit comfortably on a single 80GB card (which is what the p4de I wanted has). But p4d gives you 40GB cards, where 39 GiB of weights leaves no room for KV cache. So you shard across eight GPUs, which also buys concurrency and latency:

Concurrency. Spreading the model across 8 GPUs pools all 320 GB of HBM and frees ~28 GiB per GPU (~224 GiB in total) for KV cache. At roughly 10 GiB of fp16 KV cache per 32K-context request, that is about 22 concurrent requests, which fp8 KV then roughly doubles to ~45.
Latency. Tensor parallelism splits the work, so each token decodes faster (roughly proportional to GPU count, minus communication overhead). The user waiting for the first token feels this directly.
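
The concurrency figures above can be reproduced with a few lines of arithmetic. The sketch below is a back-of-the-envelope estimate, not a vLLM calculation: the layer count, KV head count, and head dimension are taken from Qwen2.5-72B's published model config (an assumption of this sketch, not something stated elsewhere in this post), and it ignores block-level overheads inside vLLM's paged cache.

```python
# Qwen2.5-72B attention shape, from the model's published config
# (assumed here for illustration).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128

GIB = 1024 ** 3


def kv_bytes_per_token(bytes_per_value):
    # Keys and values (the factor of 2) for every layer and every KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value


def concurrent_requests(kv_pool_gib, context_tokens, bytes_per_value):
    # How many full-length requests fit in the pooled KV cache.
    per_request = kv_bytes_per_token(bytes_per_value) * context_tokens
    return kv_pool_gib * GIB / per_request


pool_gib = 8 * 28  # ~28 GiB free per GPU across 8 GPUs, as quoted above
fp16 = concurrent_requests(pool_gib, 32768, bytes_per_value=2)
fp8 = concurrent_requests(pool_gib, 32768, bytes_per_value=1)
print(f"fp16: {fp16:.1f} requests, fp8: {fp8:.1f} requests")
# prints "fp16: 22.4 requests, fp8: 44.8 requests"
```

Each token costs 320 KiB of fp16 KV cache, so a full 32K-context request costs exactly 10 GiB; halving the bytes per value is what doubles the slot count.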

vLLM flags explained

This excerpt from the terraform.tfvars file relates to the configuration for vLLM and the hardware platform it runs on.

gpu_vllm_instance_type            = "p4d.24xlarge"
gpu_vllm_tensor_parallel_size     = 8 # Qwen2.5-72B has 8 KV heads -> one per GPU
gpu_vllm_max_model_len            = 32768
gpu_vllm_extra_args               = "--quantization awq_marlin --kv-cache-dtype fp8_e5m2 --enable-prefix-caching --enable-auto-tool-choice --tool-call-parser hermes"

tensor_parallel_size = 8
KV heads must divide evenly across GPUs. Qwen2.5-72B has 8 KV heads, so one per GPU. A model with 6 KV heads would limit you to 2, 3, or 6 GPUs, since 6 does not divide evenly across 4 or 8.
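
A quick way to see which tensor-parallel sizes satisfy the divisibility rule is to enumerate them. This sketch encodes only the rule as stated here; vLLM applies further checks of its own that it does not model, and the function name valid_tp_sizes is hypothetical.

```python
def valid_tp_sizes(num_kv_heads, num_gpus):
    """GPU counts up to num_gpus that split the KV heads evenly."""
    return [tp for tp in range(1, num_gpus + 1) if num_kv_heads % tp == 0]


print(valid_tp_sizes(8, 8))  # [1, 2, 4, 8]
print(valid_tp_sizes(6, 8))  # [1, 2, 3, 6]
```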

max_model_len = 32768
caps context at 32K per request. The model supports 128K; we cap lower on purpose, because a smaller per-request KV cache footprint leaves room for more concurrent requests.

--quantization awq_marlin
runs AWQ INT4 weights through the Marlin kernel: same accuracy as awq, ~25% faster decode on Ampere.

--kv-cache-dtype fp8_e5m2
stores KV cache as 8-bit instead of 16-bit. Compute stays fp16/bf16 (vLLM converts on the fly), so it only shrinks storage, not math - which is why it works on A100s despite no hardware FP8. Roughly doubles cache capacity.

--enable-prefix-caching
reuses KV blocks for unchanged prompt prefixes across requests. On by default under the V1 engine; listing it explicitly is defensive on older/V0 setups where it's opt-in. For agent loops where N sub-agents share a 20K planner prefix, fan-out drops from N × 20K prefill to 1 × 20K + N × small. Combined with fp8 KV, marginal fan-out cost approaches zero.
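
The fan-out saving is simple arithmetic. In the sketch below, the 20K shared prefix comes from the text above, while the 8 sub-agents and the 500-token per-agent suffix are illustrative assumptions; the function name prefill_tokens is hypothetical.

```python
def prefill_tokens(num_subagents, shared_prefix, unique_suffix, prefix_cached):
    """Tokens that need prefill when N sub-agents share a planner prefix."""
    if prefix_cached:
        # The shared prefix is computed once and then served from cache.
        return shared_prefix + num_subagents * unique_suffix
    # Without caching, every sub-agent pays for the full prompt.
    return num_subagents * (shared_prefix + unique_suffix)


without_cache = prefill_tokens(8, 20_000, 500, prefix_cached=False)
with_cache = prefill_tokens(8, 20_000, 500, prefix_cached=True)
print(without_cache, with_cache)  # 164000 24000
```

Under these assumptions, prefix caching cuts the prefill work for the fan-out by roughly a factor of seven.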

--enable-auto-tool-choice
--tool-call-parser hermes
Qwen2.5 tool calling needs both. The first lets the model decide when to call a tool; the second tells vLLM how to parse Qwen's <tool_call>{...}</tool_call> blocks into structured tool_calls. Wrong parser fails either loud (400 error) or silent (raw XML in content, empty tool_calls — worse). Match to model: Qwen2.5 → hermes, Qwen3-Coder-Next → qwen3_coder, Mistral/Nemo → mistral.
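
To illustrate what a tool-call parser has to do conceptually, the sketch below pulls <tool_call> blocks out of raw model text and turns them into structured calls. It is a simplified stand-in, not vLLM's hermes parser: the function name parse_tool_calls and the example tool read_file are hypothetical, and a production parser also has to handle streaming and malformed JSON.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block and captures the JSON inside.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Split model output into plain content and structured tool calls."""
    calls = [json.loads(body) for body in TOOL_CALL_RE.findall(text)]
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls


raw = (
    "Let me check.\n"
    "<tool_call>\n"
    '{"name": "read_file", "arguments": {"path": "src/main.rs"}}\n'
    "</tool_call>"
)
content, calls = parse_tool_calls(raw)
print(content)  # Let me check.
print(calls)
```

The silent failure mode described above corresponds to skipping this step entirely: the raw block stays in content and the list of calls stays empty.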

In Summary

awq_marlin: +25% decode throughput, free.
fp8_e5m2 KV: roughly doubles the concurrent 32K requests the EC2 instance can accommodate.
enable-prefix-caching: sub-agent fan-out essentially free.
auto-tool-choice + hermes: tool calling works.

This configuration for the p4d.24xlarge EC2 instance type gives us ~45 concurrent 32K-context slots with native tool calling and prefix-cached fan-out.

Optional - reuse an existing VPC

If your AWS account is at its VPC limit, point Terraform at an existing VPC instead of creating one:

vpc_id            = "vpc-0abc1234def56789"
public_subnet_ids = ["subnet-aaa111", "subnet-bbb222"]

Recommended - HTTPS via ACM

The ALB is public and clients authenticate with a bearer token, so HTTPS is strongly recommended for anything beyond local testing. If you have an ACM certificate in the same region as the ALB, set gpu_vllm_certificate_arn to expose HTTPS and redirect HTTP. Without it, the ALB uses plain HTTP and the API key travels in the clear:

gpu_vllm_certificate_arn = "arn:aws:acm:us-west-2:123456789012:certificate/abc-..."

4. Apply Terraform

cd terraform
terraform init
terraform apply

Terraform provisions:

A dedicated VPC with public subnets (or reuses yours)
A public Application Load Balancer
A GPU EC2 instance running vllm/vllm-openai via Docker on the AWS Deep Learning Base AMI
An IAM instance profile with Secrets Manager and CloudWatch Logs permissions
A generated vLLM API key stored in Secrets Manager under atelier/gpu-vllm/<environment>/api-key
A CloudWatch log group for bootstrap output

The first apply takes 10–20 minutes. vLLM pulls the model weights from Hugging Face on first boot, which can add another 10–30 minutes depending on model size.

5. Retrieve the endpoint

terraform output -raw openai_api_base
terraform output -raw openai_model_name
terraform output -raw openai_api_key

6. Verify the vLLM endpoint

curl -H "Authorization: Bearer $(terraform output -raw openai_api_key)" \
  "$(terraform output -raw openai_api_base)/models"

A healthy response lists the served model name. The ALB health check polls /health every 30 seconds and marks the instance healthy after two consecutive successes.
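
If you would rather script this check, the same request can be made from Python's standard library. This is a minimal sketch: the helper names served_model_ids and fetch_model_ids are hypothetical, and it assumes the endpoint returns the usual OpenAI-style model list (an object with a data array of entries carrying an id).

```python
import json
import urllib.request


def served_model_ids(models_json):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in json.loads(models_json)["data"]]


def fetch_model_ids(api_base, api_key, timeout=10):
    """GET {api_base}/models with a bearer token and return the model ids."""
    request = urllib.request.Request(
        f"{api_base}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return served_model_ids(response.read().decode("utf-8"))


# Offline example using a response of the expected shape:
sample = '{"object": "list", "data": [{"id": "qwen2.5-72b-instruct-awq", "object": "model"}]}'
print(served_model_ids(sample))  # ['qwen2.5-72b-instruct-awq']
```

Call fetch_model_ids with the values from terraform output -raw openai_api_base and terraform output -raw openai_api_key; the served model name from terraform.tfvars should appear in the returned list.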

7. Configure atelier

Obtain the vLLM API key:

terraform output -raw openai_api_key

Paste the key into the following command in place of the placeholder, then run it:

security add-generic-password \
    -s "atelier/providers/qwen2.5-72b-awq" \
    -a "atelier" \
    -w "<YOUR_OPEN_API_KEY_GOES_HERE>"

Create a directory for the configuration file that holds the details of the models available to the harness:

mkdir -p <your-project>/.atelier
cp atelier/providers.toml.example <your-project>/.atelier/providers.toml

Edit .atelier/providers.toml and add a profile that points at the vLLM endpoint:

default = "qwen2.5-72b-awq"

[providers."qwen2.5-72b-awq"]
provider = "openai-compat"
base_url = "http://atelier-gpu-vllm-dev-1234567890.us-west-2.elb.amazonaws.com/v1"
model    = "qwen2.5-72b-instruct-awq"
api_key  = "keyring:atelier/providers/qwen2.5-72b-awq"

Run atelier from your project directory:

# Default profile:
atelier run "explain the authentication flow"

# Explicit profile:
atelier run --profile qwen2.5-72b-awq "refactor the auth module"

Teardown

cd terraform
terraform destroy

The Secrets Manager secret for the API key has a 7-day recovery window. The Hugging Face token secret you created manually is not managed by this Terraform configuration and must be deleted separately if required.

Optional - Bedrock-backed LiteLLM proxy

The Bedrock path is disabled by default and not required for vLLM. Enable it only if you also want a Bedrock-backed LiteLLM proxy on ECS/Fargate:

enable_litellm_bedrock = true
enable_bedrock_logging = true

After applying, retrieve the proxy endpoint and master key:

terraform output -raw litellm_api_base
terraform output -raw litellm_master_key

Using Atelier: A Feature Tour

When we start the Atelier GUI, this is what greets us:

The GUI has correctly detected that the context window is 32K in size. We can also see a token meter at the bottom of the screen, and that qwen2.5-72b-awq is the default model.

Workspaces
This is essentially the root directory of the local repo, where the harness-generated code and artefacts will go.

Model Scoring
A badge in the bottom right-hand corner reflects the 'Quality' of the model as a score. Clicking on it reveals detailed insights into the score:

Running A Prompt
Let's run a simple prompt: "Give me a history of large language models":

4,119 characters of the context window have been used, and the token meter to its right, the event logs, and the context windows have all been populated.

Subagent Support
The harness supports subagents - this screen grab displays the result of:

In parallel, count all the rust and python files in project workspace directory and all of its sub directories

Agent Skills Support
Entering a forward slash in the prompt entry box brings up the skills that come bundled with the harness:

The Context Pane - "What am I paying for this turn"
This displays the GUI's live view of the agent's working-set: everything currently in the model's context, item-by-item - each row a ContextItemSummary showing token count (cyan = exact, yellow = approx, grey = unknown), 4-char provenance badge (init/usr/tool/mem/pin/asst), the item label (file path, first line, or sha256 prefix), and per-item actions. This is how you see what the model is seeing - file refs the agent pulled in, tool results, memory hits, user messages - and roughly how much budget each piece costs.

Clicking the checkbox at the side of a context item prevents it from being evicted - useful when you want to make sure a particular file, spec excerpt, or tool result survives the next round of auto-compaction or eviction across turns.

The Memory Card Pane - "What should survive beyond compaction"
Memory cards are created in one of two ways, via:

you - the user
Cards can be created by manually entering free text into the memory card window.
auto-compaction
When compaction fires, the dispatcher picks the largest unpinned items, asks the model to summarise them, and writes one card that's pinned by default.
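
The selection step of auto-compaction can be sketched as follows. This is an illustration of the "largest unpinned items first" idea only, not Atelier's dispatcher: the ContextItem class, the pick_for_compaction function, and the tokens_to_free threshold are all hypothetical, and the summarisation call to the model is left out.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    label: str
    tokens: int
    pinned: bool = False


def pick_for_compaction(items, tokens_to_free):
    """Choose the largest unpinned items until enough tokens are freed."""
    chosen, freed = [], 0
    candidates = sorted(
        (item for item in items if not item.pinned),
        key=lambda item: item.tokens,
        reverse=True,
    )
    for item in candidates:
        if freed >= tokens_to_free:
            break
        chosen.append(item)
        freed += item.tokens
    return chosen


items = [
    ContextItem("spec.md", 6000, pinned=True),
    ContextItem("tool: cargo test", 9000),
    ContextItem("src/auth.rs", 4000),
    ContextItem("user message", 300),
]
picked = pick_for_compaction(items, tokens_to_free=12000)
print([item.label for item in picked])  # ['tool: cargo test', 'src/auth.rs']
```

The pinned spec.md is never a candidate, which is the behaviour the pin checkbox in the context pane is there to guarantee.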

Memory cards provide the agent loop with somewhere to put a compaction summary that isn't the prompt itself. You can see the summary, decide it ate the wrong items, expand it back, and try again.

The Mental Model Pane

Think of the mental model pane as being akin to a CLAUDE.md or AGENTS.md file: it allows free-form text to be entered that acts as user-supplied guidance.

Summary

The previous post recapped my experiences vibe-coding a coding harness. This post covered deploying the harness GUI and using it with models running both locally and on AWS. The next post will cover some of the insights I gained into how a model is deployed, which came from debugging an issue where the Terraform-managed resources and the Hugging Face token secret were deployed to different regions by mistake.
