elixir infrastructure beam otp June 10, 2026 9 min read

The BEAM Is the YAML You Actually Wanted

YAML Is a Tax. Use a Supervisor.

I haven't written Terraform by hand since 2024. Neither have you, probably. We prompt an LLM. We review the diff the way VAC reviews cheaters: scan the queue, nod, wave everyone through, aimbot and all. Apply.

Fine. I've made my peace with that part. What bothers me is the output. Thousands of lines of YAML. You can't loop over it. You can't branch on it. You can't ask it a question. And when you need a conditional, the official answer is a templating language bolted onto a templating language.

Disclosure: I build distributed services in Elixir, and I've done the ops underneath them. So I'm exactly the person who'd tell you the BEAM solves this. Known bias. Stick around anyway: I argue against myself before the end.

The Problem with "Infrastructure as Data"

YAML and HCL are configuration languages. They describe what you want, not how to get there. That's deliberate: constrained languages prevent footguns. But the constraint creates its own loop, and everyone reading this has been trapped in it:

flowchart LR
    A[Write YAML] --> B{Configurable enough?}
    B -- no --> C[Add templating layer<br/>Helm, Jinja, etc.]
    C --> A
    B -- yes --> D[Ship it]

The moment your YAML needs a loop, a conditional, or a value computed from three other values, you've invented a programming language. You've just chosen one with no functions, no debugger, and whitespace-sensitive semantics.

And when you need to call an external API, parse the response, and create resources based on what you found? That's not YAML's job, and it's not HCL's job either. So you duct-tape a Python script into the CI pipeline. Then a second one. Eighteen months later the scripts are load-bearing and the README says "ask Marko", and Marko left.

But Pulumi Exists

Here's the objection you should be making: "infrastructure in a real programming language" has been a shipping product since 2018. Pulumi. If "real language" were the whole argument, this post would be a Pulumi ad and you could close the tab.

It isn't. Pulumi kept Terraform's most important limitation: the execution model. Your TypeScript runs once, produces a plan, applies it, exits.¹ Better language, same lifecycle. A batch job. Between runs your infrastructure is unsupervised. Drift accumulates. Services die. The thing that's supposed to know about your system isn't running.

That's the gap, and it's why I keep reaching for the BEAM. Not because Elixir beats YAML as a language; that bar is a tripping hazard. Because the BEAM is a runtime. It stays up. It was built for the time between deploys. Which is most of the time.

What a Runtime Buys You

The BEAM gives you, in the standard library, the primitives every infrastructure tool eventually reinvents:

Processes: lightweight, isolated, crash independently. A failure is an event you handle, not an incident you discover.
Supervisors: restart policies as a data structure. Declarative fault tolerance, the thing your restart: always stanza is cosplaying as.
Ports: run external binaries as supervised children. The binary speaks stdin/stdout and doesn't know the BEAM exists.
Distribution: nodes that can see each other's processes and call each other's functions, in the standard library, since before "microservices" was a word.
mix release: one self-contained artifact that bundles the runtime and runs on any host matching the build machine's OS, architecture, and libc.
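To make the first item concrete, here is a minimal sketch of a crash arriving as an ordinary message. The module name `CrashDemo` is made up for illustration; the only real API it leans on is `spawn_monitor/1` and `receive`.

```elixir
defmodule CrashDemo do
  # Hypothetical helper, for illustration only.
  # Runs `fun` in an isolated, monitored process and reports how it ended.
  def run(fun) when is_function(fun, 0) do
    {pid, ref} = spawn_monitor(fun)

    receive do
      # The monitor delivers one :DOWN message when the process ends.
      {:DOWN, ^ref, :process, ^pid, :normal} -> :ok
      {:DOWN, ^ref, :process, ^pid, reason} -> {:crashed, reason}
    end
  end
end
```

`CrashDemo.run(fn -> exit(:boom) end)` hands back `{:crashed, :boom}` and the caller keeps running. The failure is a value you pattern-match on, which is the whole trick supervisors are built from.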

This is thirty years of someone else's production hardening, originally paid for by telephone switches that were not allowed to go down.² We've been using it to build chat apps. It's like inheriting a fire truck and using it to water the garden.

One distinction worth getting right, because it's where the fault-tolerance pitch lives or dies: ports, not NIFs, for anything you don't fully trust. A port is an external OS process; if your Rust binary segfaults, the BEAM notices and restarts it. A NIF runs inside the VM; if it segfaults, it takes your entire orchestrator down with it, along with this blog post's whole argument. NIFs are for hot loops you've already debugged. Ports are for everything else.

A Concrete Example

Say you need to run a Rust service for the hot path, supervise it, and restart it when it dies. Here's the whole thing. It compiles, it runs, and we're going to kill it to prove it.

defmodule Infra.RustService do
  @moduledoc """
  Wraps an external binary in a supervised GenServer via a Port.
  The binary's side of the contract: exit when stdin closes.
  """
  use GenServer

  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

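  @impl GenServer
  def init(opts) do
    path = Keyword.fetch!(opts, :path)
    # This process owns the port: if the GenServer dies, the port closes
    # and the binary sees EOF on stdin.
    port = Port.open({:spawn_executable, path}, [:binary, :exit_status])
    {:ok, %{port: port}}
  end

  @impl GenServer
  # Anything the binary writes to stdout arrives as a message from the port.
  def handle_info({port, {:data, output}}, %{port: port} = state) do
    Logger.info("hot-path: #{String.trim(output)}")
    {:noreply, state}
  end

  # The :exit_status option turns the binary's death into a message too.
  # Stop this GenServer; the supervisor restarts it, which reopens the port.
  def handle_info({port, {:exit_status, status}}, %{port: port} = state) do
    Logger.warning("hot-path exited with status #{status}")
    {:stop, :service_died, state}
  end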
end

defmodule Infra.Supervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl Supervisor
  def init(_arg) do
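    children = [
      # Infra.ConfigManager is defined elsewhere (not shown); any module
      # that exposes child_spec/1 can sit here.
      Infra.ConfigManager,
      {Infra.RustService, path: "/usr/local/bin/hot-path"}
    ]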

    Supervisor.init(children, strategy: :one_for_one, max_restarts: 5, max_seconds: 30)
  end
end

The GenServer opens the port, logs the binary's output, and stops when the binary exits. The supervisor sees the stop and restarts the GenServer, which reopens the port, which respawns the binary. Now the demo:

$ kill -9 $(pgrep hot-path)

[warning] hot-path exited with status 137
[info] hot-path: hot-path v0.3.1 listening on /tmp/hot-path.sock

That's it. That's the pitch. No restartPolicy, no systemd unit, no sidecar. A process died, its supervisor was running (because the supervisor is always running), and the gap between failure and recovery was one message send.

Two honest footnotes. First, the contract cuts both ways: when the GenServer dies, the port closes, and your binary is expected to notice its stdin closed and exit. If you can't trust it to behave, MuonTrap exists precisely because other people couldn't trust theirs either. Second, max_restarts: 5, max_seconds: 30 means the supervisor tolerates five restarts in any 30-second window; a crash-looping binary takes the supervisor down with it on the sixth crash. That is correct: a thing that can't average six seconds of uptime is a problem for a human, not a restart policy.
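If you want to watch the restart budget run out without waiting for a real binary to misbehave, here is a small sketch. `CrashLoop` and `RestartBudget` are invented names, and the child is a stand-in that dies right after starting; the supervisor options are the real ones from above, just with smaller numbers.

```elixir
defmodule CrashLoop do
  # Hypothetical child standing in for a crash-looping binary:
  # it starts cleanly, then dies on its first message.
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil)

  @impl GenServer
  def init(nil) do
    send(self(), :die)
    {:ok, nil}
  end

  @impl GenServer
  def handle_info(:die, state), do: {:stop, :boom, state}
end

defmodule RestartBudget do
  # Starts a supervisor with a small restart budget and returns the reason
  # the supervisor itself exits with once that budget is exceeded.
  def exhaust(max_restarts) do
    # Trap exits so the supervisor's death arrives as a message
    # instead of killing the caller. Restored before returning.
    old_flag = Process.flag(:trap_exit, true)

    {:ok, sup} =
      Supervisor.start_link([CrashLoop],
        strategy: :one_for_one,
        max_restarts: max_restarts,
        max_seconds: 5
      )

    result =
      receive do
        {:EXIT, ^sup, reason} -> reason
      after
        2_000 -> :still_running
      end

    Process.flag(:trap_exit, old_flag)
    result
  end
end
```

`RestartBudget.exhaust(2)` returns `:shutdown` (after a noisy burst of crash logs): the supervisor restarts the child twice, refuses the third restart, and exits. In a real tree that exit propagates to the parent supervisor, which is how a local crash loop becomes somebody's page.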

The Distribution Bonus

Where this stops being a process manager and starts being infrastructure: BEAM nodes form a cluster out of the box, and calling a function on another machine is a standard library call.

nodes = [:"deploy@prod-1", :"deploy@prod-2", :"deploy@prod-3"]

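# Everywhere at once: local fan-out, remote execution.
# `manifest` is assumed to be bound already. max_concurrency defaults to the
# number of online schedulers, so set it explicitly to hit every node at once.
results =
  nodes
  |> Task.async_stream(
    fn node -> {node, :erpc.call(node, Infra.Deploy, :run, [manifest], :timer.minutes(5))} end,
    max_concurrency: length(nodes),
    timeout: :infinity
  )
  |> Enum.map(fn {:ok, result} -> result end)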

# Or a rolling deploy, which is literally Enum.each:
Enum.each(nodes, fn node ->
  :ok = :erpc.call(node, Infra.Deploy, :run, [manifest], :timer.minutes(5))
  :ok = Infra.Health.await(node)
end)

A rolling upgrade with health gates is Enum.each. I have watched vendors charge six figures a year for Enum.each.
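Infra.Health.await isn't shown in this post, so here is one way a health gate could look. This is a sketch under assumptions: `HealthGate`, the `:attempts` and `:interval` options, and the idea of passing the check in as a function are all mine, not an existing library's API.

```elixir
defmodule HealthGate do
  # Hypothetical stand-in for something like Infra.Health.await/1.
  # Calls `check` until it returns a truthy value or attempts run out.
  # The check always runs at least once.
  def await(check, opts \\ []) when is_function(check, 0) do
    attempts = Keyword.get(opts, :attempts, 10)
    interval = Keyword.get(opts, :interval, 1_000)
    poll(check, attempts, interval)
  end

  defp poll(check, attempts, interval) do
    cond do
      check.() ->
        :ok

      # That was the last allowed attempt: give up without sleeping again.
      attempts <= 1 ->
        {:error, :unhealthy}

      true ->
        Process.sleep(interval)
        poll(check, attempts - 1, interval)
    end
  end
end
```

In the rolling deploy, the check would be whatever "healthy" means for you, for example a closure that makes an `:erpc.call` to the node and compares the result. Because the gate returns `{:error, :unhealthy}` rather than `:ok`, the `:ok =` match in the loop fails and the rollout stops at the first bad node.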

And when prod-2 is down, :erpc.call raises {:erpc, :noconnection}, right there, at the call site, in the loop iteration that knows exactly which node it was talking to. Compare that to discovering the same fact forty minutes later by spelunking through a CI runner's log archive, and tell me which one you want at 3 a.m.
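If you'd rather collect failures than crash on the first one, the raise is easy to turn into a value. A minimal sketch: `RemoteCall` is a made-up wrapper, while `:erpc.call/5` and the `{:erpc, reason}` error shape are the real OTP behaviour.

```elixir
defmodule RemoteCall do
  # Hypothetical wrapper around :erpc.call/5 that returns tagged tuples
  # for transport-level failures instead of raising.
  def call(node, mod, fun, args, timeout \\ 5_000) do
    {:ok, :erpc.call(node, mod, fun, args, timeout)}
  rescue
    e in ErlangError ->
      case e.original do
        # erpc's own failures (:noconnection, :timeout, ...) become values,
        # tagged with the node we were trying to reach.
        {:erpc, reason} -> {:error, {node, reason}}
        # Anything else is not a transport failure; let it propagate unchanged.
        _other -> reraise e, __STACKTRACE__
      end
  end
end
```

Calling `RemoteCall.call(:"deploy@prod-2", Infra.Deploy, :run, [manifest])` against an unreachable node gives you `{:error, {:"deploy@prod-2", :noconnection}}`, which you can log, retry, or feed into whatever decides if the rollout continues.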

The Part Where I Argue with Myself

Of course I reach for the BEAM first. I'm a man with a hammer, and it's a genuinely excellent hammer. So let me steelman the other side properly.

Ecosystem. Terraform has a provider for everything with an API and several things without one. The Elixir equivalent is ex_aws plus community libraries of varying freshness, plus Req and an afternoon. For mainstream cloud CRUD, Terraform's providers are simply better tools, and pretending otherwise would be evangelism of the embarrassing kind.

Team legibility. YAML can be read by everyone: developers, SREs, auditors, the intern, the LLM reviewing the intern. Elixir is read fluently by Elixir developers, a set whose intersection with "people on call at your company" may be exactly me. If the infrastructure layer is owned by people who don't write Elixir, this whole idea is a non-starter, and no blog post changes that.

State. Terraform's most underrated feature is tfstate. Provisioning is stateful, and the BEAM does not solve state; processes are ephemeral and ETS dies with the node. You'd reach for SQLite, DETS, or an S3 bucket, and congratulations, you're now maintaining a small bespoke state backend, the activity Terraform exists to spare you. Anyone who tells you the BEAM handles this is selling something. The honest version: the BEAM gives you a very good place to put the code that manages state, and nowhere free to put the state.
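For a sense of how small the "small bespoke state backend" starts out, here is a sketch on top of DETS, which ships with OTP. `StateFile` and its functions are hypothetical names; the `:dets` calls are real.

```elixir
defmodule StateFile do
  # Hypothetical minimal state backend: one DETS file, opened per operation.
  # Single node only, no locking, no history.

  def put(path, key, value) do
    with_table(path, fn table -> :dets.insert(table, {key, value}) end)
  end

  def get(path, key) do
    with_table(path, fn table ->
      case :dets.lookup(table, key) do
        [{^key, value}] -> {:ok, value}
        [] -> :error
      end
    end)
  end

  defp with_table(path, fun) do
    # The table name can be any term; a fresh ref avoids name clashes.
    {:ok, table} =
      :dets.open_file(make_ref(), file: String.to_charlist(path), type: :set)

    try do
      fun.(table)
    after
      # Closing flushes pending writes to disk.
      :dets.close(table)
    end
  end
end
```

It survives a node restart, which ETS does not. It also has no locking between writers, no remote storage, and no answer to "who changed this and when", which is roughly the list of things tfstate backends already worked out.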

And the real question: is the problem YAML, or the workflow? The LLMs already type the YAML. The pain is drift, debugging, "why didn't this resource update", module versioning. An Elixir glue layer attacks the drift and debugging (a supervised process can watch for drift instead of discovering it at the next plan), but module versioning hell gets swapped for hex package versioning hell, which is nicer but still a lateral move wearing a better language.
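The "supervised process can watch for drift" part is less exotic than it sounds. A sketch, with every name invented for illustration: `DriftWatcher` and its `:desired`, `:fetch`, `:notify`, and `:interval` options are assumptions, and how you actually fetch real-world state is left to the function you pass in.

```elixir
defmodule DriftWatcher do
  # Hypothetical drift watcher: periodically compares desired state with
  # whatever `fetch` reports and calls `notify` when they differ.
  use GenServer
  require Logger

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl GenServer
  def init(opts) do
    state = %{
      desired: Keyword.fetch!(opts, :desired),
      fetch: Keyword.fetch!(opts, :fetch),
      notify:
        Keyword.get(opts, :notify, fn diff ->
          Logger.warning("drift detected: #{inspect(diff)}")
        end),
      interval: Keyword.get(opts, :interval, :timer.seconds(30))
    }

    # Run the first check right after startup.
    send(self(), :check)
    {:ok, state}
  end

  @impl GenServer
  def handle_info(:check, state) do
    actual = state.fetch.()

    if actual != state.desired do
      state.notify.(%{desired: state.desired, actual: actual})
    end

    # Schedule the next check; a crash in fetch/notify is the supervisor's problem.
    Process.send_after(self(), :check, state.interval)
    {:noreply, state}
  end
end
```

Put `{DriftWatcher, desired: ..., fetch: ...}` in a supervisor's child list and the check keeps running between deploys. It doesn't fix drift, and it doesn't replace a plan; it just moves the moment you find out.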

I'll also note I'm not alone out here, which is either validation or a support group. FLAME is Chris McCord arguing that elastic infrastructure should be a function call inside your runtime. Bonny lets you write Kubernetes operators in Elixir. libcluster and Horde handle clustering and distributed supervision. The "your runtime is your infrastructure" idea keeps independently re-emerging from the BEAM community, the way good ideas do and cryptocurrencies also do. Draw your own conclusions.

The Verdict

Keep Terraform for the cloud provider CRUD. The providers are good, the state handling is solved, and the LLM writes it anyway.

But the glue layer? Secrets, rolling deploys, supervising the weird Rust binary, noticing at 3 a.m. that something died. Right now that's a pile of Python scripts and cron jobs held together by hope. It should be a supervision tree.

flowchart LR
    subgraph TF[Terraform<br/>cloud provider CRUD]
        TF1[Instances]
        TF2[Buckets]
        TF3[Databases]
    end

    subgraph BEAM[Elixir/BEAM<br/>the layer that pages you]
        Config[Config manager]
        Deploy[Deploy coordinator]
        Monitor[Monitors + supervision]
        Rust[Supervised native services]
    end

    Deploy -->|drives| TF
    Config --> Deploy
    Monitor --> Rust

Let Terraform keep the nouns. I'm taking the verbs.

Sources

¹ Pulumi's own architecture docs spell out the lifecycle: the program runs to compute a desired-state resource graph, the engine diffs it against the last state and applies the difference, and then "the language host exits as the program has finished running." A batch pulumi up, not a long-running reconciler. See How Pulumi Works.
² This is not folklore. Erlang/OTP was built at Ericsson for carrier-grade telecom switches like the AXD301, where extended downtime was simply not on the menu, and that lineage is the whole reason the runtime is good at staying up. See All For Reliability: Reflections on the Erlang Thesis.