punktfunk

Implementation Plan

The full design: protocol core, milestones, and architecture.

A ground-up low-latency desktop streaming stack, built Linux-first, with a shared Rust protocol core and native clients per platform.

punktfunk is a placeholder codename — rename freely. It fits the lowercase house style (unom, played, remplir) and reads as "glass-to-glass light," which is the whole point.


0. The thesis (why this is worth building)

Two concrete gaps justify a new project rather than another fork:

  1. The 1 Gbps wall is a FEC design limit, not a bandwidth limit. Moonlight/Sunshine protect each frame with Reed–Solomon over GF(2⁸), which caps a block at 255 shards. At 5120×1440@240 that ceiling is hit around 1 Gbps. Switching the erasure code to Leopard-RS over GF(2¹⁶) (via the reed-solomon-simd crate) raises the per-block shard limit to 65,536 and runs in O(n log n) with SIMD. The wall disappears as a consequence of a better core, not as a hack.

  2. Linux software virtual displays are a real, unfilled gap. The compositor-side capability now exists (Mutter headless virtual monitors since GNOME 40; wlroots headless outputs; KWin virtual outputs in Plasma 6), but no streaming host drives those APIs to create a client-sized output on demand, capture it via PipeWire, and route input back via libei. Apollo's virtual display is Windows-only. This is the immediate, shippable win.
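
To make point 1 concrete, here is a minimal back-of-envelope sketch of how many shards one average frame needs. The 1,200-byte shard payload and the 20% parity ratio are illustrative assumptions, not values taken from the GameStream wire format, and the sketch ignores both keyframe spikes and any splitting of a frame across several FEC blocks.

```rust
/// Per-block shard ceilings implied by the field size.
const GF256_MAX_SHARDS: u64 = 255; // GF(2^8)
const GF65536_MAX_SHARDS: u64 = 65_536; // GF(2^16)

/// Average encoded frame size in bytes at a given bitrate and frame rate.
fn avg_frame_bytes(bitrate_bps: u64, fps: u64) -> u64 {
    bitrate_bps / 8 / fps
}

/// Data shards needed to carry one frame (ceiling division).
/// `payload_bytes` is an assumed per-shard payload, not a protocol constant.
fn data_shards(frame_bytes: u64, payload_bytes: u64) -> u64 {
    (frame_bytes + payload_bytes - 1) / payload_bytes
}

/// Data plus parity shards; `fec_percent` is parity as a percentage of data.
fn total_shards(data: u64, fec_percent: u64) -> u64 {
    data + (data * fec_percent + 99) / 100
}

/// True when the whole frame fits in a single FEC block of `max_shards`.
fn fits_one_block(total: u64, max_shards: u64) -> bool {
    total <= max_shards
}
```

Under those assumptions, 1 Gbps at 240 Hz averages 520,833 bytes per frame, which is 435 data shards and 522 shards with parity: well past the 255 a GF(2⁸) block allows, and a small fraction of 65,536.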

Strategic ordering: ship the Linux virtual-display host speaking the existing Moonlight protocol first (every Moonlight/Artemis client works on day one, no client to write). Only then introduce the new GF(2¹⁶) transport as a negotiated protocol extension with our own clients. Value early, hard parts deferred until de-risked.


1. Scope & non-goals

In scope (eventually):

  • Linux streaming host with on-demand software virtual displays (KWin first, then wlroots, then Mutter).
  • A shared Rust protocol/transport/FEC core exposed over a stable C ABI.
  • A modern transport that removes the 1 Gbps ceiling.
  • Native clients: Rust (Linux), Swift (macOS/iOS), Kotlin (Android) — all linking the same core.

Explicit non-goals (at least at first):

  • Windows host support (Sunshine/Apollo already do this well; no gap to fill).
  • Internet/NAT-traversal relay infrastructure (LAN/VPN first; you already run Headscale/NetBird — lean on that).
  • Reinventing encoders/decoders (bind to FFmpeg + vendor SDKs; never rewrite codecs).
  • A bespoke compositor (drive existing ones; only consider a dedicated headless compositor as a deployment mode, see §6).

2. Architecture overview

flowchart TD
    subgraph Host["Linux Host (Rust)"]
        VD["Virtual display orchestrator<br/>(KWin / wlroots / Mutter)"]
        CAP["Capture<br/>(PipeWire / dmabuf)"]
        ENC["Encoder<br/>(VAAPI / NVENC via FFmpeg)"]
        VD --> CAP --> ENC
        ENC --> COREH
        IN_H["Input injector<br/>(libei / uinput)"]
        COREH["punktfunk-core (C ABI)<br/>protocol · FEC · pacing · crypto"]
        COREH --> IN_H
    end

    COREH <-->|"UDP+FEC video / QUIC control+audio"| COREC

    subgraph Client["Client (Rust / Swift / Kotlin)"]
        COREC["punktfunk-core (same crate, C ABI)"]
        DEC["Decoder<br/>(VideoToolbox / NVDEC / VAAPI)"]
        PRES["Present + frame pacing"]
        INP["Input capture"]
        COREC --> DEC --> PRES
        INP --> COREC
    end

The load-bearing decision: punktfunk-core is one crate, compiled once, linked by every host and client through a C ABI. Protocol logic, FEC, packet pacing, jitter buffering, pairing, and crypto live there and exist exactly once. Platform code (capture, encode, decode, present, input, UI) lives outside the core and is written in whatever language suits the platform.


3. Protocol strategy (three phases)

| Phase | Protocol | Clients that work | Bitrate ceiling | Purpose |
|---|---|---|---|---|
| P1 | GameStream-compatible (existing Moonlight wire format) | All existing Moonlight/Artemis clients | ~1 Gbps (legacy GF(2⁸) FEC) | Ship the Linux virtual-display win with zero client work |
| P2 | punktfunk/1 negotiated extension: GF(2¹⁶) FEC, multi-block framing, optional QUIC control | punktfunk clients only; falls back to P1 for others | Multi-Gbps | Break the wall; introduce native clients |
| P3 | punktfunk/1 as primary; GameStream kept as compat shim | punktfunk everywhere, Moonlight as fallback | Multi-Gbps | Full control of features (mic passthrough, per-client identity, HDR signalling) |

Negotiation: extend the serverinfo/RTSP SETUP handshake with a capability flag. Old clients never see the flag and get P1 behavior. This is how Apollo/Artemis diverge cleanly, and it keeps you compatible while you build.
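
The fallback rule can be sketched in a few lines. The flag name, its bit position, and the idea of carrying capabilities as a `u32` bitmask are all hypothetical; the plan only says that a capability flag is added to the handshake.

```rust
/// Hypothetical capability bit a punktfunk-aware peer would advertise.
const CAP_PUNKTFUNK_1: u32 = 1 << 0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WireProtocol {
    /// P1: the existing Moonlight wire format.
    GameStream,
    /// P2+: the negotiated punktfunk/1 extension.
    Punktfunk1,
}

/// Pick the wire protocol for a session.
/// `client_caps` is `None` when the client sent no capability field at all,
/// which is what a stock Moonlight client would do.
fn negotiate(host_caps: u32, client_caps: Option<u32>) -> WireProtocol {
    match client_caps {
        // Both sides must advertise the bit before the extension is used.
        Some(caps) if (caps & host_caps & CAP_PUNKTFUNK_1) != 0 => WireProtocol::Punktfunk1,
        // Missing field, unknown bits, or a host without P2: stay on P1.
        _ => WireProtocol::GameStream,
    }
}
```

The design point is that the default arm is always P1, so an old client, or a new client talking to an old host, never has to know the extension exists.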


4. Tech stack (settled)

Language split: Rust for the core and all non-Apple platform code; Swift only for the macOS/iOS client UI + VideoToolbox/Metal; Kotlin for Android UI + MediaCodec. The C ABI is the seam.

Threading: native OS threads for the video hot path. tokio is allowed only for the control plane (pairing, web config, QUIC control stream). The per-frame pipeline must never touch an async runtime.

Core crate dependencies

| Concern | Crate | Notes |
|---|---|---|
| FEC | reed-solomon-simd (v3+) | Leopard/GF(2¹⁶), SIMD, O(n log n); the wall-breaker |
| QUIC (control/audio) | quinn | Datagram ext for audio; reliable streams for control |
| TLS / crypto | rustls + ring (or aws-lc-rs) | Pairing, session keys (AES-GCM to match GameStream in P1) |
| Serialization | zerocopy / bytes | Wire structs #[repr(C)], zero-copy parse |
| C header gen | cbindgen | Generates punktfunk_core.h from the ABI module |
| Error/log | tracing | Structured; feature-gate off the hot path |

Linux host dependencies

| Concern | Crate / API | Notes |
|---|---|---|
| Capture | pipewire (pipewire-rs) | ScreenCast portal stream → dmabuf |
| Portal / DBus | ashpd + zbus | xdg-desktop-portal: ScreenCast, RemoteDesktop |
| Encode | ffmpeg-next or rsmpeg | VAAPI / NVENC, dmabuf import (zero-copy) |
| Input inject | reis (libei) + input-linux (uinput fallback) | Wayland-native first, uinput as universal fallback |
| Virtual output | per-compositor (see §6) | KWin DBus / Sway create_output / Mutter DBus |
| Web config | axum + tokio + small Vite/React UI | You own this stack already |

Apple client (P2+)

Swift + VideoToolbox (decode) + Metal (present) + SwiftUI. Imports punktfunk_core.h directly via a module map — no glue layer.

Ruled out

  • Swift for the host/core: no Linux Wayland/PipeWire/DRM/VAAPI ecosystem; ARC in hot loops. (Excellent Apple-client language, wrong for systems/Linux.)
  • Go: GC disqualifies the hot path.
  • C++: throws away the safety/concurrency wins that justified greenfield over forking.
  • Zig: best-in-class C interop, but pre-1.0 with no Wayland/QUIC ecosystem — too much risk for a multi-month build. Revisit later if desired.

5. The C ABI boundary

Design it on day one; retrofitting an ABI is painful.

Principles

  • Opaque handles only across the boundary: PunktfunkSession*, never Rust types.
  • All cross-boundary structs are #[repr(C)]; primitives + pointer/len pairs for buffers.
  • Async events via registered C callbacks (fn ptr + void* userdata).
  • Explicit, documented ownership: who frees what, when. Provide punktfunk_*_free for every allocation that crosses out.
  • Versioned ABI: uint32_t punktfunk_abi_version(void) + a PunktfunkConfig struct whose first field is its own size for forward-compat.
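
The size-first config trick from the last principle can be sketched as follows. The field names other than the leading size, the `read_config` helper, and the version constant are illustrative assumptions; in the real ABI module the exported function would also carry `#[no_mangle]`.

```rust
use std::mem::size_of;
use std::ptr;

/// Sketch of a size-prefixed config. Fields after `struct_size` are invented
/// for illustration; `max_bitrate_kbps` stands in for a later ABI addition.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct PunktfunkConfig {
    pub struct_size: u32, // must stay the first field forever
    pub width: u32,
    pub height: u32,
    pub refresh_hz: u32,
    pub max_bitrate_kbps: u32,
}

pub const PUNKTFUNK_ABI_VERSION: u32 = 1;

pub extern "C" fn punktfunk_abi_version() -> u32 {
    PUNKTFUNK_ABI_VERSION
}

/// Copy a caller-supplied config into the core's current layout.
/// Older callers pass a shorter struct: missing tail fields stay zeroed.
/// Newer callers pass a longer one: the unknown tail is ignored.
///
/// Safety: `cfg` must be null or point to at least `struct_size` readable bytes.
pub unsafe fn read_config(cfg: *const PunktfunkConfig) -> Option<PunktfunkConfig> {
    if cfg.is_null() {
        return None;
    }
    // Read only the size prefix before trusting anything else.
    let claimed = unsafe { ptr::read_unaligned(cfg as *const u32) } as usize;
    if claimed < size_of::<u32>() {
        return None;
    }
    let n = claimed.min(size_of::<PunktfunkConfig>());
    let mut out = PunktfunkConfig::default();
    unsafe {
        ptr::copy_nonoverlapping(cfg as *const u8, &mut out as *mut PunktfunkConfig as *mut u8, n);
    }
    out.struct_size = size_of::<PunktfunkConfig>() as u32;
    Some(out)
}
```

A zeroed tail only works as a default if zero means "unset" for every field added later, which is worth writing down as an ABI rule alongside the ownership rules.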

Minimal surface (sketch)

// lifecycle
PunktfunkSession* punktfunk_session_new(const PunktfunkConfig* cfg);
void              punktfunk_session_free(PunktfunkSession*);

// host: feed an encoded access unit (the core does FEC + packetize + pace + send)
int punktfunk_host_submit_frame(PunktfunkSession*, const uint8_t* data, size_t len,
                                uint64_t pts_ns, PunktfunkFrameFlags flags);

// client: pull a reassembled, FEC-recovered access unit ready to decode
int punktfunk_client_poll_frame(PunktfunkSession*, PunktfunkFrame* out /*borrowed until next poll*/);

// input (both directions): client captures, host receives via callback
int  punktfunk_send_input(PunktfunkSession*, const PunktfunkInputEvent*);
void punktfunk_set_input_callback(PunktfunkSession*, PunktfunkInputCb, void* user);

// stats for the frame-pacing/quality logic and the web UI
void punktfunk_get_stats(PunktfunkSession*, PunktfunkStats* out);

Keep it this small. Everything platform-specific (how you got the encoded bytes, how you decode them) stays on the platform side.
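
On the Rust side, the opaque-handle and callback rules might look roughly like this. It is a sketch only: `punktfunk_session_new` takes no config here to keep it short, the event fields and `dispatch_input` are invented for illustration, and the exported functions would carry `#[no_mangle]` in the real `abi.rs`.

```rust
use std::ffi::c_void;

/// Hypothetical input event; the real layout is not specified in this plan.
#[repr(C)]
pub struct PunktfunkInputEvent {
    pub kind: u32,
    pub code: u32,
    pub value: i32,
}

pub type PunktfunkInputCb = extern "C" fn(*const PunktfunkInputEvent, *mut c_void);

/// Opaque to C: callers only ever hold a pointer to it.
pub struct PunktfunkSession {
    input_cb: Option<PunktfunkInputCb>,
    user: *mut c_void,
}

/// Ownership passes to the caller until punktfunk_session_free.
pub extern "C" fn punktfunk_session_new() -> *mut PunktfunkSession {
    Box::into_raw(Box::new(PunktfunkSession { input_cb: None, user: std::ptr::null_mut() }))
}

/// Safety: `s` must be null or a pointer returned by punktfunk_session_new, freed once.
pub unsafe extern "C" fn punktfunk_session_free(s: *mut PunktfunkSession) {
    if !s.is_null() {
        unsafe { drop(Box::from_raw(s)) };
    }
}

/// Safety: `s` must be null or a live session; `user` must outlive the registration.
pub unsafe extern "C" fn punktfunk_set_input_callback(
    s: *mut PunktfunkSession,
    cb: PunktfunkInputCb,
    user: *mut c_void,
) {
    if let Some(sess) = unsafe { s.as_mut() } {
        sess.input_cb = Some(cb);
        sess.user = user;
    }
}

/// Core-internal: hand a received event to the registered callback, if any.
fn dispatch_input(sess: &PunktfunkSession, ev: &PunktfunkInputEvent) -> bool {
    match sess.input_cb {
        Some(cb) => {
            cb(ev, sess.user);
            true
        }
        None => false,
    }
}

/// Example callback: treats `user` as a `*mut u32` event counter.
pub extern "C" fn count_events(ev: *const PunktfunkInputEvent, user: *mut c_void) {
    if ev.is_null() || user.is_null() {
        return;
    }
    unsafe { *(user as *mut u32) += 1 };
}
```

The `Box::into_raw` / `Box::from_raw` pair is what makes "who frees what" enforceable: the only way to release a session is the matching free function.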


6. Virtual display orchestration

This is the differentiator and the most fragmented part. Two deployment models — support both eventually, pick one for the MVP.

Model A — Attach to the running session. Create a client-sized virtual output inside the user's live desktop, stream it, tear it down on disconnect. This is "add a monitor to my actual PC." Best UX, hardest because it depends on per-compositor runtime APIs.

Model B — Dedicated headless session. Spawn a separate headless compositor purely for the stream (e.g. gnome-shell --headless --virtual-monitor WxH, or a headless wlroots compositor). Cleaner isolation, sidesteps runtime-output APIs, ideal for "remote second PC." Worse for "mirror/extend my real desktop."

Per-compositor (Model A) runtime virtual-output creation:

  • KWin / Plasma 6 (recommended MVP target — matches your CachyOS/KDE daily driver and where the gap is loudest): KWin can create virtual outputs; KRdp already does this internally for remote sessions. Drive it via the KWin DBus interface; capture via xdg-desktop-portal-kde ScreenCast (PipeWire); inject input via the RemoteDesktop portal or reis.
  • wlroots (Sway/Hyprland — fastest to prototype the pipeline): enable the headless backend (WLR_BACKENDS=…,headless), then swaymsg create_output / hyprctl output create headless. Capture via wlr-screencopy or the portal. Simplest API; good for validating capture→encode→send before fighting KWin/Mutter.
  • Mutter / GNOME: virtual monitors via the headless backend; runtime creation via Mutter DBus (org.gnome.Mutter.* — partly experimental). Capture via xdg-desktop-portal-gnome ScreenCast.

Recommendation: do a 1–2 day wlroots spike to prove the pipeline, then build the real MVP on KWin because that's your deployment target. Abstract virtual-output creation behind a trait so compositors are pluggable:

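// Mode, OutputHandle, and the single-parameter Result alias are assumed to be defined elsewhere.
trait VirtualDisplay {
    /// Create a virtual output sized to the client's requested mode.
    fn create(&self, mode: Mode) -> Result<OutputHandle>;
    /// Tear the output down again (on disconnect).
    fn destroy(&self, h: OutputHandle) -> Result<()>;
}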
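
To show how a backend plugs into that trait, here is a self-contained sketch with an in-memory mock. `Mode`, `OutputHandle`, `VdError`, and `MockDisplay` are all hypothetical stand-ins (the plan does not define them), and the error type is spelled out instead of relying on a `Result` alias. A real KWin or wlroots implementation would talk to the compositor where the mock touches a map.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Mode {
    pub width: u32,
    pub height: u32,
    pub refresh_hz: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct OutputHandle(u32);

#[derive(Debug, PartialEq, Eq)]
pub enum VdError {
    InvalidMode,
    UnknownOutput,
}

pub trait VirtualDisplay {
    fn create(&self, mode: Mode) -> Result<OutputHandle, VdError>;
    fn destroy(&self, h: OutputHandle) -> Result<(), VdError>;
}

/// In-memory backend for tests: (next id, live outputs).
#[derive(Default)]
pub struct MockDisplay {
    state: Mutex<(u32, HashMap<OutputHandle, Mode>)>,
}

impl MockDisplay {
    pub fn live_outputs(&self) -> usize {
        self.state.lock().unwrap().1.len()
    }
}

impl VirtualDisplay for MockDisplay {
    fn create(&self, mode: Mode) -> Result<OutputHandle, VdError> {
        if mode.width == 0 || mode.height == 0 || mode.refresh_hz == 0 {
            return Err(VdError::InvalidMode);
        }
        let mut st = self.state.lock().unwrap();
        st.0 += 1;
        let handle = OutputHandle(st.0);
        st.1.insert(handle, mode);
        Ok(handle)
    }

    fn destroy(&self, h: OutputHandle) -> Result<(), VdError> {
        // Destroying twice is an error, so leaks and double-frees show up in tests.
        self.state.lock().unwrap().1.remove(&h).map(|_| ()).ok_or(VdError::UnknownOutput)
    }
}
```

A mock like this lets the connect/disconnect lifecycle be tested in CI without any compositor running, which matters given how unstable the real runtime APIs are expected to be.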

7. The hot path: pipeline & latency budget

Per-frame pipeline, each stage on its own thread, connected by bounded SPSC channels (drop-oldest on overflow, never block the encoder):

capture(dmabuf) → encode(NVENC/VAAPI) → core[FEC+packetize+pace+send]
                                                      │ network
client: recv → core[reorder+FEC recover+jitter] → decode → present
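
The drop-oldest behaviour between stages can be sketched with the standard library alone. Note the swap: this uses a `Mutex` plus `Condvar` for brevity, not the lock-free SPSC ring a production hot path would likely want, and the type name `DropOldest` is invented here.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Bounded queue that evicts the oldest item instead of blocking the producer.
pub struct DropOldest<T> {
    inner: Mutex<VecDeque<T>>,
    ready: Condvar,
    cap: usize,
}

impl<T> DropOldest<T> {
    pub fn new(cap: usize) -> Arc<Self> {
        assert!(cap > 0, "capacity must be non-zero");
        Arc::new(DropOldest {
            inner: Mutex::new(VecDeque::with_capacity(cap)),
            ready: Condvar::new(),
            cap,
        })
    }

    /// Producer side: never waits for space. Returns the evicted item, if any,
    /// so the caller can count drops.
    pub fn push(&self, item: T) -> Option<T> {
        let mut q = self.inner.lock().unwrap();
        let evicted = if q.len() == self.cap { q.pop_front() } else { None };
        q.push_back(item);
        self.ready.notify_one();
        evicted
    }

    /// Consumer side: blocks until an item is available.
    pub fn pop(&self) -> T {
        let mut q = self.inner.lock().unwrap();
        loop {
            if let Some(item) = q.pop_front() {
                return item;
            }
            q = self.ready.wait(q).unwrap();
        }
    }
}

/// A fast producer on its own thread with a capacity-1 queue: once it has
/// finished, only the newest frame number is left for the consumer.
pub fn demo_latest_frame() -> u32 {
    let queue = DropOldest::new(1);
    let tx = Arc::clone(&queue);
    let producer = thread::spawn(move || {
        for frame in 0..100u32 {
            tx.push(frame);
        }
    });
    producer.join().unwrap();
    queue.pop()
}
```

Returning the evicted item from `push` is a deliberate choice: dropped frames are exactly the stall metric the measurement harness needs to surface.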

Glass-to-glass budget (LAN, 240 Hz = 4.17 ms/frame):

| Stage | Target | Notes |
|---|---|---|
| Capture latency | ≤ 1 frame | dmabuf, no copy to CPU |
| Encode | 1–4 ms | NVENC low-latency preset; tune lookahead off |
| FEC + packetize | < 1 ms | SIMD RS; pre-allocated shard buffers |
| Network (LAN) | < 1 ms | sendmmsg / UDP GSO to cut syscalls |
| Jitter buffer | 0–1 frame | adaptive; minimum that hides observed jitter |
| FEC recover + reassemble | < 1 ms | only when loss occurs |
| Decode | 1–4 ms | hardware decoder |
| Present | ≤ 1 frame | align to client vsync |

Target: 15–35 ms glass-to-glass on LAN. The art is frame pacing — matching capture/encode cadence to the client's actual refresh and keeping the jitter buffer as small as the link allows. This, not the codec, is what separates good from bad streaming. Budget real time for it.
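
As a sanity check on the table, the per-stage targets can simply be summed. This sketch reads "≤ 1 frame" and "0–1 frame" as 0 to one frame time and "< 1 ms" as 0 to 1 ms; those readings are assumptions, and anything the table does not list (display scanout, input sampling) is not modelled.

```rust
/// Sum the best-case and worst-case stage targets, in milliseconds,
/// for a given client refresh rate.
fn budget_ms(refresh_hz: f64) -> (f64, f64) {
    let frame = 1000.0 / refresh_hz;
    // (stage, best case, worst case)
    let stages = [
        ("capture", 0.0, frame),
        ("encode", 1.0, 4.0),
        ("fec + packetize", 0.0, 1.0),
        ("network", 0.0, 1.0),
        ("jitter buffer", 0.0, frame),
        ("fec recover", 0.0, 1.0),
        ("decode", 1.0, 4.0),
        ("present", 0.0, frame),
    ];
    stages
        .iter()
        .fold((0.0, 0.0), |(lo, hi), stage| (lo + stage.1, hi + stage.2))
}
```

At 240 Hz the listed stages add up to between 2 ms and 23.5 ms, so the worst case sits inside the 15–35 ms target with room for whatever the table leaves out. At 60 Hz the same table allows up to 61 ms, because three of the stages scale with frame time.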

Throughput math to keep honest: 5120×1440@240 ≈ 1.77 Gpx/s. At 0.5 bpp that's ~885 Mbps; 0.6 bpp ≈ 1.06 Gbps; 0.8 bpp (4:4:4 headroom) ≈ 1.4 Gbps. The GF(2¹⁶) FEC + multi-block framing must sustain these without the per-frame shard count being the limiter — which it no longer is once you leave GF(2⁸).
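
The arithmetic behind those figures is short enough to keep as a helper next to the benchmarks. The function names are illustrative.

```rust
/// Pixels per second for a given mode.
fn pixel_rate(width: u64, height: u64, fps: u64) -> u64 {
    width * height * fps
}

/// Bitrate in Mbps (10^6 bits per second) at a given bits-per-pixel budget.
fn bitrate_mbps(pixels_per_second: u64, bits_per_pixel: f64) -> f64 {
    pixels_per_second as f64 * bits_per_pixel / 1e6
}
```

`pixel_rate(5120, 1440, 240)` is 1,769,472,000 px/s, and the three budgets come out at about 885, 1,062, and 1,416 Mbps, matching the rounded numbers above.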


8. Milestones

Sizing is rough and relative (Spike / S / M / L) for a focused solo dev; treat as ordering, not deadlines.

M0 — Pipeline spike (S). wlroots headless output → PipeWire capture → VAAPI/NVENC encode → dump H.265 to a file that plays. Acceptance: a valid encoded file from a virtual output, no streaming yet. Proves the Linux capture+encode chain end-to-end.

M1 — punktfunk-core skeleton + C ABI (M). Session lifecycle, GameStream-compatible packetization and GF(2⁸) FEC (P1), AES-GCM, cbindgen header, a tiny C test harness. Acceptance: core links from C; round-trips packets in a loopback test with simulated loss.

M2 — P1 host: stream to stock Moonlight (L). Wire M0's pipeline into the core; implement serverinfo/pairing/RTSP enough for a real Moonlight client to connect, with a KWin virtual output created on connect and destroyed on disconnect. Input via reis/uinput. Acceptance: you play a game on your KDE box streamed to a stock Moonlight client on a virtual display, no dummy plug, no kernel args. This is the shippable milestone and the project's reason to exist.

M3 — Measurement harness (S). Glass-to-glass latency measurement (on-screen QR/timestamp or photodiode), packet-loss injection, frame-pacing and stall metrics surfaced in the web UI. Acceptance: you can quantify a regression. Build this before optimizing anything.
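
"Quantify a regression" can be as small as comparing tail latency between two runs. A minimal sketch, assuming latency samples in milliseconds, the nearest-rank percentile definition, and a made-up tolerance parameter:

```rust
/// Nearest-rank percentile of `samples`; `p` is in 0..=100.
/// Returns None for an empty slice or an out-of-range `p`.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest value with at least p% of samples at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// True when the candidate run's p99 is worse than the baseline's by more
/// than `tolerance_ms`. None if either run has no samples.
fn p99_regressed(baseline: &[f64], candidate: &[f64], tolerance_ms: f64) -> Option<bool> {
    let base = percentile(baseline, 99.0)?;
    let cand = percentile(candidate, 99.0)?;
    Some(cand > base + tolerance_ms)
}
```

Gating on p99 instead of the mean is the point: a pacing bug usually shows up as occasional long frames well before it moves the average.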

M4 — P2 transport: break the wall (L). Add punktfunk/1 negotiation; swap to reed-solomon-simd GF(2¹⁶) with multi-block per-frame framing; optional QUIC control/audio. Write a minimal Rust reference client (decode via VAAPI, present via wgpu/Vulkan) to exercise it. Acceptance: a stable stream above 1.4 Gbps at 5120×1440@240 with loss recovery working; latency unchanged vs. M2.

M5 — Apple client (L). Swift + VideoToolbox + Metal + SwiftUI, linking punktfunk-core via the C header. Acceptance: the Mac Studio plays a stream at native resolution/refresh.

M6 — Feature surface (M, ongoing). Mic passthrough as a proper encrypted, per-client reverse audio stream (the thing the upstream PR got wrong); HDR signalling; per-client identity/permissions; pause/resume. Acceptance: feature parity with Apollo on the items you care about, plus mic done right.


9. Risk register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| KWin runtime virtual-output API is undocumented/unstable | High | High | Spike on wlroots first to de-risk the pipeline; study KRdp's source for the KWin path; keep VirtualDisplay pluggable so a stuck compositor doesn't block the project |
| Wayland input injection gaps (libei still evolving) | Med | Med | uinput fallback always available; reis for the Wayland-native path |
| dmabuf → encoder zero-copy import quirks per GPU/driver | High | Med | Validate on your actual NVIDIA + AMD hardware early (M0); have a CPU-copy fallback path |
| Encoder/decoder can't sustain 1.77 Gpx/s @ 240 | Med | High | Measure in M0/M4 on real silicon; this is a hardware ceiling no rewrite fixes, so discover it before P2, not after |
| Frame pacing eats more time than expected | High | Med | M3 measurement harness first; treat pacing as a first-class subsystem, not a polish step |
| Scope creep into a full Moonlight replacement | High | High | P1 (stock-client compat) is the firewall: it forces you to ship value before writing a client |
| Solo bandwidth vs. your other projects (ENRW thesis, played) | High | Med | M2 is a complete, useful artifact on its own; the plan is safe to pause after any milestone |

10. Testing & measurement

  • Loopback correctness: core encodes→FEC→loss-inject→recover→decode in-process; property tests over loss patterns and shard counts (proptest).
  • Glass-to-glass latency: rendered timestamp/QR on host, read back on client capture; or a photodiode for true photons. Track p50/p99.
  • Loss resilience: tc netem to inject loss/jitter/reorder; verify FEC recovery and graceful degradation.
  • Pacing: log present timestamps vs. client vsync; alert on stalls and duplicate/dropped frames.
  • Soak: multi-hour streams; watch for buffer growth, fd leaks, encoder session exhaustion.
  • Hardware matrix: your NVIDIA box (NVENC), an AMD/Intel box (VAAPI), Mac Studio (VideoToolbox decode). Catch driver quirks early.
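
The loopback test in the first bullet has a simple shape: encode, drop shards, recover, compare. The sketch below shows that shape with a deliberately swapped-in code: a single XOR parity shard, which can rebuild exactly one lost shard. It is not Reed–Solomon and not reed-solomon-simd, and it loops over loss positions by hand where the plan calls for proptest. All function names are illustrative.

```rust
/// XOR of all data shards; stands in for the real FEC encoder.
pub fn xor_parity(shards: &[Vec<u8>]) -> Vec<u8> {
    let len = shards.first().map_or(0, |s| s.len());
    let mut parity = vec![0u8; len];
    for shard in shards {
        assert_eq!(shard.len(), len, "shards must be equal length");
        for (p, b) in parity.iter_mut().zip(shard) {
            *p ^= *b;
        }
    }
    parity
}

/// Rebuild at most one missing shard in place. Returns false when more
/// shards are missing than a single parity shard can recover.
pub fn recover(received: &mut [Option<Vec<u8>>], parity: &[u8]) -> bool {
    let missing: Vec<usize> = received
        .iter()
        .enumerate()
        .filter(|(_, s)| s.is_none())
        .map(|(i, _)| i)
        .collect();
    match missing.as_slice() {
        [] => true,
        [idx] => {
            // parity XOR every surviving shard equals the lost shard.
            let mut rebuilt = parity.to_vec();
            for shard in received.iter().flatten() {
                for (r, b) in rebuilt.iter_mut().zip(shard) {
                    *r ^= *b;
                }
            }
            received[*idx] = Some(rebuilt);
            true
        }
        _ => false,
    }
}

/// Loopback check: drop each shard in turn and confirm bit-exact recovery.
pub fn survives_any_single_loss(shards: &[Vec<u8>]) -> bool {
    let parity = xor_parity(shards);
    (0..shards.len()).all(|lost| {
        let mut rx: Vec<Option<Vec<u8>>> = shards.iter().cloned().map(Some).collect();
        rx[lost] = None;
        recover(&mut rx, &parity) && rx.iter().flatten().eq(shards.iter())
    })
}
```

With the real codec the harness keeps the same structure; only `xor_parity` and `recover` are replaced, and the loss patterns grow from "one shard" to every pattern up to the recovery-shard count.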

11. Repo / workspace structure

punktfunk/
├── Cargo.toml                # workspace
├── crates/
│   ├── punktfunk-core/           # protocol, FEC, pacing, crypto — C ABI (cdylib + staticlib)
│   │   ├── src/abi.rs        # #[no_mangle] extern "C" surface
│   │   ├── src/fec.rs        # GF(2^16) blocking over reed-solomon-simd
│   │   ├── src/transport/    # udp+fec video, quinn control/audio
│   │   ├── src/protocol/     # gamestream-compat (P1) + punktfunk/1 (P2)
│   │   └── cbindgen.toml
│   ├── punktfunk-host/           # Linux host binary
│   │   ├── src/capture/      # pipewire / portal
│   │   ├── src/encode/       # ffmpeg vaapi/nvenc
│   │   ├── src/vdisplay/     # trait + kwin/wlroots/mutter impls
│   │   ├── src/input/        # reis + uinput
│   │   └── src/web/          # axum config/pairing API
│   └── punktfunk-client-rs/      # reference Rust client (M4)
├── clients/
│   ├── apple/                # Swift package, imports punktfunk_core.h (M5)
│   └── android/              # Kotlin + JNI (later)
├── include/                  # generated punktfunk_core.h
└── tools/
    ├── latency-probe/
    └── loss-harness/

12. Immediate next actions (first week)

  1. Stand up the workspace with punktfunk-core (empty ABI + cbindgen) and punktfunk-host skeletons; CI on your Gitea (you already have BuildKit pipelines).
  2. M0 spike on wlroots: headless output → PipeWire capture → NVENC/VAAPI encode → playable file. This validates the riskiest pipeline assumptions in days, on your real GPU.
  3. Read KRdp's source for how KDE creates virtual outputs and casts them — it's the closest existing reference for the KWin path you'll need in M2.
  4. Decide P1 protocol depth: confirm exactly which serverinfo/RTSP/pairing messages a current Moonlight client requires for a successful connect, so M2's compat surface is scoped precisely (this is also the question to take back to the dev who mentioned the 1G limit).

The shape of the bet: M2 alone (virtual-display streaming to stock Moonlight clients on Linux) is a complete, useful, gap-filling release. Everything after it (the wall-breaking transport, native clients, mic-done-right) is upside you unlock from a position of having already shipped. The hard transport work then rests on a FEC core that removes the 1 Gbps ceiling instead of hacking around it.
