Hacker News | frabonacci's comments

Hi HN, Francesco from Cua here. I hacked this together over a weekend after getting curious about whether macOS could support real background computer-use outside a single vendor's agent product.

The first thing we are using it for is recording product demos. We used to use Screen Studio; now we ask Claude Code + cua-driver to drive the app while cua-driver recording start captures the trajectory: screenshots, actions, and click markers. Canceling our Screen Studio subscription started as a joke and then became true.

The problem: most GUI agents still assume the desktop has one shared cursor, one focused app, and one human who is okay being interrupted. That makes local desktop agents awkward. The agent can do the task, but it steals your screen while doing it.

cua-driver is our attempt to make background computer-use a commodity primitive for macOS: let an agent drive a real Mac app while your cursor, focus, and Space stay where they are. The default interface is a CLI, so it is easy to script, easy for coding agents to call from a shell, and still compatible with MCP clients when you want that.

You can try it on macOS 14+:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-d...)"

CLI example:

cua-driver serve &

cua-driver recording start ~/cua-trajectories/demo1

cua-driver launch_app '{"bundle_id":"com.apple.calculator"}'

cua-driver list_windows '{"pid":12345}'

cua-driver get_window_state '{"pid":12345,"window_id":67890}'

cua-driver click '{"pid":12345,"window_id":67890,"element_index":14}'

cua-driver recording stop

The recording command writes turn-NNNNN/ folders with the post-action app state, screenshot, action JSON, and a click.png marker overlay for click-family actions. You can replay a saved run with cua-driver replay_trajectory '{"dir":"~/cua-trajectories/demo1"}', which is useful for regression captures even when you are not trying to make a polished marketing video.
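
To script against a saved run, a loader along these lines works. The turn-NNNNN/ folder layout matches the description above, but the action filename (action.json) is my assumption for illustration, not necessarily cua-driver's actual schema:

```python
import json
import tempfile
from pathlib import Path

def load_trajectory(root: Path) -> list[dict]:
    """Load action records from turn-NNNNN/ folders in recording order."""
    actions = []
    for turn in sorted(root.glob("turn-*")):
        action_file = turn / "action.json"  # assumed filename
        if action_file.is_file():
            actions.append(json.loads(action_file.read_text()))
    return actions

# Build a tiny fake trajectory to exercise the loader.
root = Path(tempfile.mkdtemp())
for i, action in enumerate([
    {"type": "launch_app", "bundle_id": "com.apple.calculator"},
    {"type": "click", "element_index": 14},
]):
    turn = root / f"turn-{i:05d}"
    turn.mkdir()
    (turn / "action.json").write_text(json.dumps(action))

print([a["type"] for a in load_trajectory(root)])  # ['launch_app', 'click']
```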

What made this harder than expected:

- CGEventPost warps the cursor (it goes through the HID stream, same one your physical mouse uses)

- CGEvent.postToPid doesn't warp the cursor, but Chromium silently drops the event at the renderer IPC boundary

- Activating the target first raises the window AND drags you across Spaces on multi-monitor setups

- Electron apps stop keeping useful AX trees alive when their windows are occluded, unless you register the observer through a private remote-aware SPI

The unlock was a private Apple framework called SkyLight. SLEventPostToPid is a sibling of the public per-pid call, but it travels through a WindowServer channel Chromium accepts as trusted. Pair it with yabai's focus-without-raise pattern (two SLPSPostEventRecordTo calls, deliberately skip SLPSSetFrontProcessWithOptions) plus an off-screen primer click at (-1, -1) to tick Chromium's user-activation gate, and the click lands without the window ever raising.

The thing we learned while building it: the primary addressing mode should not be pixels. cua-driver exposes ax, vision, and som (set-of-marks) modes, but element-indexed AX actions are the happy path. Pixels are the fallback for canvas/WebGL/video surfaces. That makes agents much less brittle because they can click "the Send button" instead of guessing coordinates, while still having a screenshot when the AX tree is ambiguous.

Other things we have used it for:

- A dev-loop QA agent that reproduces a visual bug, edits code, rebuilds, and verifies the UI while my editor stays frontmost

- A personal-assistant style flow that sends a Messages reply without switching Spaces

- Pulling visual context from Chrome/Figma/Preview/YouTube windows I am not looking at

Long technical writeup: https://github.com/trycua/cua/blob/main/blog/inside-macos-wi...

I would especially like feedback from people building Mac automation, agent harnesses, MCP clients, or accessibility tooling. If you try it and it breaks on an app you care about, that is useful data.


What's even more alarming is how exploitable GitHub Trending itself is these days. Get the star-to-fork ratio right and you land on the front page, which then pulls in real organic stars.

It reminds me of https://github.com/dockur/windows with its compose-style YAML over QEMU/KVM. The difference I'm seeing is scope: dockur ships curated OS images (Windows/macOS), while holos looks more like a generic single-host VM runner. Is that a fair read? Also curious whether there are any plans to support unattended install processes for other OSes?

True, and a fair read. Holos is a generic runner underneath, and I'm not trying to compete on the curated-OS-image front; the assumption is you bring a cloud image and cloud-init does the rest. Unattended installs for Linux work today through cloud-init (that's most of what the YAML drives). Windows/macOS would need a different path: autounattend.xml injection for Windows, a custom payload for macOS; I haven't built that. If there's interest, a provisioner: type block that picks the right unattend mechanism based on OS isn't crazy, but it's not on the near-term list.
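
Sketching it, something like this (purely hypothetical shape, none of these keys exist in Holos today):

```yaml
# Hypothetical only -- no such keys exist in Holos yet.
vm:
  name: win11-test
  image: ./win11.qcow2
  provisioner:
    type: autounattend          # Windows: inject an autounattend.xml answer file
    answer_file: ./autounattend.xml

# The Linux path would keep using cloud-init:
#   provisioner:
#     type: cloud-init
#     user_data: ./user-data.yaml
```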

Thanks - trajectory export was key for us since most teams want both eval and training data.

On non-determinism: we actually handle this in two ways. For our simulated environments (HTML/JS apps like the Slack/CRM clones), we control the full render state so there's no variance from animations or loading states. For native OS environments, we use explicit state verification before scoring - the reward function waits for expected elements rather than racing against UI timing. Still not perfect, but it filters out most flaky failures.
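
The verify-before-scoring pattern is roughly this (a minimal sketch in plain Python; the real reward functions hook into the environment's state, not a stub like make_slow_element):

```python
import time

def wait_for(predicate, timeout=2.0, interval=0.01):
    """Poll for an expected UI state instead of scoring against a racing render."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

def make_slow_element(ready_after_checks: int):
    """Stand-in for a UI element that only appears after a few polls."""
    state = {"checks": 0}
    def visible() -> bool:
        state["checks"] += 1
        return state["checks"] > ready_after_checks
    return visible

# Reward is granted only once the expected element is confirmed present;
# a timeout is a genuine failure, not a flaky race against animations.
score = 1.0 if wait_for(make_slow_element(3)) else 0.0
print(score)  # 1.0
```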

Windows Arena specifically - we're focusing on common productivity flows (file management, browser tasks, Office workflows) rather than the edge cases you mentioned. UAC prompts and driver dialogs are exactly the hard mode scenarios that break most agents today. We're not claiming to solve those yet, but that's part of why we're open-sourcing this - want to build out more adversarial tasks with the community.


Fair point - we just open-sourced this so benchmark results are coming. We're already working with labs on evals, focusing on tasks that are more realistic than OSWorld/Windows Agent Arena and curated with actual workers. If you want to run your agent on it we'd love to include your results.


Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?


The author's differential testing (2.3M random battles) is great as final validation, but the real lesson here is that modular testing should happen during the port, not after.

1. Port tests first - they become your contract

2. Run unit tests per module before moving on - catches issues like the "two different move structures" early

3. Integration tests at boundaries before proceeding

4. E2e/differential testing as final validation

When you can't read the target language, your test suite is your only reliable feedback. The debugging time spent on integration issues would've been caught earlier with progressive testing.
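
The final differential step is cheap to set up. A toy sketch of the idea (the damage functions here are invented stand-ins, not the article's actual battle engine): run the reference and the port on the same random inputs and count disagreements.

```python
import random

def damage_reference(attack: int, defense: int) -> int:
    """Stand-in for the original engine's formula."""
    return max(1, (attack * 10) // max(1, defense))

def damage_ported(attack: int, defense: int) -> int:
    """Stand-in for the ported implementation under test."""
    return max(1, attack * 10 // max(1, defense))

def differential_test(n: int, seed: int = 0) -> int:
    """Feed both engines identical random inputs; return the mismatch count."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    mismatches = 0
    for _ in range(n):
        a, d = rng.randint(0, 255), rng.randint(0, 255)
        if damage_reference(a, d) != damage_ported(a, d):
            mismatches += 1
    return mismatches

print(differential_test(10_000))  # 0 means the port agrees on this sample
```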


The real lesson... I mean, if all of this took 1 month, the TFA already did amazingly well. Next time they'll do even better, no doubt.


Thanks! On API call visibility - Lume's MCP interface doesn't expose outbound network traffic directly. It's focused on VM lifecycle (create, run, stop) and command execution, not network inspection.

For agent observability, we handle this at the Cua framework level rather than the VM level:

- Agent actions and tool calls are logged via our tracing integration (Laminar, OpenTelemetry)

- You can see the full decision trace - what the agent saw, what it decided, what tools it invoked

- For the "what HTTP requests actually went out" question, proxying is still the right approach. You could configure the VM's network to route through a transparent proxy, or set up mitmproxy inside the VM. We haven't built that into Lume itself since network inspection feels orthogonal to VM management.

That said, it's an interesting idea - exposing a proxy config option in Lume that automatically routes VM traffic through a capture layer. Would that be useful for your workflow?
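
For a sense of what the decision trace captures, here's the shape of the idea as a bare decorator (this is not Laminar's or OpenTelemetry's API, just a minimal self-contained illustration):

```python
import functools
import time

TRACE: list[dict] = []

def traced(fn):
    """Record each tool invocation: name, arguments, result, duration."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = fn(*args, **kwargs)
        TRACE.append({
            "tool": fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
            "seconds": round(time.monotonic() - start, 4),
        })
        return result
    return wrapper

@traced
def run_command(cmd: str) -> str:  # hypothetical tool for the example
    return f"ran: {cmd}"

run_command("uname -a")
print(TRACE[0]["tool"])  # run_command
```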


MDM platforms can skip Setup Assistant, but they require the device to be pre-enrolled in Apple Business Manager before first boot - VMs can't be enrolled in ABM, so those hooks aren't available.

defaults write only works after you have shell access, which means Setup Assistant is already done.

There are tools that modify marker files like .AppleSetupDone via Recovery Mode, but that's mainly for bypassing MDM enrollment on physical Macs - you'd still need to create a valid user account with proper Directory Services entries, keychain, etc.

The VNC + OCR approach is less elegant but works reliably without needing to reverse-engineer macOS internals or rely on undocumented behaviors that might break between versions.


Surely your VNC script is guaranteed to break between versions


Thanks! On graphics - currently it's paravirtualized via Apple's Virtualization Framework, so basic 2D acceleration but no GPU passthrough. Fine for desktop use, web browsing, coding, productivity apps. Wouldn't recommend it for anything GPU-intensive though.

Good news is there are hints of GPU passthrough coming (_VZPCIDeviceConfiguration symbol appeared in Tahoe's Virtualization framework), so that might land in a future macOS release. We're keeping an eye on it.

