Hacker News | past | comments | ask | show | jobs | submit | zackangelo's comments

They are but the IDE needs to be integrated with them.

Qwen specifically calls out FIM ("fill in the middle") support on the model card, and you can see it getting confused and emitting the control tokens in the example here.


Oh, that’s interesting. Thanks for the correction. I didn’t know such heavily post-trained models could still do good ol’ fashioned autocomplete.

17b per token. So when you’re generating a single stream of text (“decoding”) 17b parameters are active.

If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).

When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.
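For intuition, here's a toy sketch of the arithmetic. The 3B/7B split and expert counts below are made up purely for illustration; they are not any real model's config:

```python
# Illustrative arithmetic for a sparse MoE's per-token active parameters.
# All sizes here are hypothetical, not a real model's configuration.
def active_params(shared, experts_per_token, expert_size):
    """Parameters touched when decoding a single token."""
    return shared + experts_per_token * expert_size

# Hypothetical split: 3B always-on parameters (attention, embeddings,
# router) plus 2 routed experts of 7B each -> 17B active per token.
assert active_params(3e9, 2, 7e9) == 17e9

# During prefill, different tokens may route to different experts, so the
# union of experts touched grows with the number of prompt tokens.
def touched_experts(routes_per_token):
    return set().union(*routes_per_token)

# Three prompt tokens routing to 2 experts each can touch up to 6 experts.
assert touched_experts([{0, 1}, {2, 3}, {1, 4}]) == {0, 1, 2, 3, 4}
```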


This uses NVIDIA’s CUDA snapshot API under the hood, but you have to pair it with a host-side snapshot as well. Modal uses gVisor for this, which is notoriously high overhead.

Does anyone know of a more efficient alternative if you’re running a trusted container?


Post author here: there are other projects that create a proxy for CUDA calls and use the log of CUDA operations to checkpoint/restore or live-migrate tasks. We haven’t used them, and I don’t believe they are very popular or used outside specific orgs.

This is the only API available for snapshotting NVIDIA GPU memory, afaik.

As for needing to combine it with a host memory snapshot step, this is required because CUDA sessions need to be mapped to a host process, so you need to snapshot both things in order for the program to be restored correctly.

CRIU is another project that uses the same technique (CUDA snapshot + host memory snapshot). Unlike CRIU, our snapshots work at the function level, so we’re able to take snapshots after functions have been initialized (including GPU memory), making Modal cold boots fast. With CRIU alone, you’d have to implement that entire process yourself.
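For a trusted container, one rough sketch of the two-step flow pairs NVIDIA's open source `cuda-checkpoint` utility with plain CRIU on the host. This is not Modal's implementation; the PID and image directory are placeholders:

```shell
# Rough sketch: move GPU state into host memory with cuda-checkpoint,
# then snapshot the (now GPU-free) host process with CRIU.
PID=12345               # placeholder: the CUDA process to snapshot
IMG=/tmp/ckpt-images    # placeholder: CRIU image directory
mkdir -p "$IMG"

# 1. Toggle the process into its checkpointed state: device memory is
#    copied into host memory and GPU resources are released.
cuda-checkpoint --toggle --pid "$PID"

# 2. Snapshot the host process tree with CRIU.
criu dump --tree "$PID" --images-dir "$IMG" --shell-job

# Later: restore the host process, then toggle CUDA state back onto the GPU.
criu restore --images-dir "$IMG" --shell-job
cuda-checkpoint --toggle --pid "$PID"
```

Both steps are needed for the reason above: the restored CUDA session has to map back onto a live host process.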


Sparks are built for this and actually have ConnectX-7 NICs built in! You just need to get the SFPs for them. This means you can natively cluster them at 200Gbps.


That doesn't answer the question, which was how to get a high-speed interconnect between a Mac and a DGX Spark. The most likely solution would be a Thunderbolt PCIe enclosure and a 100Gb+ NIC, and passive DAC cables. The tricky part would be macOS drivers for said NIC.


You’re right, I misunderstood.

I’m not sure if it would be of much utility because this would presumably be for tensor parallel workloads. In that case you want the ranks in your cluster to be uniform or else everything will be forced to run at the speed of the slowest rank.

You could run pipeline parallel but not sure it’d be that much better than what we already have.



No, you use tensor parallelism in both cases.

The way it typically works in an attention block is: smaller portions of the Q, K and V linear layers are assigned to each node and processed independently. Attention, RoPE, norms, etc. are run on the node-specific output of that. Then, when the output linear layer is applied, an "all-reduce" is computed which combines the output of all the nodes.

EDIT: just realized it wasn't clear -- this means that each node ends up holding the portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA, where there are fewer KV heads than ranks, you end up having to do some replication).
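Here's a toy numpy sketch of that sharding (one head per rank, no RoPE or norms, made-up sizes). Each rank owns one head's columns of Q/K/V and the matching rows of the output projection, and the all-reduce is just a sum of rank-local partials:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, seq = 8, 4, 5
d_head = d_model // n_heads

x  = rng.normal(size=(seq, d_model))
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))
Wo = rng.normal(size=(d_model, d_model))

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

# Reference: full (unsharded) multi-head attention.
heads = []
for h in range(n_heads):
    cols = slice(h * d_head, (h + 1) * d_head)
    heads.append(attention(x @ Wq[:, cols], x @ Wk[:, cols], x @ Wv[:, cols]))
full = np.concatenate(heads, -1) @ Wo

# Tensor parallel: each "rank" holds one head's Q/K/V columns plus the
# matching rows of Wo; the all-reduce is a plain sum of partial outputs.
partials = []
for rank in range(n_heads):
    cols = slice(rank * d_head, (rank + 1) * d_head)
    out_h = attention(x @ Wq[:, cols], x @ Wk[:, cols], x @ Wv[:, cols])
    partials.append(out_h @ Wo[cols, :])   # rank-local partial output
tp = sum(partials)                         # the "all-reduce" (sum)

assert np.allclose(full, tp)
```

Note each rank only ever computes its own head's K and V, which is why the KV cache naturally ends up sharded across ranks.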


I usually call it "head parallelism" (a type of tensor parallelism, but parallelized for small clusters and specific to attention). That is what you described: shard the input tensor by number of heads and send each shard to its respective Q, K, V shard. Each shard can do the Q/K/V projections, RoPE, QK norm, and attention entirely locally. The out projection is done in that shard too, but then an all-reduce sum is needed amongst shards to get the final out projection broadcast to every participating shard, which then carries on to do whatever else itself.

What I am asking, however, is whether that will speed up decoding as linearly as it does prefilling.


Right, my comment was mostly about decoding speed. For prefill you can get a speed-up too, but there you are less latency-bound.

In our benchmarks with MLX / mlx-lm it's as much as 3.5x for token generation (decoding) at batch size 1 over 4 machines. In that case you are memory bandwidth bound, so sharding the model and KV cache four ways means each machine only needs to read a quarter as much memory.


Oh! That's great to hear. Congrats! Now, I want to get the all-to-all primitives ready in s4nnc...


What 1T parameter base model have you seen from any of those labs?


It's MoE; each expert tower can be branched from some smaller model.


That's not how MoE works: you need to train the expert FFNs directly, or else the FFN gate would have no clue how to activate the experts.


Wouldn't you be able to test nccl if you had 2 of these?


What kind of NCCL testing are you thinking about? Always curious what’s hardest to validate in people’s setups.


Not with Mac Studio(s), but yes: multi-host NCCL over RoCE with two DGX Sparks, or over PCIe with one.


Just a bit of feedback:

> Instead of one brittle giant, we orchestrate a Mixture of Experts…

“Mixture of experts” is a specific term of art that describes an architectural detail of a type of transformer model. It definitely does not mean using smaller specialized models for individual tasks: experts in an MoE model are routed to on a per-token basis, not on a per-task or per-generation basis.
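A minimal sketch of what per-token routing actually looks like. The gate is just a learned linear layer over the hidden state, trained jointly with the experts; the weights below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# The router ("gate") is a learned linear layer; random stand-in here.
W_gate = rng.normal(size=(d_model, n_experts))

def route(token_hidden):
    """Pick the top-k experts for ONE token's hidden state."""
    logits = token_hidden @ W_gate
    return set(np.argsort(logits)[-top_k:])

# Routing happens per token: different tokens in the same prompt/"task"
# generally land on different experts.
tokens = rng.normal(size=(6, d_model))
choices = [route(t) for t in tokens]
assert all(len(c) == top_k for c in choices)
assert len(set().union(*choices)) > top_k  # multiple expert sets in use
```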

I know it’s tempting to co-opt this term because it would fit nicely for what you’re trying to do but it just adds confusion.


I hear you, and that's valid. "Mixture of models" is probably a better phrase. We are constantly moving between AI experts and very skilled developers who use OpenAI endpoints and call it AI, so we are constantly working on finding the correct language. This was a miss though - will do better :)


Because it depends on how much better “best” is. If it’s only incrementally better than open source models that have other advantages, why would you bother?

OpenAI’s moat will only come from the products they build on top. Theoretically their products will be better because they’ll be more vertically integrated with the underlying models. It’s not unlike Apple’s playbook with regard to hardware and software integration.


Not quite a frontier model but definitely built by a frontier lab: Grok 2 was recently open sourced and I believe it uses a fairly standard MHA architecture with MoE.

