Looking at your experiment code, it seems like the retrieval experiments are done with the reconstructed vectors of dimension D rather than the compressed vectors of dimension d, which doesn't yield any direct performance benefit. Later in the post you indicate that the real advantage is that the residuals are more isotropic, so you can quantize the pair (p, V_resid) with less quality degradation, but I don't see any experiments actually verifying that retrieval quality holds up in that setting. It's also not clear to me how you efficiently compute cosine similarity for vectors encoded in this form. Doesn't the V_resid part of the computation require something significantly more complex than a dot product?
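To make that last question concrete, here's a toy sketch. The linear decoder, the dimensions, and the names (Dec, p, v_resid, q) are all my assumptions, not necessarily how poly-AE actually reconstructs:

```python
import numpy as np

# Assume (hypothetically) a linear decoder: x_hat = Dec @ p + v_resid,
# with the code p in R^d and the residual v_resid in R^D.
rng = np.random.default_rng(0)
D_dim, d_dim = 768, 64
Dec = rng.normal(size=(D_dim, d_dim))   # decoder matrix (assumption)
p = rng.normal(size=d_dim)              # compressed code
v_resid = rng.normal(size=D_dim)        # full-dimensional residual
q = rng.normal(size=D_dim)              # query vector

x_hat = Dec @ p + v_resid
cos = q @ x_hat / (np.linalg.norm(q) * np.linalg.norm(x_hat))

# The dot product splits into two dot products (and Dec.T @ q can be
# precomputed once per query), but the norm of x_hat depends on both parts,
# so it isn't a single d-dimensional dot product per document.
split_dot = (Dec.T @ q) @ p + q @ v_resid
assert np.allclose(q @ x_hat, split_dot)
```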
I agree that we don't want to reconstruct the whole vector at retrieval time, and that keeps poly-AE toy-like in its current state, not production ready. My main interest here is just gaining recall percentage points in closed form, and then thinking about how to make it fast. Across the threads I've gotten good intermediate thoughts on the topic that may help me bring it closer to a production-ready form.
This is a neat idea. When I'm looking up models I usually want to see something about the architecture, but also some of the hyperparameters for the specific model---residual dimension, total number of layers, tokenizer configs. There's some of that in the visualization but it's spotty.
The results for Nemotron 3 Nano are hard to parse, and I think actually incorrect: https://hfviewer.com/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-B... I'm guessing this is because the implementation uses layers that are all instances of the same class, with forward passes that branch on the layer type specified at construction time.
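Something like this pattern, I mean (a hypothetical sketch, not the actual Nemotron code; the class and argument names are made up):

```python
import torch.nn as nn

class Block(nn.Module):
    # One block class for the whole stack; the layer type is fixed at
    # construction time and forward() branches on it.
    def __init__(self, d_model, layer_type):
        super().__init__()
        self.layer_type = layer_type
        self.norm = nn.RMSNorm(d_model)  # nn.RMSNorm needs a recent PyTorch
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm(x)
        if self.layer_type == "attention":
            h, _ = self.attn(h, h, h)
        else:  # e.g. the Mamba/MLP path in the real model
            h = self.mlp(h)
        return x + h
```

If the viewer reconstructs structure per class rather than per traced instance, I could see the two branches getting conflated.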
Hi, I'm from Embedl (embedl.com / https://huggingface.co/embedl) and we made the hfviewer. Could you elaborate on why the Nemotron model visualization might be incorrect? A number of passes are performed to get the graph structure from the HuggingFace config, including sometimes exporting the model with torch.export and then recombining it to make the view meaningful. We would love to fix any issues and make the viewer better.
The Nemotron model has attention layers interspersed with the Mamba layers, but I didn't see any attention layers in the visualization. It looks like the attention layers are present but show up as blocks with an RMSNorm followed by two sequential linear layers. The first few resolution levels aren't very useful either.
It maps (-inf, inf) to (0, inf) in about as nice a way as you could expect (addition turns into multiplication). When you want to constrain a value to be positive, parameterizing it with exp is usually a good option.
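For example, in PyTorch you might keep a strictly positive scale by learning its log (the names here are just illustrative):

```python
import torch
import torch.nn as nn

class PositiveScale(nn.Module):
    """Learn an unconstrained log_scale in (-inf, inf); exp() keeps the
    effective scale in (0, inf) throughout optimization."""
    def __init__(self):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        return torch.exp(self.log_scale) * x
```

And because exp turns sums into products, an additive gradient step on log_scale is a multiplicative update of the scale itself, which is part of why it behaves so nicely.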
The "model card" concept actually comes from a pre-LLM Google paper (https://arxiv.org/abs/1810.03993), where the example cards did fit on a single page. The concept quickly became a standard component of AI governance frameworks, and Hugging Face adopted it as a reasonable standard format for a model README. As LLMs emerged and became more capable at broader ranges of tasks, model cards expanded to the sizes we see today.
That makes sense. I recall a “battle card” (a “concise, easy-to-scan document that helps [sales] reps handle competitive conversations, respond to objections, and highlight key differentiators,” per HubSpot) as being about a half-sheet document, which is congruent.
There's some validity to these criticisms, but it would be a lot more credible to cite someone whose job isn't "loudly promote any claim that sounds negative for AI, regardless of how well-founded it is."
It would be twice that, since Nvidia always lists "with sparsity" FLOPS as the headline number. But I bet they got a bunch of research credits to do this.
I've been waiting for them to publish the 4B model for a while so I'm glad to have something similar to play with. I think I trust the Ranke-4B process a bit more, but that's partly because there aren't a lot of details in this report. And actually releasing a model counts for a whole lot.
One thing that I think will be a challenge for these models is achieving any sort of definite temporal setting. Unless the conversation establishes a clear timeframe, the model may end up picking a more or less arbitrary context, or worse, averaging over many different time periods. I think this problem is mostly handled by post-training in modern LLMs (plus the fact that most of their training data comes from a much narrower time range), but that is probably harder to accomplish while trying to avoid bias in the SFT and RL process.
I wonder if it would be possible to do something simple like prepending sentinel tokens with the year. Or, since they're training a model from scratch anyways, tweak the architecture to condition on a temporal embedding. That opens the door to cool stuff like: Generate a response from 2050.
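The second idea could be as small as a per-year embedding added to the token embeddings. A purely hypothetical sketch (the names and year range are made up):

```python
import torch
import torch.nn as nn

class YearConditionedEmbedding(nn.Module):
    # Hypothetical: condition the model on a coarse temporal signal by adding
    # a learned per-year embedding to every token embedding.
    def __init__(self, vocab_size, d_model, min_year=1800, max_year=2100):
        super().__init__()
        self.min_year = min_year
        self.tok = nn.Embedding(vocab_size, d_model)
        self.year = nn.Embedding(max_year - min_year + 1, d_model)

    def forward(self, token_ids, year):
        # token_ids: (batch, seq), year: (batch,) integer years
        year_idx = (year - self.min_year).clamp(0, self.year.num_embeddings - 1)
        return self.tok(token_ids) + self.year(year_idx).unsqueeze(1)
```

At inference you'd just pass whatever year you want the response grounded in, including ones outside the training range.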