
You can just mask the output probabilities for each token based on which options are valid according to a grammar.

There are quite a few open-source implementations of this, e.g. https://github.com/outlines-dev/outlines
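
A toy sketch of the masking step in PyTorch (illustrative names, not the actual outlines API):

    import torch

    # Keep only the logits of tokens the grammar allows here; everything
    # else gets probability zero after the softmax.
    def mask_logits(logits, valid_token_ids):
        masked = torch.full_like(logits, float("-inf"))
        masked[valid_token_ids] = logits[valid_token_ids]
        return masked

    logits = torch.randn(32000)          # next-token logits for one step
    valid = torch.tensor([11, 92, 198])  # token ids the grammar allows here
    probs = torch.softmax(mask_logits(logits, valid), dim=-1)
    # probs is nonzero only on the grammar-legal tokens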


You could simply censor invalid tokens, but that does rely on 2 assumptions.

1. There is always a valid next token.

2. This greedy algorithm doesn't result in a qualitatively different distribution from a rejection sampling algorithm.

The latter isn't too obvious, and may in fact be (very) false. Look up maze generation algorithms if you want some feeling for the effects this could have.

If you just want a quick argument, consider what happens if picking the most likely token would increase the chance of an invalid token further down the line to nearly 100%. By the time your token-picking algorithm has any effect it would be too late to fix it.


Sorry, how could there not be a valid next token? Presumably your interface would generate a state machine with appropriate masking arrays, and IIRC all 256 byte values are generally in the token list. There's no way to get stuck in a place where the JSON is invalid? Can you give an example?

If you want to be really clever about your picker, a deterministic result would blat out all the known possible strings.

For example, if you had an object with a defined set of properties, you could go ahead and not bother generating tokens for the property names at all and just tokenize, e.g., `{"foo":"` (6-ish tokens) without even passing through the LLM. As soon as an unescaped `"` arrives, you know the continuation must be `,"bar":"`, for example.
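
A hypothetical sketch of that shortcut: with a fixed two-property schema, all the structural characters are forced, so only the string values ever need to come out of the LLM (`sample_string_value` here is a stand-in for the constrained sampling loop):

    # Emit the forced JSON skeleton directly; only the values are sampled.
    def generate_object(sample_string_value):
        parts = []
        for i, key in enumerate(["foo", "bar"]):
            parts.append(("{" if i == 0 else ",") + f'"{key}":"')  # forced, no LLM call
            parts.append(sample_string_value(key))  # sampled until an unescaped "
            parts.append('"')                       # forced closing quote
        parts.append("}")
        return "".join(parts)

    print(generate_object(lambda key: f"<llm text for {key}>"))
    # {"foo":"<llm text for foo>","bar":"<llm text for bar>"}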

> This greedy algorithm doesn't result in a qualitatively different distribution from a rejection sampling algorithm.

It absolutely will. But so will adding an extra newline in your prompt, for example. That sort of thing is part and parcel of how llms work


Hmm, I think any example where it can get stuck is going to be a bit contrived, since really it's a question of how easy it is to recognize a valid prefix. Say for example you want the LLM to generate a valid chess game and it ends up in a situation with just the 2 kings left. If you're not careful with your definitions you could end up in a game that never terminates.

That said, if you know all valid prefixes in your language in advance, then you can always realise when a token leaves no valid continuations.

> It absolutely will. But so will adding an extra newline

A newline is less likely to dramatically drop the quality; a greedy method could easily end up driving itself into a dead end (if not grammatically, then semantically).

Say you want it to give a weather prediction consisting of a description followed by a tag, 'sunny' or 'cloudy', and your model is on its way to generating

    {
      "desc": "Strong winds followed by heavy rainfall.",
      "tag": "stormy"
    }
If it ever gets to the 's' in stormy it will be forced to pick 'sunny', even if that makes no sense in context.
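
A tiny illustration of that trap (hypothetical helper, not a real library): prefix masking only ever removes options, so once the model has committed to 's', 'cloudy' is gone, and since 'stormy' was never in the grammar, 'sunny' is the only legal completion left:

    allowed_tags = ["sunny", "cloudy"]

    # Which tags are still reachable given the characters emitted so far?
    def legal_completions(prefix):
        return [t for t in allowed_tags if t.startswith(prefix)]

    print(legal_completions(""))   # ['sunny', 'cloudy']
    print(legal_completions("s"))  # ['sunny'] -- forced, however nonsensical in context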


The schema needs to be part of the prompt as well, so the model can associatively recall the options.


I think this is a property of spheres. It seems to me that for any two spheres that are touching, the straight line from one center to the other passes exactly through the point of contact. Try thinking of just two spheres and adding more step by step.

Then the result follows because all the spheres are defined as centered on the cube/sub-cubes respectively.
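
Concretely, the tangency fact being used (a standard result, nothing specific to this puzzle):

    % Externally touching spheres with centers c_1, c_2 and radii r_1, r_2:
    \|c_1 - c_2\| = r_1 + r_2, \qquad
    p = c_1 + r_1 \, \frac{c_2 - c_1}{\|c_1 - c_2\|}
    % p, the point of contact, lies on the segment from c_1 to c_2.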


The inner sphere is not defined as centered on the cube; it is defined as touching all the other spheres.

That said, there is a symmetry argument that if it were centered anywhere else, something would be wrong. But that only works if there is a unique sphere that touches all the other spheres, which is also not obvious to me in higher dimensions.


You can go ahead and define it as centered on the cube. That still demonstrates the strange nature of high-dimensional spheres even if there wasn't a unique solution for touching all the other spheres.
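
A quick check, assuming the standard construction (a cube of side 4, 2^n unit spheres centered at (±1, ..., ±1), and a central sphere touching all of them along the center-to-center lines):

    import math

    # Inner radius = distance from the origin to a corner sphere's center,
    # minus that sphere's radius of 1.
    for n in (2, 3, 4, 9, 10):
        inner_radius = math.sqrt(n) - 1
        pokes_out = inner_radius > 2  # a cube of side 4 has half-width 2
        print(n, round(inner_radius, 3), "pokes outside the cube" if pokes_out else "")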


Aha! Right, this is pretty convincing to me. Thanks!


How does the performance between GPU programs written with std::par compare to those written in CUDA?

Do you happen to know of any online resources that show a comparison of the kernel code and performance of the two frameworks on common tasks?


This paper ported a CFD application, which had a tuned CUDA implementation, to std::par: https://arxiv.org/pdf/2010.11751.pdf

In Table 3, the first and last columns show the performance of CUDA and std::par as a percentage of theoretical peak.

The rows show results for different GPU architectures.

On V100, CUDA achieves 62% of theoretical peak and std::par 58%.

The amount of developer effort required to achieve over 50% of theoretical peak with std::par makes it a no-brainer IMO.

If there is one kernel where you need more performance, you can always implement that kernel in CUDA, but for 99% of the kernels in your program your time might be better spent elsewhere.


There is a lot of evidence that these token-based models work with multi-modal data. In fact, several groups have proposed different multi-modal transformer architectures already (e.g. [1] or [2]), although I don't believe anyone has scaled them up much farther than 300M parameters yet.

If these models are shown videos of butterflies flapping their wings with a text description of 'a butterfly flapping its wings,' why wouldn't you expect it to start to relate the information coming from multiple modalities?

It's definitely a challenge to get enough high-quality data to feed a 100B parameter version of such a multi-modal model, but there don't seem to be any theoretically insurmountable issues with this "dumb" way of giving the models more intuition.

[1] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, https://arxiv.org/abs/2104.11178

[2] Perceiver IO: A General Architecture for Structured Inputs & Outputs, https://arxiv.org/abs/2107.14795


Having more memory to address means you need more circuits that direct your reads/writes to the right place. Travelling a longer distance / through more complicated routes means the latency for each request will be higher on average.


LSTM stands for Long Short Term Memory. It's a recurrent network that learns what and how long things should be kept in its internal state buffer. It doesn't have a fixed state size because it's just learning a nonlinear function that takes an input and a state to an output and a new state. Obviously it can't model all possible, infinite length recurrences, but it can definitely do a pretty good job of approximating long term recurrence relations in complex signals.


I don't think that assessment is quite right. The hidden size is fixed - the second argument to Pytorch's nn.LSTM constructor is "hidden_size – The number of features in the hidden state h".

A call to `y, (h, c) = layer(x)` (where x has a batch size of 1 and an arbitrary sequence length) produces two state tensors of dimensions `(1, 1, hidden_size)`, where hidden_size is the exact number you passed to the LSTM constructor. Those two states represent the long term (cell) and short term (hidden) memory features.
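
A quick shape check (input_size=1 is an assumption for a mono audio stream; 20 is the hidden size the linked project reportedly uses):

    import torch
    import torch.nn as nn

    layer = nn.LSTM(input_size=1, hidden_size=20)

    x = torch.randn(1000, 1, 1)  # (seq_len, batch=1, input_size)
    y, (h, c) = layer(x)
    print(y.shape)               # torch.Size([1000, 1, 20])
    print(h.shape, c.shape)      # both torch.Size([1, 1, 20]), regardless of seq_len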

You would need to have an LSTM with hidden_size large enough to store the samples (or a compressed representation) of your entire loop. Not to mention you'd run into other issues with handling the logic around variable length loops based on a pedal toggle.


The hidden state isn't storing the samples of your loop (or a compressed version of your loop). It's encoding a representation of how the output will change based on what the current state and input are. This might be strongly dependent on what the exact samples in the loop are, but it could also be more general. I think it's missing a bit of the representational power of an LSTM to see the state representation as just a buffer of the current input.

But, yeah, at some point your signal has such complex behavior on long time scales that there isn't a good way to predict it based on a limited state size (or at least gradient descent can't find a function to predict it for you).


If you can reproduce the original information based only on a state input, you have stored it in the state (in an encoded form or not). If your state is smaller than the original information, you have compressed it. If your reproduction is not faithful to the original, you have created lossy compression.

If the future input samples have a meaningful impact during loop playback, then it hasn't learned the correct behavior of the original loop pedal.

Note that the linked project appears to use a hidden size of 20. Twenty floats. With that much space we're very much back to "sure, you might theoretically be able to loop if the information fits in the hidden size".

Increasing the hidden size beyond 20 still won't solve learning the complex state machine behavior of an original loop pedal, which can loop variable length audio. You'd need to provide the pedal state to the network in addition to the audio, and probably need to train it on a bunch of different loop lengths (>thousands?).

This would mostly be an academic pursuit, as it's extremely impractical compared to the other uses of the device.


I think most human creativity is built around a seed of inspiration from outside sources.

In my experience serendipity and happy accidents are exactly what leads to the most creative and interesting outcomes.

This is a system that’s able to generate these inspirations/prompts for you in a way that’s more focused on what you’ve already written. I don’t see it as ironic because I don’t think it’s hijacking your creative potential, just augmenting it.


I think the larger models get, the more incentive there is for researchers to look into pruning/distilling them for practical use.

GPT-1, 2, 3 et al. have all shown that larger is better. While in the short term this means people will simply throw larger and larger clusters at the problem, in the longer term there needs to be innovation in making it more efficient on the clusters we have (as even the cloud has limits).

I think sheer parameter count is an important part of the equation in general intelligence, so it's important that there are labs that work on scaling up promising leads to trillions of parameters on top of labs thinking of new promising directions.


It's YYMM.ID I believe

Last year, May was 1905.xxxxxx
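
For example, parsing the ID of the CFD paper linked elsewhere on this page:

    # "2010.11751" -> submitted 2020-10, sequence number 11751
    arxiv_id = "2010.11751"
    yymm, seq = arxiv_id.split(".")
    year, month = 2000 + int(yymm[:2]), int(yymm[2:])
    print(year, month, seq)  # 2020 10 11751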


Yep. This year and last year will be the only two confusing years for a century.


Aha. Why are they trying to save 2 bytes' worth of space? Is it legacy?


Have all novel qubits gotten Nobel prizes so far?

I think there's more than enough room in between bullshit and a Nobel prize.

This looks very promising, but as always the devil is in the details. The next steps are multiqubit gates, then linking those up into useful quantum circuits, and then hopefully doing actual quantum computations.

Personally I wouldn't expect a Nobel prize until one of those last 2 steps. There are a lot of hurdles before then where you could run into some nasty problems.

