Absolutely. Who cares if the LLM automates some of the grunt work? Mathematicians are artists, and they paint with ideas. The goal is to map out more of the beautiful structure of how things work. The enjoyment derives entirely from the payoff of seeing the larger view of how things fit together. If part of their process involves bouncing things off of other people, or even LLMs, I don't think it matters much, nor does it take away from the enjoyment of getting things figured out.
It's close but not the same. Kernel PCA lifts all D coordinates, which gives M around 525k at D = 1024. In the post I do PCA first to reduce D to d = 256, then lift only those d coordinates, so M is about 33k. Much smaller, much faster Ridge solve.
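A rough sketch of that pipeline with sklearn (the shapes, target, and alpha below are placeholders for illustration, not the post's actual settings):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    # stand-in data: N x D embeddings and a scalar target (made up for the example)
    X = np.random.randn(5000, 1024)
    y = np.random.randn(5000)

    # PCA to d = 256, then a degree-2 lift of only those d coordinates:
    # M = d + d(d+1)/2 = 33,152 features, vs ~525k if you lifted all D = 1024
    model = make_pipeline(
        PCA(n_components=256),
        PolynomialFeatures(degree=2, include_bias=False),
        Ridge(alpha=1.0),
    )
    model.fit(X, y)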
That makes sense. If you could magically get the top d PCs in quadratic kernel space without having to compute the whole kernel matrix, and then just do top-d quadratic PCs -> ridge, would that be better than the PCA -> top-d -> quadratic lift -> ridge pipeline you're using now?
"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1, it technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."
Mathematically, it is literally a probability distribution, because it fits the definition of a measure whose total mass is one, so I think the language is just imprecise. What they may be trying to say is that semantically it doesn't arise in a principled way from an uncertainty model, such as from Bayesian or frequentist statistics.
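For reference, the function the quoted passage is describing in words is just the standard softmax:

    \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad \sigma(z)_i > 0, \qquad \sum_{i=1}^{K} \sigma(z)_i = 1

It assigns a positive mass to each of the K tokens and those masses sum to one, which is exactly a probability distribution on a finite set.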
Hogwash. If you get into deriving maximum entropy distributions via the calculus of variations, the softmax form is exactly what falls out: among categorical distributions with a given expected score, it is the one that maximizes entropy.
This is exactly the sense in which it comes up for old-school LMs, and why it appears in thermodynamics.
Of course it is entirely possible that newfangled ML people use it without understanding that it is derived from first principles - i.e. see article.
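Roughly, the standard Lagrangian sketch (my notation): maximize the entropy of p over a finite set of outcomes, subject to normalization and a fixed expected score mu:

    \mathcal{L} = -\sum_i p_i \log p_i \;+\; \lambda\Big(\sum_i p_i - 1\Big) \;+\; \beta\Big(\sum_i p_i s_i - \mu\Big)

    \frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda + \beta s_i = 0
    \quad\Rightarrow\quad
    p_i = \frac{e^{\beta s_i}}{\sum_j e^{\beta s_j}}

With s_i = -E_i and beta = 1/kT this is the Boltzmann distribution; with beta = 1 and s_i the network's logits it is exactly the softmax.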
That definitely could be the case. I was also a bit surprised by what the article said, so I was simply trying to interpret it, but I'm not extremely well versed in ML so I could be missing some details. My main point was that contrary to what the article said, they do in fact have a probability distribution on their hands.
You have a relatively small dictionary of tokens, the network produces a score for each of them at the final token prediction layer, and the model is trained on a log-softmax of those scores (i.e. the function above) to predict the next token.
This is exactly how anyone in any field does conditional multinomial/categorical (i.e. one of a bunch of distinct tokens) distributions, and AFAIK it is what LLMs generally use as their loss function on the output layer, though I have not deeply investigated all of them, since this has been how you do it since time immemorial.
I am extremely confused by all of the people screaming it's not a probability distribution?!?!?
I have seen computer vision tasks use binomial training objectives (one-vs-all) and then use the multinomial only at inference time, and it would be fair to say that is not a probability distribution induced by training (while technically still a probability distribution in the sense that it is \ge 0 and sums to 1).
But AFAIK the token-prediction LLMs that I am aware of use the softmax for the probability in their loss function, i.e. they maximize the log-softmax.
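A toy numpy version of what I mean by maximizing the log-softmax (the vocabulary size, batch size, and names here are made up purely for illustration):

    import numpy as np

    def log_softmax(logits):
        # subtract the max for numerical stability, then normalize in log space
        z = logits - logits.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    logits = np.random.randn(2, 8)      # final-layer scores: batch of 2, vocab of 8 tokens
    targets = np.array([3, 5])          # the observed next tokens

    log_probs = log_softmax(logits)                    # log of a categorical distribution
    nll = -log_probs[np.arange(2), targets].mean()     # cross-entropy / negative log-likelihood
    # training minimizes nll, i.e. maximizes the log-softmax score of the observed tokens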
The comment in parentheses mentions "they're not derived from a probability space" [1]. I don't know enough about probability spaces or softmax to know what part of a probability space this is missing compared to other probability distributions, or how other probability distributions satisfy the definition of a probability space.
Sounds like they're saying that since the distribution doesn't come from measuring or calculating the probability of something, it has the form of a probability distribution but isn't really one. Like saying 5 feet is a height that a person can have, but since I just made up that number it's not actually a person's height.
The softmax is the probability of the next token being whatever it is in the training data, conditioned on the inputs. The author apparently just doesn't know that and thinks it was an arbitrary choice.
The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.
Softmax isn't a loss function. It is used to transform model outputs into positive numbers that sum to 1, so that they can be interpreted as probabilities, and those numbers are then passed into (typically) the cross-entropy loss function. I think you mean to ask which models are trained using some function other than softmax to transform the model outputs. There are a number of alternatives to softmax, such as the ones described here: https://www.emergentmind.com/topics/sparsemax
They’re not. Cross-entropy loss is E[-log q], where q is a probability. You could convert the model outputs x into probabilities using some other function, like q_i = x_i^2 / Z, and compute the cross-entropy loss just fine.
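A quick numpy illustration of that point (the squared normalization is just the toy example above, not something any real model necessarily uses):

    import numpy as np

    x = np.random.randn(8)              # raw model outputs for 8 classes
    target = 3                          # the observed class

    # the usual route: softmax, then cross-entropy
    q_softmax = np.exp(x) / np.exp(x).sum()
    loss_softmax = -np.log(q_softmax[target])

    # alternative normalization q_i = x_i^2 / Z: still nonnegative and sums to 1,
    # so -log q[target] is still a perfectly well-defined cross-entropy
    q_squared = x**2 / (x**2).sum()
    loss_squared = -np.log(q_squared[target])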
iirc, there is a bunch of formal machinery you need to define probability distributions for situations such as infinite outcomes (e.g. what is the probability that a random real number between 0 and 10 is less than 3?)
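(For that toy example, the answer comes from a density rather than from point probabilities, which is exactly the extra machinery:)

    P(X < 3) = \int_0^3 \tfrac{1}{10}\,dx = 0.3, \qquad P(X = c) = 0 \text{ for any single point } c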
It seems like some kind of technique is needed that maximizes information transfer between huge LLM-generated codebases and a human trying to make sense of them. Something beyond just deep-diving into the codebase with no documentation.
This line of reasoning makes no sense when the AI can just be given access to a fuzzer. I would guess that it probably did have access to a fuzzer to put together some of these vulnerabilities.
> So the data cannot possibly tell you anything about how likely is the observed outcome, because the observed outcome is the only outcome that you observe.
This could also be viewed as supporting the Bayesian perspective, where the observed data are not viewed as random variables - they are fixed. This is because, as you say, the observed outcome is the only outcome that you observe. It is the classical setting, in comparison, where we instead do our analysis by treating the sample as a random variable, placing the counterfactual on other non-observed values ("what if I had drawn a different sample?"), even though we didn't. Bayesian methods treat the data as gospel truth, and place the counterfactual on the different parameters ("what if the population were different?"), even though it isn't.
The other criticism you have is
> The problem with this approach is that we can only observe ONE level of treatment effectiveness, i.e., the level of treatment effectiveness that the treatment actually possesses. All other possible levels of effectiveness are entirely hypothetical.
This is true of both Bayesian and classical methods. We build models that would explain how different hypothetical levels of effectiveness would affect what data we should expect to see - that is the whole point. Classical methods also involve exploring scenarios in which purely hypothetical values of the parameter might be true, and characterizing counterfactual samples that could have been drawn from them, even though in real life they couldn't have been.