Hacker News | ComplexSystems's comments

Absolutely. Who cares if the LLM automates some of the grunt work? Mathematicians are artists, and they paint with ideas. The goal is to map out more of the beautiful structure of how things work. The enjoyment derives entirely from the payoff of seeing the larger view of how things fit together. If part of their process involves bouncing things off of other people, or even LLMs, I don't think it matters much, nor does it take away from the enjoyment of getting things figured out.

What makes this different from just kernel PCA with the quadratic kernel?

It's close but not the same. Kernel PCA lifts all D coordinates, which gives M ≈ 525k at D = 1024. In the post I do PCA first to reduce D to d = 256, then lift only those d coordinates, giving M ≈ 33k. Much smaller, much faster ridge solve.
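(For concreteness, a sketch of the feature counts involved; the `quadratic_lift` helper below is illustrative, not the post's actual code:)

```python
import numpy as np

def quadratic_lift(X):
    """Lift each row of X to its degree-2 polynomial features:
    the original coordinates plus all products x_i * x_j with i <= j."""
    n, d = X.shape
    iu = np.triu_indices(d)
    pairs = (X[:, :, None] * X[:, None, :])[:, iu[0], iu[1]]
    return np.hstack([X, pairs])

D, d = 1024, 256
# Lifting all D coordinates (explicit quadratic features, kernel-PCA-style):
M_full = D + D * (D + 1) // 2   # 525824, i.e. ~525k
# PCA down to d first, then lift only those d coordinates:
M_small = d + d * (d + 1) // 2  # 33152, i.e. ~33k
print(M_full, M_small)
```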

That makes sense. If you could magically just get the top d PCs in quadratic kernel space without having to compute the whole kernel matrix, and then just do top-d quadratic PCs -> ridge, would that be better than doing the PCA -> top-d -> quadratic kernel ridge as you are now?

People can and do see unidentified things and take plenty of photos of them.

While I am sure FreeBSD is more secure than your average Linux distro, I sure hope they are using these new AI models to harden everything.

Good article, but

"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1. Technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."

Why is this a "pseudo-probability distribution?"


Mathematically, it is literally a probability distribution, because it fits the definition of a measure whose total mass is one, so I think the language is just imprecise. What they may be trying to say is that semantically it doesn't arise in a principled way from an uncertainty model, such as from Bayesian or frequentist statistics.
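(A minimal sketch of the transformation in question, for anyone following along; the numbers are arbitrary:)

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, -1.0, 0.5]))
# Each entry lies in (0, 1) and the entries sum to 1, so p satisfies
# the definition of a probability distribution on a finite outcome set.
print(p, p.sum())
```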

Hogwash. If you get into deriving maximum entropy distributions via the calculus of variations, the multinomial is the maximum entropy distribution among categorical distributions.

This is exactly the sense that it comes up for old school LMs and why it appears in thermodynamics.

Of course it is entirely possible that newfangled ML people use it without understanding that it is derived from first principles - i.e. see article.
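(For reference, the derivation alluded to is the standard maximum-entropy one; the score notation s_i here is mine:)

```latex
\max_{p}\; -\sum_i p_i \log p_i
\quad\text{s.t.}\quad \sum_i p_i = 1, \qquad \sum_i p_i s_i = \bar{s}.
```

Setting the derivative of the Lagrangian to zero gives \(\log p_i = \lambda s_i + \text{const}\), i.e. \(p_i \propto e^{\lambda s_i}\): the softmax (Gibbs/Boltzmann) form, which is also why the same expression appears in thermodynamics.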


That definitely could be the case. I was also a bit surprised by what the article said, so I was simply trying to interpret it, but I'm not extremely well versed in ML so I could be missing some details. My main point was that contrary to what the article said, they do in fact have a probability distribution on their hands.

This is literally the probability distribution ML models are trained on.

https://docs.pytorch.org/docs/2.11/generated/torch.nn.CrossE...

You have a relatively small dictionary of tokens, each prediction has a neural network score that goes into the final token prediction layer, and they are trained based on a log-softmax (i.e. the above function) to predict their next token.

This is exactly how anyone in any field does conditional multinomial/categorical (i.e. one of a bunch of distinct tokens) distributions, and AFAIK it is what LLMs generally use as their loss function on the output layer, though I have not deeply investigated all of them; this has been how you do it since time immemorial.
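(The loss being described can be sketched in a few lines of plain NumPy rather than PyTorch, so the mechanics are explicit; the variable names are illustrative:)

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z)  # numerical stability
    return z - np.log(np.sum(np.exp(z)))

def nll_loss(scores, target):
    """Negative log-likelihood of the target token under the softmax of the
    network's scores: the standard categorical cross-entropy objective."""
    return -log_softmax(scores)[target]

scores = np.array([1.0, 3.0, 0.2])  # one score per token in the vocabulary
loss = nll_loss(scores, target=1)   # penalize low probability on the true token
print(loss)
```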

I am extremely confused by all of the people screaming it's not a probability distribution?!?!?

I have seen computer vision tasks use binomial (one-vs-all) training objectives and then use the multinomial only at inference time, and in that case it is fair to say the result is not a probability distribution induced by training (though it is technically a probability distribution in the sense that it is \ge 0 and sums to 1).

But afaik the token-prediction LLMs that I am aware of use the softmax for the probability in their loss function, i.e. they maximize the log softmax.


The comment in parentheses mentions "they're not derived from a probability space" [1]. I don't know enough about probability spaces or softmax to say what part of a probability space this is missing compared to other probability distributions, nor how other probability distributions satisfy the definition of a probability space.

[1] https://en.wikipedia.org/wiki/Probability_space


Sounds like they're saying that since the distribution doesn't come from measuring or calculating the probability of something, it has the form of a probability distribution but isn't really one. Like saying 5 feet is a height that a person can have, but since I just made up that number it's not actually a person's height.

The softmax is the probability of the next token being whatever is in the training data, conditioned on the inputs. The author just doesn't know that, apparently, and thinks it was an arbitrary choice.

The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.


The softmax, after the network has been trained, yields an estimate of the probability in the training data, but it is not that probability itself.

Which models are not trained with the log softmax as the loss function?

Softmax isn't a loss function. It is used to transform model outputs into positive numbers that sum to 1, so that they can be interpreted as probabilities, and then those numbers are passed into (typically) the cross entropy loss function. I think you mean: which models are trained using some function other than softmax to transform the model outputs? There are a number of alternatives to softmax, such as the ones described here: https://www.emergentmind.com/topics/sparsemax

The cross entropy loss function is softmax. They are one and the same.

They’re not. Cross entropy loss is E[-log q] where q is a probability. You could convert the model outputs x into probabilities using some other function like q = 1/Z x^2, and compute cross entropy loss just fine.
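(That separation is easy to make concrete: cross entropy only needs some valid probability vector q, however it was produced. The squared normalization here is the illustrative alternative from the comment, not anything any real model uses:)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def squared_norm(x):
    # An alternative map to probabilities: q_i = x_i^2 / sum_j x_j^2.
    q = x ** 2
    return q / q.sum()

def cross_entropy(q, target):
    # Cross entropy against a one-hot target: -log q[target].
    return -np.log(q[target])

x = np.array([1.0, 2.0, 3.0])
print(cross_entropy(softmax(x), 2))       # softmax + cross entropy
print(cross_entropy(squared_norm(x), 2))  # a different map + the same loss
```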


Behold the actual definition of cross entropy: https://en.wikipedia.org/wiki/Cross-entropy

It's true that the PyTorch API conflates cross entropy and softmax, but they are separate concepts.


iirc, there is a bunch of formal machinery you need to define probability distributions for situations with infinitely many outcomes (e.g., what is the probability that a random real number between 0 and 10 is less than 3?)
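(The example in the parent is easy to check numerically; under a uniform measure on [0, 10] the true answer is 3/10, even though every individual real number has probability zero:)

```python
import random

random.seed(0)
n = 100_000
# Each single real in [0, 10] has probability zero under the uniform
# (Lebesgue) measure, yet the event "x < 3" has probability 3/10.
hits = sum(1 for _ in range(n) if random.uniform(0, 10) < 3)
print(hits / n)  # close to 0.3
```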

It seems like some kind of technique is needed that maximizes information transfer between huge LLM generated codebases and a human trying to make sense of them. Something beyond just deep diving into the codebase with no documentation.


Yes it does; you can build the absolute value as sqrt(x²), and sqrt(x) and x² are both constructible using eml.


This line of reasoning makes no sense when the AI can just be given access to a fuzzer. I would guess that it probably did have access to a fuzzer to put together some of these vulnerabilities.


> So the data cannot possibly tell you anything about how likely is the observed outcome, because the observed outcome is the only outcome that you observe.

This could also be viewed as supporting the Bayesian perspective, where the observed data are not viewed as random variables - they are fixed. This is because, as you say, the observed outcome is the only outcome that you observe. It is the classical setting, in comparison, where we instead do our analysis by treating the sample as a random variable, placing the counterfactual on other non-observed values ("what if I had drawn a different sample?"), even though we didn't. Bayesian methods treat the data as gospel truth, and place the counterfactual on the different parameters ("what if the population were different?"), even though it isn't.

The other criticism you have is

> The problem with this approach is that we can only observe ONE level of treatment effectiveness, i.e., the level of treatment effectiveness that the treatment actually possesses. All other possible levels of effectiveness are entirely hypothetical.

This is true of both Bayesian and classical methods. We build models that would explain how different hypothetical levels of effectiveness would affect what data we should expect to see - that is the whole point. Classical methods also involve exploring scenarios in which purely hypothetical values of the parameter may be potentially true, and characterizing counterfactual samples that could have been drawn from them, even though in real life they couldn't have been.


Statistical inference is based on random sampling. The data has to be random, otherwise it doesn't work.

I wrote another comment here clarifying my point, if you're interested: https://news.ycombinator.com/item?id=47566033


I found it surprising that this article persistently did not capitalize the word "Bayesian." Is this a new trend or something?

