That’s not the only thing that matters. The provenance of the code also matters ...

red75prime · 2026-06-06T03:35:58 1780716958

> This is not a hypothetical scenario; I’ve personally encountered a case of someone using an LLM to attempt to contribute code I recognized from a specific Open Source project under one license to another project under a different license

You say you "recognized code". Does it mean that you weren't able to find the exact match?

> an LLM is actually just regurgitating portions of its inputs

You seem to be talking about the inputs to the autoregressive pretraining stage. Correct? Then that's not how LLMs work, unless we define "portions" as blocks of a few letters.

eschaton · 2026-06-06T03:44:23 1780717463

I found exact matches. I also found inexact matches, where C functions had been turned into C++ member functions and the like. “Recognized” does not somehow imply a lack of precision.

The LLM the person used was trained on a very large corpus of Open Source code, and reproduced that code exactly. Just like LLMs have reproduced chapters of books and articles from the New York Times exactly.

red75prime · 2026-06-06T04:07:43 1780718863

> I found exact matches.

Were those functions trivial? With, say, a 1% probability of someone who has not seen them writing them like that?

> Just like LLMs have reproduced chapters of books and articles from the New York Times exactly.

Have you read the articles? As far as I remember they fed large chunks of an article multiple times to an LLM to sometimes get a not-so-long exact match. It can mean that LLMs can infer a style and humans are predictable.

Topfi · 2026-06-06T11:01:33 1780743693

> […] fed large chunks of an article multiple times to an LLM […]

So they had to prompt? An LLM? I got this argument before and still don’t get what it’s trying to say. These models do not output anything unless prompted, that’s not any kind of gotcha.

On the code output front there is a lot of relevant evidence beyond the NYT lawsuit [0].

If I slightly modify GPL code, that doesn’t give me the right to relicense.

[0] https://arxiv.org/html/2601.02671 and https://arxiv.org/abs/2506.12286 and https://ai.stanford.edu/blog/verbatim-memorization/

eschaton · 2026-06-06T04:11:49 1780719109

No, the functions weren’t trivial, and a lot of the surrounding code and structure bore substantial similarities as well. If you saw the two files next to each other, you’d assume it was the result of a copy-paste-adjust process if you didn’t know an LLM was involved.

red75prime · 2026-06-06T04:50:51 1780721451

I can only speculate that the model that generated the code hasn't undergone selective unlearning for verbatim data (SUV) or something similar. As you understand, "sometimes generates verbatim code" and "just regurgitates [non-trivial] portions of its input" are different statements.

The possibility of SUV clearly shows that a model does more than "just regurgitating."

matheusmoreira · 2026-06-05T21:28:41 1780694921

"LLM produced licensed code and person contributed it" is indistinguishable from "person contributed licensed code". The LLM is irrelevant. Result is the same as if they had copy pasted it.

eschaton · 2026-06-05T22:05:33 1780697133

Yes, exactly.

Unfortunately, a large number of people are being told—and here, you can see many who believe it—that the output of an LLM either carries no copyright or is copyright by the one prompting it. In other words, even right here on Hacker News it’s widely believed that LLMs “launder” copyright.

matheusmoreira · 2026-06-05T22:25:00 1780698300

Irrelevant either way. It's your name on the commit, and the code either infringes or it does not. Whether an LLM was used is immaterial.

eschaton · 2026-06-05T22:28:37 1780698517

Not irrelevant. A large number of people who would not copy and paste code from one project to another will attempt to contribute the copyright-infringing output of an LLM and not think twice.

potsandpans · 2026-06-05T21:44:19 1780695859

The genie is out of the bottle here. If this were true, then all Fortune 500 companies would be clutching their pearls and limiting their developers' access to these tools.

But for better or worse, I can assure you (and though you have no reason to believe me, just look at the headlines): nearly all tech companies are setting internal goals to have x% of code generated by LLMs by y date. And speaking as an insider, that x number is very large and that y date is very soon.

And before everyone continues to downvote me because I'm saying things that you don't want to hear, you have to realize that this is the world we live in now.

So, either you're right and the legal entities attached to some of the most powerful tech corporations have just decided to flout the law. Or you are missing something, or the game has changed.

Open source projects that want to use provenance as a gatekeeper against LLM-generated code entering their code base are going to get smoked.

There's nothing stopping a company like Anthropic from funding an open source division that starts forking projects and accelerating the development. Expect 1000x more Buns.

There's nothing stopping a wealthy individual who wants to do that.

When the dust settles, no one is going to be worried about what you've typed here.

And if somehow the IP lawyers and capitalists win, then China will become the tech hub of the world.

Whether it's right or wrong, that is the reality.

eschaton · 2026-06-05T22:12:42 1780697562

The Fortune 10 company that I spent decades at and retired from just a couple years ago noticed this issue immediately and issued a blanket ban on the use of these tools for the company’s own code that to my knowledge has not been rescinded. (They also started developing their own coding-specific LLM, training solely on code they owned, around the same time.)

You might consider that there is a very large incentive by the large and public players in this market to promote the idea that this is not true, that they consider themselves large and powerful enough to actually flout the law, and that they plan to use the argument that enforcement will be too damaging to the economy to make their view the “new normal.”

This playbook has been run before, by Uber and Lyft, by AirBnB, by Tesla with “FSD,” and so on. It’s very clearly the approach being taken.

saagarjha · 2026-06-06T01:37:15 1780709835

They’re using Claude lmao

potsandpans · 2026-06-05T22:25:18 1780698318

Well, I've personally worked at 3 of the Fortune 10 (two from pre-LLM-mania days), and I know for a fact that they're full tilt, from keeping up with old colleagues, plus where I'm at currently.

I just looked at the list and I have friends that work at most of them, with the exception of United, McKesson, Berkshire and Cencora, so either you were at one of those or you're misinformed about your ex-employer.

The entire industry for the most part is all in here.

We clearly disagree at an ideological level, so I will not try to convince you my side is correct.

Instead, I would probably be willing to bet overall maybe 10k USD that your stance is generally not representative of where we end up in 5 years.

Let's make a Polymarket and compete with dollars instead of words (slightly in jest)

eschaton · 2026-06-05T22:30:27 1780698627

Or you’re misinformed about what my old employer is actually doing, or how they’re doing it.

potsandpans · 2026-06-05T23:12:56 1780701176

I'm not

archagon · 2026-06-05T22:57:42 1780700262

Is this comment LLM generated?

Have fun with 1000x more Buns that literally no one is using or maintaining. An entire software industry built on top of a burning garbage pile of crappy, dead code.

elnatro · 2026-06-06T06:48:02 1780728482

It is, that user has responded to me using LLMs before…

int_19h · 2026-06-06T01:39:20 1780709960

> An entire software industry built on top of a burning garbage pile of crappy, dead code.

That has been the case for the last, oh, decade or so. Where do you think LLMs learned to slop code?

archagon · 2026-06-06T01:41:57 1780710117

Things have been bad, but every company using its own bespoke LLM reimplementation of rsync and similar is so, so much worse.

int_19h · 2026-06-06T02:15:10 1780712110

Why would every company do it though? They'll just all be using the same (Anthropic's) AI-enabled fork.

archagon · 2026-06-06T02:48:40 1780714120

You think Anthropic wants to be the sole maintainer of thousands of forked OSS projects...? I seriously doubt that would happen, for legal, marketing, and logistical reasons alike.

int_19h · 2026-06-06T07:12:54 1780729974

Anthropic, probably not. I could totally see Altman or even Musk deciding to do that exact thing as a showcase of sorts.

potsandpans · 2026-06-06T00:31:53 1780705913

> Is this comment LLM generated?

No, you might be experiencing online psychosis. No longer able to distinguish between generated text and things you don't agree with.

archagon · 2026-06-06T00:49:28 1780706968

It just reads like LinkedIn slop. One melodramatic sentence after another.

Consider collecting related thoughts into paragraphs.

potsandpans · 2026-06-06T00:55:45 1780707345

Oh right, you're the CRDT research guy. I don't give a fuck about what you think about my thought process or writing style.