Hacker News | simiones's comments

To be fair to GP, while I agree that literary translations are still much better left to professional translators, the specific examples given have actually been moving in the opposite direction in my experience - at least for translations from English. In my own language, I've seen articles published in translation under license, on major local news sites, including mistranslations as ridiculous as "1000 lb bombs" rendered as "1000 £ bombs" ("bombe de 1000 de lire").

News sites are extremely cash-strapped; I bet it's all automatic translation, maybe with the exception of a few important articles, and it was like that even before LLMs.

On my local news sites I still see crap like English word order with Romanian words on filler articles, which means it wasn't even an LLM.


During the Ukraine war I saw many Russian- or Ukrainian-language articles translated with just Google Translate, and the result was a hopeless jumble of errors.

Even GPT-4 was massively better.

Some people just don't understand how general-purpose these chatbots are and insist on continuing to use single-purpose tools that have been left in the dust.


It makes no sense from a classical economics perspective to keep the theater empty. Even if no one is buying at the break-even price, it can make sense to sell below cost just to recoup some of the investment - and adjust the investment in the future, of course.

Now, in reality there are second-order effects, of course - like people getting used to the below-cost ticket prices and being even less incentivized to buy at the normal price.


I guess it depends on who gets paid for the movie being shown, and who gets paid when a ticket is sold.

If it is free to show the movie, then there is no penalty to running extra sessions. If it isn't free, someone is being paid; if that someone is different from whoever receives the ticket money, they care more about sessions than viewings.


Seems the same as a hotel - empty rooms benefit no one.

I don't know about theatres, but I do know about hotel rooms.

If you lower the price too much, you get a different sort of clientele. The sort of person who wrecks the place and annoys all the other patrons nearby.

Then the cleanup costs a lot. Often more than the amount of revenue collected on the room.

It absolutely makes more sense to keep the hotel room empty than to lower the price to keep it fully occupied.


> I'm glad they didn't go with the idiotic Go approach ("every path is a valid UTF-8 string", or "we just garble the path at the standard library level")

Can you expound a bit on this? I haven't been able to find any articles related to this kind of problem. It's also a bit surprising, given that Go specifically did not make the same choice as Rust to make strings be Unicode / UTF-8 (Go strings are just arrays of bytes, with one minor exception related to iteration using the range syntax).


Go's docs put it like this: Path names are UTF-8-encoded, unrooted, slash-separated sequences of path elements, like “x/y/z”. If you operate on a path that's a non-UTF-8 string, then Go will do... something to make the string work with UTF-8 when passed back to standard file methods, but it likely won't end up operating on the same file.
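For what it's worth, that contract is checkable with fs.ValidPath, which rejects non-UTF-8 names up front - a minimal sketch (the \xff byte is just an arbitrary invalid-UTF-8 example):

  package main

  import (
    "fmt"
    "io/fs"
  )

  func main() {
    fmt.Println(fs.ValidPath("x/y/z"))   // true
    fmt.Println(fs.ValidPath("/x/y/z"))  // false: rooted paths are rejected
    fmt.Println(fs.ValidPath("x/\xffy")) // false: not valid UTF-8
  }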

Rust has OsStr to represent strings like paths, with a lossy/fallible conversion step instead.

Go's approach is fine for 99% of cases, and you're pretty screwed if your application hits the 1% case. Go has a lot of decisions like that, often made to simplify the standard library for the use cases most people usually run into (like its awful, lossy, incomplete conversion between Unix and Windows semantics when it comes to permissions/read-only flags/etc.).
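(To make the permissions point concrete: the os.Chmod docs say that on Windows only the 0o200 owner-write bit is used, toggling the read-only attribute. A minimal sketch - demo.txt is just a placeholder name:)

  package main

  import (
    "fmt"
    "log"
    "os"
  )

  func main() {
    // demo.txt is just a placeholder file for illustration
    if err := os.WriteFile("demo.txt", []byte("hi"), 0o644); err != nil {
      log.Fatal(err)
    }
    // Per the os.Chmod docs, on Windows only the 0o200 (owner write)
    // bit is used: it sets or clears the read-only attribute. The
    // group, other, and execute bits are silently dropped.
    if err := os.Chmod("demo.txt", 0o754); err != nil {
      log.Fatal(err)
    }
    info, err := os.Stat("demo.txt")
    if err != nil {
      log.Fatal(err)
    }
    fmt.Printf("%o\n", info.Mode().Perm()) // 754 on Unix; 666/444 on Windows
  }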


> Path names are UTF-8-encoded, unrooted, slash-separated sequences of path elements, like “x/y/z”

This is only for the "io/fs" package and its generic filesystem abstractions. The "os" package, which always operates on the real filesystem, doesn't actually specify how paths are encoded, nor does its associated helper package "path/filepath".

In practice, non-UTF-8 already wasn't an issue on Unix-like systems, where file paths are natively just byte sequences. You do need to be aware of this possibility to avoid mangling the paths yourself, though. The real problem was Windows, where paths are actually WTF-16, i.e. UTF-16 with unpaired surrogates. Go has addressed this issue by accepting WTF-8 paths since Go 1.21: https://github.com/golang/go/issues/32334#issuecomment-15500...


The `os` package, which is how everyone I've seen opens and reads files in Go, doesn't specify any restriction on its path syntax (except that it uses `string`, of course). I've tried using it on Linux with a file name that is invalid UTF-8, and it works without any issues.
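(A minimal way to reproduce that check on Linux - the file name is made up, and the \xff byte is what makes it invalid UTF-8:)

  package main

  import (
    "fmt"
    "log"
    "os"
  )

  func main() {
    name := "report-\xff.txt" // invalid UTF-8, but a legal Linux path
    if err := os.WriteFile(name, []byte("data"), 0o644); err != nil {
      log.Fatal(err)
    }
    b, err := os.ReadFile(name) // round-trips to the same file
    if err != nil {
      log.Fatal(err)
    }
    fmt.Printf("%q\n", b) // "data"
  }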

I for one hadn't even heard of the io/fs package that has the problems you mention, and I don't remember ever seeing it used in an example. I've looked in a code base I help maintain, and the only uses I could find are some function type definitions used by filepath.WalkDir and filepath.Walk - and those functions explicitly document that they don't use `io/fs`-style paths when calling the callbacks; they don't even respect the path separator format:

  // WalkDir calls fn with paths that use the separator character appropriate
  // for the operating system. This is unlike [io/fs.WalkDir], which always
  // uses slash separated paths.
  func WalkDir(root string, fn fs.WalkDirFunc) error {
Where fs.WalkDirFunc is defined like this:

  type WalkDirFunc func(path string, d DirEntry, err error) error

> Go strings are just arrays of bytes,

https://go.dev/ref/spec#String_types: “A string value is a (possibly empty) sequence of bytes”

https://pkg.go.dev/strings@go1.26.2: “Package strings implements simple functions to manipulate UTF-8 encoded strings.”

So, yes, Go strings are just arrays of bytes in the language, but in the standard library they're supposed to be UTF-8 (the documentation isn't immediately clear on how non-UTF-8 strings are handled).
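The split is easy to see in a small sketch: byte-level functions pass invalid bytes through untouched, while rune-decoding ones quietly replace them with U+FFFD:

  package main

  import (
    "fmt"
    "strings"
    "unicode/utf8"
  )

  func main() {
    s := "a\xffb" // contains one invalid UTF-8 byte
    fmt.Println(utf8.ValidString(s))         // false
    fmt.Println(strings.Contains(s, "\xff")) // true: byte-level, no decoding
    // ToUpper decodes runes, so the bad byte comes out as U+FFFD:
    fmt.Printf("%q\n", strings.ToUpper(s))   // "A\ufffdB"
  }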

I think this may be why the OP thinks the Go approach is “every path is a valid UTF-8 string”.


A policy like this has two purposes. One, to give good-faith potential contributors a guideline on what the project expects. Two, to give reviewers a clear policy they can point to when rejecting AI slop PRs, without feeling bad or getting into conflicts over minutiae of the code.


Right, "good faith" is a key idea that is being ignored. If you want to lie to the lead SDL maintainers and claim your code is 100% human-written, you can probably get away with it. But that is unethical and cynical behavior in pursuit of an astonishingly petty goal. And it's correct for SDL to simply ignore the contribution because it came from a dishonest developer, even if the specific code appears to be very good.


Per Wikipedia, Down's syndrome currently occurs in ~1 in 1000 live births, and used to occur in 2 in 1000 live births some decades ago, in the USA. That means that a test with a 1% false positive rate (99% accuracy) will lead to a false positive for 98-99 healthy embryos per 1000 live births. I would say that this is fair to call "not all that accurate".

Note: I am not in any way saying that this means people shouldn't trust the tests, or anything like that. Just reminding everyone that a test's accuracy has to be compared to the incidence of the disease to decide whether it's high or not.


> lead to a false positive for 98-99 healthy embryos per 1000 live births.

The number you’re looking for is 9, not 99
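Roughly, assuming ~1-2 true cases per 1000 live births and perfect sensitivity:

  per 1000 live births: ~1-2 affected, ~998-999 healthy
  false positives:      1% of ~999 healthy ≈ 9-10 (vs. 1-2 true positives)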


Oops... Off by one [order of magnitude]...


Screening ‘test’ vs diagnostic ‘test’ is an important concept.

Screening tests are designed for sensitivity - false positives are expected, and they identify who would benefit from additional diagnostic tools and procedures.


Caster Semenya is a woman, not sure why you're referring to her as him. The fact that she has a potentially unfair advantage due to her unusual genetics in women's competitions doesn't in any way make it fair to refer to her in this way.


If you look at accounts from Semenya's early life there is evidence against his account of growing up as a girl. For example, there have been school photos published showing him wearing a boy's uniform near to a group of girls who were all wearing girl's uniforms. His former school headmaster, when interviewed years later, said he thought that Semenya was a boy and was very surprised to hear that he was now competing in women's athletics.

And of course he would have gone through male puberty, not female puberty. This would have been obvious then, and the result of this is obvious now if you see him in interviews. Male-typical build, male-typical vocal tone. Even his now-wife assumed (correctly) that he is male when she first met him.

Semenya has to double down on this narrative that he is a woman otherwise he will have to admit that his successful sporting career as a woman will have been a lie.


Even if you believe that she lived her early life as a male, at the point that a person has made it clear that they have a preferred pronoun or are trans, would it not just be disrespectful to intentionally refer to them counter to that?


If I had chosen to refer to Semenya using pronouns that imply he is female, that would have conflicted with the points I was making.


I don't know the specifics in this case, but a person can be biologically male and use the female gender. How would that conflict with your point?


I think this is a bit too broad. There are actually three possible cases.

When there is similar code, the only possible defense to prove that you have not copied the original is to show that your process was a clean-room re-implementation.

If the code is completely different, then clean room or not is indeed irrelevant. The only way the author can claim that you violated their copyright despite no apparent similarity is for them to have proof you followed some kind of mechanical process for generating the new code based on the old one, such as using an LLM with the old code as input prompt (TBD, completely unsettled: what if the old code is part of the training set, but was not part of the input?) - the burden of proof is on them to show that the dissimilarity is only apparent.

In realistic cases, you will have a mix of similar and dissimilar portions, and portions where the similarity is questionable. Each of these will need to be analyzed separately - and it's very likely that all the similar portions will need to be rewritten if you can't prove that they were not copied, directly or from memory, from the original, even if they represent a very small part of the work overall. Even if you wrote a 10k-page book, if you copied one whole page verbatim from another book, you will be liable for that page, and the author may force you to take it out.


> When there is similar code, the only defense possible to prove that you have not copied the original is to show that your process is a clean room re-implementation.

Yes, but you do not have to prove that you haven’t copied the original; you have to prove you didn’t infringe copyright. For that there are other possible defenses, for example:

- fair use

- claiming the copied part doesn’t require creativity

- arguing that the copied code was written by AI (there's precedent saying AI-generated art can't be copyrighted: https://www.theverge.com/2023/8/19/23838458/ai-generated-art... - it's not impossible judges will make similar rulings for AI-generated programs)


Courts have ruled that you can't assign copyright to a machine, because only humans can be authors.** There is currently no legal consensus on whether or not humans using AI tools are creating derivative works when they use AI models to create things.

** this case is similar to an old case where a ~~photographer~~ PETA claimed a monkey owned a copyright to a photo, because they said a monkey took the photo completely on their own. The court said "okay well, it's public domain then because only humans can have copyrights"

Imagine you put a Harry Potter book in a copy machine. It is correct that the copy machine would not hold a copyright to the output. But you would still be violating copyright by distributing the output.


https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput... Specifically, the photographer claimed he owned the copyright on a photo he didn't directly take; PETA weighed in, arguing the monkey owned the copyright.


Ah yeah you’re right I forgot it was PETA arguing that.


> there’s jurisdiction that says AI-generated art can’t be copyrighted

The headline was misleading. The courts said that what Thaler could have copyrighted was a complicated question, which they set aside because he had said he was not the author.


- Arguing that you owned the copyright on the copied code (the author here has apparently been the sole maintainer of this library since 2013 - not all, but a lot of the code that could have been copied here probably already belongs to him...)


The burden of proof is completely uncharted territory when it comes to LLMs. Burden of proof is assigned by court precedent, not the Copyright Act itself (in US law). That means a court looking at a case like this could (and should) treat the use of an LLM trained on the copyrighted work as a distinguishing factor that shifts the burden to the defense. As a matter of public policy, it's not great if infringers can use the poor accountability properties of LLMs to hide from the consequences of illegally redistributing copyrighted works.


The way I see it, it goes like this:

1. Initially, when you claim that someone has violated your copyright, the burden is on you to make a convincing claim on why the work represents a copy or derivative of your work.

2. If the work doesn't obviously resemble your original, which is the case here, then the burden is still on you to prove that either

(a) it is actually very similar in some fundamental way that makes it a derived work, such as being a translation or a summary of your work,

or (b) it was produced following some kind of mechanical process and is not a result of the original human creativity of its authors.

Now, in regards to item 2b, there are two possible uses of LLMs that are fundamentally different.

One is actually very clear cut: if I give an LLM a prompt consisting of the original work + a request to create a new work, then the new work is quite clearly a derived work of the original, just as much as a zip file of a work is a derived work.

The other is very much not settled yet: if I give an LLM a prompt asking it to produce a piece of code that achieves the same goal as the original work, and the LLM had the original work in its training set, is the output of the LLM a derived work of the original (and possibly of other parts of the training set)? Of course, we'll only consider the case where the output doesn't resemble the original in any obvious way (i.e. the LLM is not producing a verbatim copy from memory). This question is novel, and I believe it is currently being tested in court in some cases, such as the NYT's case against OpenAI.


On the other hand, as a matter of public policy, nobody should be able to claim copyright protection for the process of detecting whether a string is correctly formed unicode using code that in no material way resembles the original. This is not rocket science.


> IMO this is pretty common sense. No one's arguing they're authoring generated code; the whole point is to not author it.

Actually, this is very much how people think about code.

Consider the following consequence. Say I work for a company. Every time I generate some code with Claude, I keep a copy of said code. Once the full code is tested and released, I throw away any code that was not working well. Now I leave the company and approach their competitor. I provide all of the working code generated by Claude to the competitor. Per the new ruling, this should be perfectly legal, as this generated code is not copyrightable and thus doesn't belong to anyone.


No software company thinks this, not Oracle, not Google, not Meta, no one. See: the guy they sued for taking things to Uber.


The person I replied to said "No one's arguing they're authoring generated code; the whole point is to not author it.". My point was that people absolutely do think and believe strongly they are authoring code when they are generating it with AI - and thus they are claiming ownership rights over it.


(the person you originally replied to is also me, tl;dr: I think engineers don't think they're authoring, but companies do)

The core feature of generative AI is the human isn't the author of the output. Authoring something and generating something with generative AI aren't equivalent processes; you know this because if you try and get a person who's fully on board w/ generative AI to not use it, they will argue the old process isn't the same as the new process and they don't want to go back. The actual output is irrelevant; authorship is a process.

But, to your point, I think you're right: companies very much think their engineers have the rights to the output they assign to them. If it wasn't clear before, it's clear now: engineers shouldn't be passing off generated output as authored output. They have to have the right to assign the totality of their output to their employer (same as when using MIT code or whatever), so that it ultimately belongs to the company or the company has a valid license to use it. If they break that agreement, they break their contract with the company.


(oops, I didn't check the usernames properly, sorry about that)

I still don't think this is fully accurate.

The view I'm noticing is that people consider that they have a right to the programs they produce, regardless of whether they are writing them by hand or by prompting an LLM in the right ways to produce that output. And this remains true both for work produced as an employee/company owner, and for code contributed to an OSS project.

Also, as an employee, the relationship is very different. I am hired to produce solutions to problems my company wants resolved. This may imply writing code, finding OSS code, finding commercial code that we can acquire, or generating code. As part of my contract, I relinquish any rights I may have to any of this code to the company, and of course I commit to not use any code without a valid license. However, if some of the code I produce for the company is not copyrightable at all, that is not in any way in breach of my contract - as long as the company is aware of how the code is produced and I'm not trying to deceive them, of course.

In practice, at least in my company, there has been a legal analysis and the company has vetted a certain suite of AI tools for use for code generation. Using any other AI tools is not allowed, and would be a breach of contract, but using the approved ones is 100% allowed. And I can guarantee you that our lawyers would assert copyright to any of the code generated in this way if I was to try to publish it or anything of the kind.


Every contract I've seen has some clause where the employee affirms they have the right to assign the rights to their output (code, etc) to the company.

I'm not really convinced; I think if I vibe code an app, and you vibe code an app that's very, very similar, and we're both AI believers, we probably both go "yup, AI is amazing; copyright is useless." You know this because people are actively trying to essentially un-GPL things with vibe coding. That's not authoring, that's laundering, and people only barely argue about it. See: this chardet situation, where the guy was like "I'm intimately familiar with the codebase, I guided the LLM, and I used GPL code (tests and API definitions, which are all under copyright) to ensure the new implementation behaved very similarly to the old one." Anything in the new codebase is either GPL'd or LLM generated, which according to the copyright office, isn't copyrightable. If he's right, nothing prevents me from doing the exact same thing to make a new public domain chardet. It's facially absurd.


The copyright argument is the only relevant argument. If the new work is a derived work of the original, then it follows by definition that the new work is under the copyright of the original's author(s). Since the original chardet was distributed by its author(s) only under the LGPL, any copy/derivative of it that anyone else creates must be distributed only under the LGPL, per the terms of the LGPL.

Now, whether chardet 7.0.0 is a derivative of chardet or not is a matter of copyright law that the LGPL has no say on, and a rather murky ground with not that much case law to rely on behind it. If it's not, the new author is free to distribute chardet 7.0.0 under any license they want, since it is a new work under his copyright.


Producing a copy of a copyrighted work through a purely mechanical process is a clear violation of copyright. LLMs are absolutely no different from a copier machine in the eyes of the law.

Original works can only be produced by a human being, by definition in copyright law. Any artifact produced by an animal, a mechanical process, a machine, a natural phenomenon, etc. is either a derived work, if it started from an original copyrighted work, or a public-domain artifact not covered by copyright law, if it didn't.

For example, an image created on a rock struck by lightning is not a copyright-covered work. Similarly, an image generated by a diffusion model from a randomly generated sentence is not a copyrightable work. However, if you feed a novel as a prompt to an LLM and ask for a summary, the resulting summary is a derived work of said novel, and it falls under the copyright of the novel's owner - you are not allowed to distribute copies of the summary the LLM generated for you.

Whether the output of an LLM, or the LLM weights themselves, might be considered derived works of the training set of that LLM is a completely different discussion, and one that has not yet been settled in court.

