
We had to set it up on parts of the VideoLAN infra so the service would remain usable.

Otherwise it was under a constant DDoS by the AI bots.


Maybe I’m naive about this, but I didn’t expect AI scrapers to be that big of a load. It’s not as if they need to scrape the same content at 1000+ QPS, and even then I wouldn’t expect them to download all the media and images either.

What am I missing that explains the gap between this and “constant DDoS” of the site?


You can't really cache the dynamic content produced by forges like GitLab and, say, web forums like phpBB. So every request goes through the slow path. Media/JS is of course cached on the edge, so that's not an issue.

Even when the amount of AI requests isn't that high - generally it's in the hundreds per second tops for our services combined - that's still a load that causes issues for legitimate users/developers. We've seen it grow from somewhat reasonable to pretty much being 99% of the responses we serve.

Can it be solved by throwing more hardware at the problem? Sure. But it's not sustainable, and the reasonable approach in our case is to filter off the parasitic traffic.


Thanks, appreciate the details. 99% is far above what I expected, and if it specifically hits hard-to-cache data then I can see how that brings a system to its knees.

You kind of can though. You serve cached assets and then use JavaScript to modify the page for the individual user. The specific user actions can't be cached, but the rest can.
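
Roughly like this; a sketch assuming a Flask app, where the page, route, and /api/me endpoint are all made up for illustration:

  # Cacheable HTML shell, personalized client-side after load.
  from flask import Flask, jsonify, make_response

  app = Flask(__name__)

  PAGE = """<html><body>
  <h1>Thread</h1><span id="user">guest</span>
  <script>
    // Swap in the per-user bits after the cached shell renders.
    fetch('/api/me').then(r => r.json()).then(d => {
      document.getElementById('user').textContent = d.name;
    });
  </script></body></html>"""

  @app.route("/thread/<int:tid>")
  def thread(tid):
      # Same bytes for every visitor, so a CDN can cache it.
      resp = make_response(PAGE)
      resp.headers["Cache-Control"] = "public, max-age=300"
      return resp

  @app.route("/api/me")
  def me():
      # Per-user, so it must stay out of shared caches.
      resp = make_response(jsonify(name="alice"))
      resp.headers["Cache-Control"] = "private, no-store"
      return resp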

Totally. Remember, Slashdot in the 1990s used to serve a dynamic page from a handful of servers with horsepower dwarfed by a Nintendo Switch, to a user base capable of bringing major properties down.

The "can't" comes from the fact that VLC is not going to rewrite their forum software or software forge.

Software written in PHP is in most cases frankly still abysmally slow and inefficient. WordPress runs like 70% of the web and you can really feel it from the 1500ms+ TTFB most sites have. phpBB is not much better. Pathetic throughput at best, and it hasn't gotten better in decades now.

I don't know how GitLab became so disgustingly slow. But yeah, I'm not surprised bots can easily bring it to its knees.


> WordPress runs like 70% of the web and you can really feel it from the 1500ms+ TTFB most sites have. phpBB is not much better.

At least phpBB died 15 years ago, with most communities migrating to XenForo. I'm not quite sure how or why WP is still around with so many SSGs and SaaS site builders floating around these days.


XenForo is not much better and has many "administrators" whining about bot traffic as well.

The funniest part about WordPress is that you can usually get at least a 50% speed boost just by adding a plugin that minifies and caches the ridiculous number of dynamic CSS and JS files that most themes and plugins add to every page. Set those up with HTTP 103 Early Hints preload headers (so the browser can start fetching subresources before the HTML is even sent, exactly the kind of thing HTTP/2 and /3 were designed to make possible), then throw Cloudflare or another decent CDN on top, and you're suddenly getting TTFBs much closer to a more "modern" stack.
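
For the Early Hints part, the origin mostly just has to emit preload Link headers; a CDN that supports 103 (Cloudflare does) can then replay them before the full response. A sketch, again with Flask and made-up asset paths:

  from flask import Flask, make_response

  app = Flask(__name__)

  @app.route("/")
  def index():
      resp = make_response("<html>...</html>")
      # One minified bundle each for CSS and JS instead of dozens of
      # per-plugin files; a CDN can surface these as 103 Early Hints.
      resp.headers.add("Link", "</assets/site.min.css>; rel=preload; as=style")
      resp.headers.add("Link", "</assets/site.min.js>; rel=preload; as=script")
      return resp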

The bizarre thing is that pretty much no CMS, even the "new" ones, seems to automate any of that by default. None of those steps is that difficult to implement, and in my experience they provide a serious speed boost to everything from WordPress to MediaWiki, and yet the only service that comes close to offering it is Cloudflare.

Even then, Cloudflare's tooling only works at its best if you're already emitting minified, compressed files and custom-written preload headers on the origin side, since the hit from decompressing all the origin traffic to make those adjustments and analyses is far worse for performance than just forwarding your compressed responses directly. That's why they removed Auto Minify[1] and encourage sending pre-compressed Brotli level 11 responses from the origin[2], so people on recent browsers get pass-through compression without extra cycles being spent on Cloudflare's servers.
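
Pre-compressing is a small build step; a sketch using the brotli PyPI package, where the public/ directory and file types are placeholders:

  import brotli, gzip, pathlib

  for path in pathlib.Path("public").rglob("*"):
      if not path.is_file() or path.suffix not in {".html", ".css", ".js", ".svg"}:
          continue
      data = path.read_bytes()
      # Brotli quality 11 is slow, which is fine offline; never do it per-request.
      (path.parent / (path.name + ".br")).write_bytes(brotli.compress(data, quality=11))
      (path.parent / (path.name + ".gz")).write_bytes(gzip.compress(data, compresslevel=9))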

The solution seems pretty clear: aim to serve as much as you can statically, preferably pre-compressed. But it's weird that implementing that is still a manual process on most CMSes, when it shouldn't be hard to make it a standard feature.

And as for Git web interfaces, the correct solution is to require logins to view complete history. Nobody likes saying it, nobody likes hearing it. But Git is not efficient enough on its own to handle the constant bombardment of random history paginations and diffs that AI crawlers seem to love. It wasn't an issue before, because old crawlers for things like search engines were smart enough to ignore those types of pages, or at least to accept it when the sysadmin said they should be ignored. AI crawlers have no limits, ignore signals from site operators, make no attempt to skip redundant content, and are in general very dumb about how they send requests (this is a large part of why Anubis works so well: it's not a particularly complex or hard-to-bypass proof-of-work system[3], but AI bots genuinely don't care about anything except consuming as many HTTP 200s as a server can return, and give up at the slightest hint of pushback, though they do at least try randomizing IPs and User-Agents, since those are effectively zero-cost to attempt).
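
To illustrate how little machinery is involved, here's a toy hash-based proof of work in Python; this is the general shape of the idea, not Anubis's actual implementation, and the difficulty is made up:

  import hashlib, itertools, os

  def solve(challenge, bits=16):
      # Find a nonce whose SHA-256 digest starts with `bits` zero bits.
      prefix = "0" * (bits // 4)  # zero hex digits
      for nonce in itertools.count():
          digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
          if digest.startswith(prefix):
              return nonce

  def verify(challenge, nonce, bits=16):
      digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
      return digest.startswith("0" * (bits // 4))

  challenge = os.urandom(16).hex()   # server-issued, random per visitor
  nonce = solve(challenge)           # trivial for one human's browser...
  assert verify(challenge, nonce)    # ...expensive at crawler volumes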

[1]: https://community.cloudflare.com/t/deprecating-auto-minify/6...

[2]: https://blog.cloudflare.com/this-is-brotli-from-origin/

[3]: https://lock.cmpxchg8b.com/anubis.html but see also https://news.ycombinator.com/item?id=45787775 and then https://news.ycombinator.com/item?id=43668433 and https://news.ycombinator.com/item?id=43864108 for how it's working in the real world. Clearly Anubis actually does work, given testimonials from admins and wide deployment numbers, but that can only mean that AI scrapers aren't actually implementing effective bypass measures. Which does seem pretty in line with what I've heard about AI scrapers, summarized well in https://news.ycombinator.com/item?id=43397361: they are basically making no attempt to optimize how they crawl. The general consensus seems to be that if they were going to crawl optimally, they'd just pull down a copy of Common Crawl like every other major data analysis project has done for the last two decades, but all the AI companies are so desperate to get just slightly more training data than their competitors that they're repeatedly crawling near-identical Git diffs on the off-chance they reveal some slightly different permutation of text to use. This is also why open source models have been able to almost keep pace with the state-of-the-art models coming out of the big firms: they're designing far more efficient training processes, while the big guys keep throwing hardware and crawlers at the problem in the desperate hope that they can will it into an Amazon model instead of a Ben and Jerry’s model[4].

[4]: https://www.joelonsoftware.com/2000/05/12/strategy-letter-i-... - still probably the single greatest blog post ever written, 26 years later.


> And as for Git web interfaces, the correct solution is to require logins to view complete history.

Why logins, exactly? Who would have such logins; developers only, or anyone who signs up? I'm not sure if this is an effective long-term mitigation, or simply a “wall of minimal height” like you point out that Anubis is.


I think there's a few things at play here

- AI scrapers will pull a bunch of docs from many sites in parallel (so instead of a human request where someone picks a single Google result, it hits a bunch of sites)

- AI will crawl the site looking for the correct answer which may hit a handful of pages

- AI sends requests in quick succession (big bursts instead of small trickle over longer time)

- Personal assistants may crawl the site repeatedly scraping everything (we saw a fair bit of this at work, they announced themselves with user agents)

- At work (B2B SaaS webapp) we also found that the personal-assistant variety tended to hammer really computationally expensive data export and reporting endpoints, generally without filters. While our app technically supported it, it was very inorganic traffic

That said, I don't think the solution is blanket blocks. Really, it's exposing that sites are poorly optimized for emerging technology.


Also, relevant for forges: AI doesn't understand what it's clicking on. Git forges tend to have a lot of links like “download a tarball at this revision”, which are super-expensive as far as resources go, and AI crawlers will click on those because they click on every link that looks shiny. (And there are a lot of revisions in a project like VLC!) Much, much more often than humans do.

This is also irrelevant to the original comment, which is complaining about bot checks for looking at the root of the repository - probably the most-requested resource, which should be 100% served from cache at a cost much lower than running the bot checks.

It's simply bad, inefficient software and we shouldn't keep making excuses for it.


Agree. Did some basic searching, and it looks like GitLab is particularly bad. It ships with built-in rate limiting, but the backend marks all pages as uncacheable on top of them being somewhat dynamically generated (I guess it caches "page fragments").

The only issues I found amounted to "here's how to use Anubis to block everything"

There are also some new but poorly supported standards around agents setting `Accept: text/markdown`, and https://github.com/cloudflare/web-bot-auth
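
E.g. honoring that header is a few lines of content negotiation; a sketch with Flask, where the route and content are made up:

  from flask import Flask, request

  app = Flask(__name__)

  @app.route("/docs/install")
  def install_docs():
      # Agents asking for markdown get the cheap plain-text version.
      if "text/markdown" in request.headers.get("Accept", ""):
          return "# Install\n\nRun make && make install.", 200, {"Content-Type": "text/markdown"}
      return "<html><body><h1>Install</h1>...</body></html>"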


They are a scourge: they never rate-limit themselves, there are a hundred of them, and a significant number don’t respect robots.txt. Many of them also end up on our meta:no-index,no-follow search pages, leading to cost overruns on our Algolia usage. We spend way more time adjusting WAF and other bot controls than we should.


Thanks. I imagine there is (a) a lot of interest in scraping source code, and (b) many requests to forges hitting expensive paths. 99% of volume though, wow, much more than expected.

You've gotten several comprehensive responses so far, and I want to add a niche corner that people might assume doesn't have the bot problem but still does.

I run a website that hosts tools for my family: games and a TV interface for the kids, remote access to our family cloud and cameras, etc. Sensitive things require a login and additional parameters for access, of course.

I specifically blocked search engine bots so my site is never indexed, as I'm not selling anything nor want any attention, as well as some other public non-malicious bots in case they communicate with Google, just to be safe, and my robots.txt doesn't allow anything.

I assume, then, that the only way a bot could even find my site is to do what the indexers do: brute-force every single possible IPv4 address hoping to hear something back, as my domain should not be known (and isn't simple enough to be quickly guessed), and most traffic must be malicious or indexing (AI overviews and other scrapers won't be finding it via web search).

Since it isn't indexed, and I keep everything in simple black-and-white boxes, my remaining traffic is either family or malicious bots, and 99.9% isn't family.

I currently have the strictest bot-blocking setup I could come up with, which nicely cut down on quite a bit of traffic, but I still receive ~2k attempts per day, which, as you can imagine, is around 99% bot traffic, as I have fewer than 20 kids and my kids aren't using the site nonstop.

Conveniently, my setup has never accidentally blocked a family member, so I'm pleased with the setup.


> I assume, then, that the only way a bot could even find my site is to do what the indexers do: brute-force every single possible IPv4 address hoping to hear something back, as my domain should not be known

If your site uses https, they could also get your domain from the certificate transparency logs for the certificate you use.
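
E.g. crt.sh exposes the CT logs over a JSON endpoint, so enumerating every hostname you've ever gotten a certificate for takes a few lines (the domain here is a placeholder):

  import requests

  r = requests.get("https://crt.sh/",
                   params={"q": "%.example.com", "output": "json"},
                   timeout=30)
  for entry in r.json():
      # name_value may hold several newline-separated hostnames.
      print(entry["name_value"])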


I didn't think of that, but it makes complete sense, as it is HTTPS. I think my info was sold by my registrar as well, because solicitors call or email me on occasion saying they "accidentally came across my site" and offering design/JS/etc. help.

You can get around this by grabbing a wildcard certificate and then using a hard-to-guess subdomain.

While I do sympathize with the AI DDoS situation, it'd be nice if there were a solution that allows them to work so they can pull official docs.

For instance: MCP, static sites that are easy to scale, or a cache in front of a dynamic site engine.


Of course, static websites are the best solution to that problem.

Our documentation and main website are not fronted by this protection, so they're still accessible to the scrapers.


Have you considered making the pages mostly static and cacheable for non-logged-in users first? There is no reason a repository listing needs to be this resource-intensive.

I highly doubt there is no other technically feasible option to block the AI bots. You end up blocking not just bots, but many humans too. When I clicked on the link and the bot block came up, I just clicked back. I think HN posts should have warnings when the site blocks you from seeing it until you somehow, maybe, prove you are human.

I'm sure there are many solutions for many problems, but expecting a small FOSS development team to know or implement them all is rather unreasonable.

I think the world gains more if the VideoLAN team focuses on their amazing, free contribution to the world than if they spend the same time trying to figure out how to save you two clicks.

We all hate that this is happening, but you don't need to attack everyone that is unfortunately caught up in it.


> I highly doubt there is no other technically feasible option to block the AI bots.

If you have discovered such an option, you could get very wealthy: minimizing friction for humans in e-commerce is valuable. If you're a drive-by critic not vested in the project, then yours is an instance of talk being cheap.


I'm all ears on how we can fix it otherwise.

Keep in mind that those kinds of services:

- should not be MITMed by CDNs

- are generally run by volunteers with zero budget, both money- and time-wise


First off, don't block the first connection of the day from a given IP. Rate-limit or block from there, for example the way sshguard does it.
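
E.g. a toy token bucket that lets a fresh IP's first burst through and throttles it from there (thresholds made up):

  import time
  from collections import defaultdict

  RATE, BURST = 1.0, 10.0   # tokens refilled per second, bucket capacity

  buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

  def allow(ip):
      b, now = buckets[ip], time.monotonic()
      b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
      b["ts"] = now
      if b["tokens"] >= 1.0:
          b["tokens"] -= 1.0
          return True
      return False   # over the limit: block, delay, or challenge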

I've seen several posts on HN and elsewhere showing many bots can be fingerprinted and blocked based on HTTP headers and TLS.

For the bots that perfectly match the fingerprint of an interactive browser and don't trigger rate limits, use hidden links to tarpits and zip bombs. Many of these have been discussed on HN. Here's the first one that came to memory: https://news.ycombinator.com/item?id=42725147


> don't block the first connection of the day from a given IP.

The bots come from a huge number of IP addresses, so that won't really help much. And it doesn't solve the UX problem either, because most pages require multiple requests for additional assets, and requiring human verification at that point is a lot more complicated than for the initial request.

> For the bots that perfectly match the fingerprint of an interactive browser

That requires properly fingerprinting the browser, which will almost certainly produce false positives for people on unusual browsers or with anti-fingerprinting measures enabled.

> use hidden links to tarpits and zip bombs.

That can waste the bot operator's resources, but it doesn't necessarily protect your site from bots. And it also requires quite a bit of work that small projects don't have time for.

Unless there is a prebuilt solution that is at least as easy to deploy and at least as effective as something like Anubis, it isn't really practical for most sites.


Just use the official packages from https://nginx.org/en/linux_packages.html#Ubuntu

nginx-module-acme is available there, too, so you don't need to compile it manually.


I'm currently working on getting the Trixie packages uploaded. It'll be there this week.

As you've said, Debian 13 was released 4 days ago, and it takes some time to spin up the infrastructure for a new OS (and we've been busy with other tasks, like getting nginx-acme and 1.29.1 out).

(I work for F5)


Is that correct? My understanding is that the SWIFT transfer will go through a US correspondent bank in this case.


Not necessarily. A correspondent bank is simply a bank where the recipient bank has an account in the specified currency. For domestic currencies used only in certain countries, that bank will almost always be in that country. For currencies used globally - not necessarily.


Uh, Siberia and the Russian European North.


Hi!

The official nginx Docker images ship with the HTTP/3 module enabled, and we released the updated ones earlier today, so please update to stay secure.

You can also run something like:

  $ docker run -ti --rm nginx:latest nginx -V

to check which modules are compiled into the binary you're running.

Thanks!


Why are they enabled by default while this page says otherwise? https://my.f5.com/manage/s/article/K000138444

  Note: The HTTP/3 QUIC module is not enabled by default and is considered experimental.


Because they are not so "official", maybe?


VLC has a port to WebAssembly: https://code.videolan.org/jbk/vlc.js#vlcjs-vlc-for-wasmasmjs

(it's actually usable to some extent)




> We will also start asking anyone who reports bugs in AlmaLinux OS to attempt to test and replicate the problem in CentOS Stream as well

So what's the point of running Alma then?


- Provide a maintenance period beyond what CentOS Stream provides.

- Potentially hold back CentOS Stream updates to stay more in sync with RHEL. If RHEL is on X.Y, CentOS Stream is technically on X.(Y+1) - Alma wants to be X.Y compatible.

- Provide ABI compatibility with RHEL which CentOS Stream may not provide.


> Provide ABI compatibility with RHEL which CentOS Stream may not provide.

CentOS Stream is by definition ABI-compatible with RHEL, and is required to be.


Buffer the stream.


Don't cross the streams? ;)

