Hacker News | imnotreallynew's comments

Isn’t the legality of web scraping still... disputed?

There have been a few projects I’ve wanted to work on involving scraping, but the idea that the entire thing could be shut down with legal threats seems to make some of the ideas infeasible.

It’s strange that OpenAI has created a ~$80B company (or whatever it is) using data gathered via scraping and as far as I’m aware there haven’t been any legal threats.

Was there some law that was passed that makes all web scraping legal or something?


Web scraping the public Internet is legal, at least in the U.S.

hiQ's public scraping of LinkedIn was ruled to be within their rights and not a violation of the CFAA. I imagine that's why LinkedIn has almost everything behind an auth wall now.

Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.

Some sites have terms at the bottom that prohibit scraping—but my understanding is that those aren't generally enforceable if the user doesn't have to take any action to accept or acknowledge them.


Most of these SaaS products have a "firehose" that you can subscribe to if you are big enough (that is, able to handle the firehose). These are like RSS feeds on crack for their entire SaaS.

- https://developer.twitter.com/en/docs/twitter-api/enterprise...

- https://developer.wordpress.com/docs/firehose/


> Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.

They're legally enforceable in the sense that the scraped services generally reserve the right to terminate the authorizing account at will, or legally enforceable in that allowing someone to scrape you with your credentials (or scraping using someone else's) qualifies as violating the CFAA?


hiQ was found to be in violation of the User Agreement in the end.

In the end, it was essentially a breach of contract.


Exactly, that was my point.

hiQ's public scraping was found to be legal. It was the logged-in scraping that was the problem.

The logged-in scraping was a breach of contract, as you said.

The former is fine; the latter is not.

What OpenAI is doing here is the former, which companies are perfectly within their rights to do.


There’s currently only one situation where scraping is almost definitely “not legal”:

If the information you’re scraping requires a login, and if in order to get a login you have to agree to a terms of service, and that terms of service forbids you from scraping — then you could have a bad day in civil court if the website you’re scraping decides to sue you.

If the data is publicly accessible without a login then scraping is 99% safe with no legal issues, even if you ignore robots.txt. You might still end up in court if you found a way to correctly guess non-indexed URLs[0] but you’d probably prevail in the end (…probably).

The “purpose” of robots.txt is to let crawlers know what they can do without getting IP-banned by the operator of the website they’re scraping. Crawlers that ignore robots.txt and act more like robots than humans will generally get an IP ban.
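
For what it's worth, a polite crawler can check robots.txt before fetching anything. Here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the "example-bot" user agent, and the example.com URLs are all made up for illustration, and the rules are parsed from a string so nothing touches the network:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it from
# https://<host>/robots.txt instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Paths outside /private/ are allowed for any user agent.
allowed = parser.can_fetch("example-bot", "https://example.com/index.html")

# Paths under /private/ are disallowed by the rule above.
blocked = parser.can_fetch("example-bot", "https://example.com/private/data.html")

print(allowed, blocked)
```

Nothing forces a crawler to run this check. It only tells you what the operator has asked for, which is roughly the line between being tolerated and being banned.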

0: https://www.troyhunt.com/enumerationis-enumerating-resources...


Also worth noting there's a long history of companies with deep pockets getting away with murder (sometimes literally) because litigation in a system that costs money to engage with inherently favors the wealthier party.

Also OpenAI's entire business model is relying on generous interpretations of various IP laws, so I suspect they already have a mature legal division to handle these sorts of potential issues.


> Isn’t the legality of web scraping still... disputed?

Are you suggesting it might be illegal to... write a program that connects to a web server and asks for a specific page, and then parses that page to see which resources it wants and which other pages it links to, and treats those links in some special fashion, differently from the text content of the page?

Especially given that a web server can be configured to respond to any request with a "403 Forbidden" response, if the server determines for any reason whatsoever that it does not want to give the client the page it requested?
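
To make that concrete, the program described above is roughly the following sketch, using only the Python standard library. The names fetch, LinkExtractor, and extract_links, the sample page, and the example.com URLs are all hypothetical; the 403 handling is just one way a client might respond to a server that says no:

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch(url):
    """Ask the server for a page; return None if it answers 403 Forbidden."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        if err.code == 403:
            return None  # the server declined, so we simply stop here
        raise


class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Made-up page so the example runs without any network access.
SAMPLE_PAGE = """
<html><body>
  <p>Some text content.</p>
  <a href="/about">About</a>
  <a href="https://example.org/other">Elsewhere</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE, "https://example.com/index.html")
print(links)
```

A real crawler would call fetch() on each extracted link in turn; the sample page stands in for that step here.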


Why would it not be legal? Was there a law passed that makes it illegal?


The issue often isn't the scraping, it is often how you use the information scraped afterwards. A lot of scraping is done with no reference to any licensing information the sites being read might publish, hence image-making AI models having regurgitated chunks of scraped stock images complete with watermarks. Though the scraping itself can count as a DoS if done aggressively enough.


Scraping publicly available data from websites is no different from web browsing, period. Companies stating otherwise in their T&Cs are a joke. Copyright infringement is a different game.


Scraping is legal. Always has been, always will be. Mainly because there's some fuzz around the edges of the definition. Is a web browser a scraper? It does a lot of the same things.

IIRC LinkedIn/Microsoft was trying to sue a company based on Computer Fraud and Abuse Act violations, claiming they were accessing information they were not allowed to. Courts ruled that that was bullshit. You can't put up a website and say "you can only look at this with your eyes". Recently-ish, the scraping company was found to be in violation of the User Agreement.

So as long as you don't have a user account with the site in question or the site does not have a User Agreement prohibiting scraping, you're golden.

The problem isn't the scraping anyway, it's the reproduction of the work. In that case, it really does matter how you acquired the material and what rights you have with regard to the use of that material.


The 9th Circuit Court of Appeals found that scraping publicly accessible content on the internet is legal.

If you publish something on a publicly served internet page, you're essentially broadcasting it to the world. You're putting something on a server which specifically communicates the bits and bytes of your media to the person requesting it without question.

You have every right to put whatever sort of barrier you'd like on the server, such as a sign in, a captcha, a puzzle, a cryptographic software key exchange mechanism, and so on. You could limit the access rights to people named Sam, requiring them to visit a particular real world address to provide notarized documentation confirming their identity in exchange for a unique 2fa fob and credentials for secure access (call it The Sams Club, maybe?)

If you don't put up a barrier, and you configure the server to deliver the content without restriction, or put your content on a server configured as such, then you are implicitly authorizing access to your content.

Little popups saying "by visiting this site, you agree to blah blah blah" are not valid. Courts made the analogy to a "gate-up/gate-down" mechanism. If you have a gate down, you can dictate the terms of engagement with your server and content. If you don't have a gate down, you're giving your content to whoever requests it.

You have control over the information you put online. You can choose which services and servers you upload to and interact with. Site operators and content producers can't decide that their intent or consent be withdrawn after the fact, as once something is published and served, the only restrictions on the scraper are how they use the information in turn.

Someone who's archived or scraped publicly served data can do whatever they want with the content within established legal boundaries. They can rewrite all the AP news articles with their own name as author, insert their name as the hero in all fanfic stories they download, and swap out every third word for "bubblegum" if they want. They just can't publish or serve that content, in turn, unless it meets the legal standards for fair use. Other exceptions to copyright apply, in educational, archival, performance, accessibility, and certain legal conditions such as First Sale doctrine. Personal use of such media is effectively unlimited.

The legality of web scraping is not disputed in the US. Other countries have some silly ideas about post-hoc "well that's not what I meant" legal mumbo jumbo designed to assist politicians and rich people in whitewashing their reputations and pulling information offline using legal threats.

Aside from right to be forgotten inanity, content on the internet falls under the same copyright rules as books, magazines, or movies published on physical media. If Disney set up a stall at San Francisco city hall with copies of the Avengers movies on a thumb drive in a giant box saying "free, take one!", this would be roughly the same as publishing those movie files to a public Disney web page. The gate would be up. (The way they have it set up in real life, with their streaming services and licensed media access, the gate is down.)

So - leaving behind the legality of redistribution of content, there's no restriction on web scraping public content, because the content was served intentionally to the software or entity that visited the site. It's up to the server operator to put barriers in place and to make content private. It's not rocket surgery, but platforms want to have their cake and eat it too, with control over publicly accessible content that isn't legal or practical.

Twitter/X is a good example of impractical control, since the site has effectively become useless spam without signing in. Platforms have to play by the same rules as everyone else. If the gate is up, the content is fair game for scraping. The Supreme Court sent the decision back to a lower court, which affirmed the gate-up/gate-down test for legality of access to content.

Since Google and other major corporations have a vested interest in the internet remaining open and free, and their search engines and other tech are completely dependent on the gate up/gate down status quo, it's unlikely that the law will change any time soon.

Tl;dr: Anything publicly served is legal to scrape. Microsoft attempted to sue someone for scraping LinkedIn, but the 9th Circuit court ruled in favor of access. If Microsoft's lawyers and money can't impede scraping, it's likely nobody will ever mount an effective challenge, and the gating doctrine is effectively the law of the land.


> It’s strange that OpenAI has created a ~$80B company (or whatever it is) using data gathered via scraping

Like Google and many others.


Somewhat related, but does anyone know how to deal with a .us domain? There is no WHOIS privacy with that TLD. A few hours after I bought a .us domain, I started receiving phone calls from all over the world, literally every 5-10 minutes for days, offering services for my “new business” associated with the TLD.

Is there any way to keep legitimate contacts in the registrar without getting a ton of spam?


> Is there any way to keep legitimate contacts in the registrar without getting a ton of spam?

There is a ruling that prevents hiding the .us domain registration and contact information or using an anonymizing proxy [1].

Your information must be public. Maybe you can try the National Do Not Call Registry [2] and see if it helps in this particular case.

[1] https://www.washingtonpost.com/wp-dyn/articles/A7251-2005Mar...

[2] https://www.donotcall.gov


Isn’t Python/Django quite slow?


Not really. With caching it becomes quite fast. If you need extreme speed, look into a small Go service or similar, but for most cases Django will be fast enough.


Slow compared to what? For what use cases?


Where was this?


Somewhere in Rajasthan.


Isn’t that how any Go project is put online?


The key distinction here seems to be that it uses an in-core database (SQLite), whereas a Go project may depend on an out-of-core database like Postgres, which would then also have to be deployed.

For many projects (especially hobby projects where downtime is tolerable), the former is probably quite sufficient.
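
As a rough illustration of what "in-core" buys you, here is a sketch using Python's built-in sqlite3 module (Python rather than Go, purely for brevity). The notes table and the helper names open_db, add_note, and list_notes are invented for the example. The point is that the database runs inside the application process, so there is no separate server to deploy:

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database; a file path would persist it to disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def add_note(conn, body):
    """Insert one row inside a transaction and return its id."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
    return cur.lastrowid


def list_notes(conn):
    """Return all note bodies in insertion order."""
    return [row[0] for row in conn.execute("SELECT body FROM notes ORDER BY id")]


conn = open_db()  # in-memory here; use e.g. "app.db" for a real deployment
first_id = add_note(conn, "first")
add_note(conn, "second")
notes = list_notes(conn)
print(first_id, notes)
```

With Postgres the equivalent code would need connection details for a separately running server, which is the extra deployment step being discussed.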


There’s a lot of praise for Rust in these comments.

I always understood Rust to be for low level “close to the metal” sort of software. Is it at a point where it’s suitable for writing web applications?

I know it’s “possible” with frameworks like Rocket, but I’d like to know if Rust is at a point where it can compete in the web app space with Rails, Go, Node/Express, etc.


It sure can, but you can't be as sloppy as you can be with Ruby, Python, JavaScript, or Go.

It's a high-level language, it just has its own rules that encourage correctness and prohibit sloppiness (which is a de-facto standard in the webdev industry, especially when prototyping rapidly by throwing shit at the wall and then letting whatever stuck live in production until it's no longer manageable).

For better or worse, Rust simply doesn't forgive a lot of things that are easy to do elsewhere, like not caring about less probable scenarios.

And, again, for better or worse, you also have to satisfy the borrow checker, where in other languages there's simply no such thing. Sometimes that is as easy as calling clone() (not always a good idea), but sometimes it can be quite a headache thinking about value lifetimes and how you just can't have something somewhere else (which can be super subtle, so you wouldn't normally think about it in other languages with GC).


I’ve read opinions, perhaps presented as facts, that transforming commercial buildings into residential units was often more expensive than tearing down the whole building and starting from scratch.

If that’s true, there seems to be a large opportunity available to those who can figure out how to transform those buildings in a way cheaper than what’s currently possible.


I seriously doubt that it's possible, unless you can get the city/state to change the building codes dramatically for the worse. Residential units need various things that commercial buildings don't, like windows (for each unit, not just at the periphery of the huge building), lots of plumbing so everyone can have a kitchen and bathroom, etc. Retrofitting buildings (to be very different than before, not just changing some fixtures and keeping the overall design intact) is generally more expensive than just demolishing and building new ones that are properly designed for that purpose. Just look at cars: even restoring an old car costs a fortune in labor costs, because it's far less efficient than building a new one on an assembly line, and that isn't changing the car design substantially. Imagine what it would cost to transform a 1950s car into one that meets all 2023 safety and emissions standards; even if you can reuse an existing engine/transmission, there's a ton of custom work needed, both mechanical and electrical.


Or just tear them down and build something new.

The main issue with office-space-to-apartment conversions is the deep floor plan that makes access to sunlight difficult. Retrofitting plumbing and isolated fireproofing probably adds more complexity. In terms of malls though, it's perhaps less of a problem to turn these into low-rise residential.

But at the end of the day, it's just land waiting for re-use. I'd be happy for these spaces to become parkland and forest. Or maybe there will be something in the future that needs a lot of land but benefits from being located somewhat closer to populations?


What sort of API? I’m working on another real estate project and we’re focusing on generic listings APIs. Would love to know your use case.


My research is in urban planning so I mostly use city/state/federal data and occasionally data from court cases.

Most of that data is sparsely updated and I'm not clued in on many real estate APIs to track up-to-date trends in housing and prices so I would love to know more about what your work looks like!


What? There are tons of API businesses. Many of them are massive companies.


I’m doing this now. Let me know if you need some recommendations or want to meet up (assuming you’re cool).

Definitely recommend it.

