"Hey we have this gazillion flop/s available, shall we try to put it to some good use?" - "Nah, let's just waste them recovering information that was lost because it's so much simpler to share a screenshot instead of text or even html."
Sometimes, I wonder if the hardware industry sponsors a certain subset of software developers to make sure every possible performance gain is immediately wasted. Electron was a strong hint, but this is a whole new level...
I think they actually have a point. Parsing HTML isn’t trivial: aside from bad/invalid HTML (think: missing opening/closing tags, quotes, etc), there’s also a lot of content that requires javascript to render in the first place, for example, which means the page needs to be rendered and have access to window and DOM, etc.
Then, of course, you have embedded objects, such as iframes, or content that isn't text and that traditional parsing can't easily identify/extract, even if everything is rendered OK: for example, video, animations, interactivity, etc. Another contrived example which illustrates the point would be a site that uses images for buttons/links rather than text, or even non-semantic HTML such as a <div> with click handlers acting as "links", or buttons built from <a> tags.
I can think of several other examples.
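To make the non-semantic HTML case concrete, here is a minimal sketch using only Python's standard-library html.parser. The sample markup and the LinkFinder class are made up for illustration. It shows that a parser looking only for <a href> copes fine with sloppy markup (unquoted attributes, unclosed tags) but never sees a <div> that behaves like a link unless you add special handling for it.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collects <a href> targets and, separately, elements with an inline onclick."""

    def __init__(self):
        super().__init__()
        self.anchors = []         # what a naive "find all links" pass would return
        self.click_handlers = []  # non-semantic "links" the naive pass misses

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.anchors.append(attrs["href"])
        elif "onclick" in attrs:
            # Only inline handlers are visible here; handlers attached from
            # JavaScript (addEventListener) leave no trace in the static markup.
            self.click_handlers.append((tag, attrs["onclick"]))


# Hypothetical page fragment: one real link, one div acting as a link,
# an unclosed <p>, and an anchor with an unquoted href and no closing tag.
SAMPLE = '''
<a href="/pricing">Pricing</a>
<div class="btn" onclick="location.href='/signup'">Sign up</div>
<p>Unclosed paragraph
<a href=/docs>Docs
'''

finder = LinkFinder()
finder.feed(SAMPLE)
finder.close()

print("anchors:", finder.anchors)
print("click handlers:", finder.click_handlers)
```

The sign-up "link" only shows up because the handler happens to be inline; move it into a script file and static parsing has nothing to go on.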
A "screenshot API" enables automation to capture "picture of a group of people celebrating" distinct from advertisements that might also appear on the page without the need to dig through CSS selectors/classes, domain name/query parameters, and handles cases where the image might be base64 embedded directly. Another example might be to simply/easily extract data from a Google Sheets/Excel table, which itself might include embedded images or non-text/HTML objects. Such an API could help accessibility by enabling screen readers for sites that weren't built to be accessible.
I learned from this post that you can tap/copy objects from photos in iOS 16! I just did this for the first time and it’s CRAZY, using a photo of my basement and a bunch of shoe boxes, tools, etc. I pressure tapped the group of shoe boxes and hit copy, pasted into the Bear app (a markdown editor for Apple devices), and it pasted just the shoe boxes, perfectly clipped around the edges as if I’d used photoshop!
I think ultimately, the point is that such an API which uses object detection, image-to-text, sentiment analysis, etc., on the backend could make trivial the tasks and edge cases that today require non-trivial effort and time, and could enrich the data prior to its retrieval.
> Parsing HTML isn’t trivial: aside from bad/invalid HTML (think: missing opening/closing tags, quotes, etc), there’s also a lot of content that requires javascript to render in the first place, for example, which means the page needs to be rendered and have access to window and DOM, etc.
Double standard. If you're going to make a fair comparison, then you need to compare like with like; you need to compare the subset of things about e.g. HTML that give you what you can also get with a screenshot. It makes no sense to hold the performance penalty of script execution against browser runtimes when (a) you don't have to execute any scripts to effect anything that gives you parity with a static image, and (b) you can't with static images do anything like what executable scripts enable.
And whether or not parsing HTML is trivial (which is debatable), it's still not strictly greater than the computational resources that are needed for the kind of computer vision and widgetry that lets you e.g. select the text in a screenshot...
Perhaps you could think of it like operating at a different abstraction layer, a sort of overlay network for social communication. It's easier to screenshot new technology than it is to make it interoperable with old technology, so there will always be a lead time where screenshots are the dominant sharing method driving a new hype cycle. Look at ChatGPT for example - they launched with no share feature, and people were sharing screenshots of their conversations. It's probably still the most common method of sharing them, even after OpenAI added a share feature. Screenshots have the advantage of working everywhere and being cheap to produce.
The loss of information is a solvable problem. In fact if you take a screenshot on an iPhone, there is already quite a bit of information stored along with the photo, but it's removed when using a WebKit file input (<input type="file">) to upload a photo to a website. Also, Safari has a "screenshot full page" option, meaning it already has a specific implementation of the screenshotting for its specific use case of taking a picture of a likely text-heavy Web page (which the same program taking the screenshot already rendered for the user anyway).
On the receiving side, Apple's automatic device-local OCR, which already runs on every photo in your camera roll, could increase fidelity of information in a screenshot that you receive. Apple could add a feature for your device to locate the original URL of a screenshot you receive, using both the metadata of screenshots generated on other Apple devices (if shared), and the results of searching the Web for text in an image.
This feature could have high accuracy for the limited cases where the content is available on the public web rather than on some ephemeral screen inside an app. But in fact, such cases are even more of a reason that screenshotting will remain a dominant way of storing and sharing information. They're completely decentralized and interoperable, with acceptable tradeoffs that double as advantages - like robustness (screenshot of content will survive longer than a server will continue serving that content), and privacy (the server of the original content doesn't get notified when you read it in a screenshot).
Not only will people not mark up their data in useful ways, often they're actively hostile to you using or remixing it, but the screenshot will always get through. Unless the display-driver level DRM gets involved.
A few years ago there was talk of having cameras shut down when they detect watermarked content, to close the analog hole. Luckily it didn't go anywhere, but start hoarding cameras!
"Unstoppable" only as long as camera manufacturers don't add the stop. You cannot make a color photocopy of (at least European) paper currency. That sort of thinking could be pushed to cameras too.
(Though the original EURion was more about counterfeiting than about Hollywood studios.)
Or even better, a way of reproducing the work of Harmy's Despecialized Edition of Star Wars. You have a lot of frames of video in the blu-ray reissues that didn't really differ in content from the frames of video in the old laser disc, VHS, bad interlaced DVD transfers. I imagine it wouldn't be hard for AI to fill in the gaps to produce despecialized high-definition frames from the old Star Wars transfers.
In other cases, I'd imagine you could use old post production NTSC broadcast and the OG 35mm films to do something much more akin to how Star Trek TNG was updated, than a simple interpolated upsampling.
Dubious. Cloudflare and friends promote a culture of demonizing bots. For this to work you'd need to put as much energy into pretending to be a human as parsing the screenshot.
At least with a dedicated API you declare that you aren't hostile to a user who wants to use the website on their own terms.
> Cloudflare and friends promote a culture of demonizing bots
What does this mean? Usually it's the website owners who are hostile to scraping for commercial reasons, or simply bandwidth, to which Cloudflare is the solution; it's not some prejudice which has to be marketed to people.
> We know a third of web traffic comes from bots with their insatiable appetite for attacks. From credential stuffing, to stealing inventory, to price and content scraping, stopping bots is critical to a strong web experience for customers that is not undermined by bots.
This seems like classic demonizing tactics to me. Generalizing a whole class of users as an out-group, associating them with negative behaviour, painting them as criminals.
> Generalizing a whole class of users as an out-group, associating them with negative behaviour
Isn't this the reverse: the only criterion for this "out group" is behavior. Specifically that some people use a large number of automated clients to engage in behavior which the site owner regards (rightly or wrongly) as harmful. I can see that you're trying to draw racism analogies but that's not going to work.
And ("stealing inventory") sometimes this is at the expense of other customers, who might want to buy at the RRP from the official website rather than have to go through scalpers.
They added explicit qualifiers to “bots” that make apparent they are referring to abuse, not “bots” in the more general sense, such as a search crawler/indexer.
Do you disagree that credential stuffing, scalping, content-scraping (to build fake social media profiles, phish, and scam advertisers/consumers) are problems and that bots perform those activities?
Your defensiveness doesn’t seem justified, to me, unless you’re using bots for those purposes.
This is marketing copy so I assume that the sentence was deliberately constructed.
> We know a third of web traffic comes from bots with their insatiable appetite for attacks.
I read that run-on "with their" as a universal quantifier. There are no bots in that sentence that do not have an insatiable appetite. They could have added that quantifier to indicate the subset of bots they refer to:
> We know a third of web traffic comes from THOSE bots WITH AN insatiable appetite for attacks.
or even, simpler:
> We know a third of web traffic comes from bots with AN insatiable appetite for attacks
That's how I read the grammar. Language can be slippery, and can be made slipperier. If we read it different ways we read it different ways.
To your question, of course those things are bad, and of course people use bots to do this. They also use web browsers and humans to do it. Some bots are bad. Some people are bad.
No. Mobile screenshots are a horrendous way to pass information around.
My father would send me screenshots of Amazon product listings and suggest I buy the item. There was never a visible URL, and typically no or only a small fraction of the product name. If he'd just sent a link, I could just click it, instead it became an investigation/research project.
Better yet was when the listing was obviously not Amazon, and not a site I could identify, or had ever seen before. I'd get zero context clues. Perhaps an AI could figure it out, but again, I'd rather he just sent a link.
When I get a screenshot, I'll have to submit it to some service, of which there will be a dozen different services, each requiring an account, each building a profile and tracking me based on whatever crap my dad sent me.
Unfortunately not, certainly not on mobile. I don't know if this is the case on iPhone, but on Android some apps make it really hard or impossible to screenshot them.
I think in recent Android versions they can even prevent the screen from getting mirrored to a connected PC / other casting device.
Even on Windows and Mac (don't know about Linux) when browsing, you sometimes can't take a screenshot of a portion of the website (only seen this with videos, though).
We are not allowed to take a screenshot on a device we own, for content we paid for. What a world.
I think this has to do with the content in question being rendered on the GPU separately from the website itself. Historically I recall this being an issue with all forms of video display in some browsers but at this point (with website rendering generally being at least in part GPU accelerated) it shouldn't be an issue I think unless for some reason the browser is set up to use software rendering for the website itself.
I do faintly recall not being able to see video player content in screenshots taken on Ubuntu (but the player itself could take snapshots just fine) because of a divide like this some years ago.
A banking app on Android showed multiple obscure error messages when I tried to log in, and reporting the error was a PITA because they forbid screenshots.
I know sharing a screenshot of my bank statement is a bad idea, but forbidding it completely is a ridiculous extreme.
You prevent that by stopping any software in userland from being able to access the video path, which is an all or nothing thing.
Accessibility is only required if the app doesn't have system permissions, or if it's not carried out a privilege escalation. Though granted a privilege escalation might also be able to gain root to turn off setting a secure surface too (that's not enough to get past Widevine if that's been used but banks generally are not).
> I can vaguely understand the security concern, but it's my device, running my OS, why can't I make a simple screenshot?
It's not about screenshots, the point is that screen recording is blocked because the OS can't see the encrypted video path. As it can't there's nothing to take a screenshot of.
(Indeed, you can take a screenshot. It just comes out with all the video data nulled, so it's a black square.)
With some loss of fidelity, convenience and reaction time, you can always just point a camera at the screen. If no other device is available, a mirror might help.
I think my phone will take a photo without bringing up the camera app UI with a certain button combination. But I am not sure if I could make this use the front camera.
EDIT: I stand corrected, the button combination does not work when the phone is unlocked and it always uses the rear camera. But I guess at least in principle one could build this.
The silliest example of this I've seen is "private browsing" mode not allowing screenshots. The point of the mode for me and everyone I know is to have isolated browser sessions for the websites I visit and not have the browser add them to my browsing history. Not to shield what I'm seeing from the OS itself when I try to disclose it.
Maybe not, but it’s all the same if they can make the screenshot useless. Using Netflix as an example: when you take a screenshot video controls will be visible, but the content behind them won’t be.
> Screenshots are always available (similar to the era of web crawling)
DRM enters the room…
But on a serious note. I like the idea of using screenshots as a form of storage, but a lot of metadata is lost in the process like data hierarchy, data available in different visual states etc.
I was skeptical about this concept when I first heard it, until I actually tried some of the latest image models.
Now, I’m fully bought in that this is correct.
There’s a model you can use on hugging face, for example, where you can feed it any PDF or image of a document, ask it a query (“What is the total of the invoice?”), and it just spits out the right answer.
Turns out decades of work trying to make universal data interoperability standards will most likely be replaced by screenshots and images! (Again, this is for inter-app data movement)
From an invoice in a standardized XML format (some countries require this by law), the total amount can be extracted in milliseconds, if not microseconds. How long does it take to do this from an image?
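For a sense of what the structured path looks like, here is a rough sketch using Python's standard library. The invoice schema below (Invoice, Line, Total, the currency attribute) is invented for illustration and is not any country's mandated format; real e-invoicing formats are larger and namespaced, but the lookup is the same idea. The script measures its own elapsed time rather than assuming a figure.

```python
import time
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical, simplified invoice; element names are assumptions.
INVOICE_XML = """<Invoice>
  <ID>INV-001</ID>
  <Line><Description>Widget</Description><Amount>19.99</Amount></Line>
  <Line><Description>Shipping</Description><Amount>5.00</Amount></Line>
  <Total currency="EUR">24.99</Total>
</Invoice>"""


def invoice_total(xml_text):
    """Parse the invoice and return (total amount, currency code)."""
    root = ET.fromstring(xml_text)
    total = root.find("Total")
    # Decimal avoids binary floating-point rounding on money values.
    return Decimal(total.text), total.get("currency")


start = time.perf_counter()
amount, currency = invoice_total(INVOICE_XML)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{amount} {currency} extracted in {elapsed_ms:.3f} ms")
```

There is no model, no GPU, and no chance of misreading a digit; the cost is that both sides have to agree on the schema up front.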
>"Easier to parse than highly complex layout formats."
I disagree, the complexity is just hidden inside the model you have to train. What about a low resource device that could have difficulties in running the model? And how do you handle the mistakes that the model will make?
I recently saw a post on Mastodon where someone figured out they could build a simple Automation[1] on their iPhone to assist them in avoiding foods that contain allergens. They take a picture of the label, then the automation OCRs the text and searches for $ALLERGEN.
It seems the technology for making "screenshot APIs" a less zany proposition is emerging.
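The "search for $ALLERGEN" half of that automation is easy to sketch. The snippet below is not the Mastodon poster's Shortcut; it is a minimal Python illustration that assumes the OCR step has already produced plain text. The function name, the label text, and the allergen list are all made up.

```python
import re


def find_allergens(label_text, allergens):
    """Return the allergens that appear as whole words in OCR'd label text."""
    # Normalize case and collapse the line breaks OCR tends to introduce.
    normalized = " ".join(label_text.lower().split())
    found = []
    for allergen in allergens:
        # Whole-word match with an optional plural "s" ("hazelnut" / "hazelnuts").
        pattern = r"\b" + re.escape(allergen.lower()) + r"s?\b"
        if re.search(pattern, normalized):
            found.append(allergen)
    return found


# Hypothetical OCR output from a photo of an ingredients label.
ocr_text = """INGREDIENTS: Wheat flour, sugar, palm oil,
HAZELNUTS (13%), skim milk powder, cocoa, soy lecithin."""

hits = find_allergens(ocr_text, ["peanut", "hazelnut", "milk", "egg"])
print(hits)
```

This does nothing about OCR misreads ("mi1k") or ingredients listed under other names, so for anything safety-critical an empty result should not be read as "safe".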
Talked with a founder building something in this space that was inspired by Matt's post in October. lynq.ai (no affiliation, I just know Paul).
I think when you look at this problem originally, you say - that's a bad idea. Why would you take structured data, output it to non-structured format and then have ML parse that. Lots of wasted CPU cycles all around.
However, when you think about the complex dynamics of standards around documents, tens of years of digital formats, hundreds of standards, lack of adherence to those standards, proprietary formats, hundreds of years of print and legal documents, the argument is akin to self-driving cars.
The state we have today around data & dashboards is a hugely emergent & dynamic system, just like our road transport infrastructure. We are closer to a machine being able to navigate the same way a human can than we are to one simple standard (or one set of standards) that works the way a machine would want to consume it.
Screenshots as a universal API simply meets the world where it is vs assuming the world is going to change towards something simpler and more elegant.
I think part of the problem with how this comes across at first glance is how it's framed. "Screenshots" as an API evokes some dirty feelings for most of us in tech because the format is so unstructured.
I think if you think about the idea of building something once that both a human and a machine consume from the same target (the UI), this makes a lot more sense in many ways even if it feels like there's an expensive level of indirection in there.
This seems like a last resort API. I would much rather have the information in a more convenient form than a screenshot... Also, some apps (e.g. banking apps), do not allow you to take screenshots on certain devices.
I've actually been building something like this: a desktop app that basically records everything you see and hear, and makes it searchable. Maybe we could team up: my email is govind <dot> gnanakumar <at> outlook <dot> com.
The Arc browser has a feature called Easels that does this; I quite like it even though it lacks some polish. I imagine they want to do more with it in the future.
The browser is in beta and macOS only currently, but a friend recently invited me. E-mail me (in profile) if you'd like an invite to explore what they've done with it.
I can give out 5 invites I think, if anyone else wants it. So, first come first served!
Print-to-PDF, build your own local Internet .. best bookmark system ever, literally doesn't require anything to be installed, just put the resulting PDFs in folders and use pdfgrep and ag to your heart's content .. "ls -alF | grep thatarticlesubjectiremembervaguely"
They forgot to mention that the website owners may deliberately make it difficult to interface with the HTML directly, e.g. by changing the internal formatting structure every now and then.
At this point, AI also solves this problem with pretty reasonable accuracy. You can feed gpt3.5 some html and ask it to write a python script to parse all of the button text.
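For reference, this is roughly the kind of script you would hope to get back from such a prompt. It is hand-written, not actual model output, and it sticks to Python's standard-library html.parser; the sample form and the ButtonTextParser name are made up for illustration.

```python
from html.parser import HTMLParser


class ButtonTextParser(HTMLParser):
    """Collects visible text of <button> elements and the value of button-like inputs."""

    def __init__(self):
        super().__init__()
        self.buttons = []
        self._in_button = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "button":
            self._in_button = True
            self._chunks = []
        elif tag == "input" and attrs.get("type") in ("submit", "button"):
            # <input type="submit" value="..."> keeps its label in an attribute.
            if attrs.get("value"):
                self.buttons.append(attrs["value"])

    def handle_data(self, data):
        if self._in_button:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "button" and self._in_button:
            self._in_button = False
            # Join text from nested tags and collapse whitespace.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.buttons.append(text)


# Hypothetical markup to parse.
SAMPLE = """
<form>
  <button type="submit"><span>Save</span> changes</button>
  <input type="submit" value="Cancel">
  <button>  Delete </button>
</form>
"""

parser = ButtonTextParser()
parser.feed(SAMPLE)
parser.close()
print(parser.buttons)
```

It still won't catch a <div> styled as a button or labels that are images, which is the gap the screenshot approach is meant to cover.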
I use this technique to build a personal dashboard. Rather than try to scrape data, then come up with a nice presentation for it, I just find a nice representation on the web for the data I want on the dashboard, then use Puppeteer[1] to automatically screenshot the specific DOM element that contains the thing I want. Works like a champ.
It’s actually possible to at least make it harder to do screen grabs on iOS. But you can always take a photo of the phone’s screen using your other iPhone :)
I suppose this is satire? I mean, how would I call such an API to make any changes to the application state? Create an image of the effect that I would like to see? Am I missing something here, or is the joke just lost on me?
If this is serious and the idea is to use this only for quick and dirty read-only access, I'll stick to my time-tested "select+copy" API for now. Usually does a better job at extracting what I care about (particularly for content longer than one screen page). Yes, app owners can make that harder if they want to, but the same is true for screenshots.
> Screenotate is an app for macOS and Windows that might help you with your screenshots. Every time you take a screenshot, Screenotate steps in to recognize and save the text inside (using Optical Character Recognition), along with the URL and the title of the place where you took the screenshot (where possible).