Eventually we will get AI-based title analysers that simulate the ranking conditions to predict and optimise where your title will land in the front-page listing order.
I absolutely want to read some of these. There needs to be a way for people to post and rank the best-matching articles that actually exist. I dream of the day that generative AI is repurposed to act as a more effective search engine.
Unfortunately it doesn't work anymore, but the titles it generated for the papers were all plausible but also ridiculous CS paper titles, e.g. "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy"
I got "Install the NuGet package manager on a Mac" which I'm still not sure is ridiculously infeasible or the kind of hack somebody might actually manage to pull off. Definitely HN-worthy if they manage it!
Is there a dataset of HN titles? This made me want to fiddle with this, but step one is to get the data, and I don't want to crawl HN if the data has already been collected.
There's an API[0], but it's frustratingly limited in capabilities (albeit not rate-limited). You'll have to iterate over all post IDs, download each post as JSON, and get the titles that way.
There's also a Google dataset but I don't know the URL for it or if it's up to date.
What's missing precisely? Seems to be good enough for every use I could think of.
One time I even downloaded every single item from it with a threaded fetcher (16 threads, I think), iterating from 1 up to the latest ID, and it was done in something like two hours.
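That bulk-fetch approach can be sketched roughly like this, assuming Node 18+ for the built-in fetch. The endpoint URL is the real Firebase one, but the function names, the injected fetcher, and the 16-wide batch default are illustrative assumptions, not anyone's actual script:

```javascript
// Fetch a range of item IDs, a batch at a time. `fetchOne(id)` is
// injected so the batching logic is independent of the transport.
async function fetchRange(first, last, fetchOne, batchSize = 16) {
  const items = []
  for (let id = first; id <= last; id += batchSize) {
    const hi = Math.min(id + batchSize - 1, last)
    const batch = []
    for (let i = id; i <= hi; i++) batch.push(fetchOne(i))
    items.push(...(await Promise.all(batch)))
  }
  // Deleted or missing items come back null / flagged; filtering has
  // to happen client-side because the API can't do it server-side.
  return items.filter((it) => it && !it.deleted)
}

// A real per-item fetcher against the actual HN Firebase endpoint:
const api = "https://hacker-news.firebaseio.com/v0"
const fetchItem = async (id) => (await fetch(`${api}/item/${id}.json`)).json()
```

You'd call it as `fetchRange(1, maxId, fetchItem)`; at 16 concurrent requests the couple-of-hours figure above seems about right for tens of millions of items.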
No ability to directly download threads with a single request, for one, or query it like a database to sort or filter results, exclude unwanted fields, etc.
Those things should be trivial to achieve with most general-purpose languages, as the API is so simple. No need for pagination or other things: just request items by ID recursively and you get the full thing, then after that filter/select whatever you want.
Pseudo-code to show how simple it would be:
const api = "https://hacker-news.firebaseio.com/v0"
async function get_thread(id) {
  let item = await (await fetch(`${api}/item/${id}.json`)).json()
  // the API calls the child-comment ID list "kids"
  if (item && item.kids) {
    item.fetched_children = await Promise.all(item.kids.map(get_thread))
  }
  return item
}
(untested, but you don't really need more than that, besides checking if the item was deleted)
Yes, and I've done it. But you still wind up having to make a separate request for each item, which makes building threads incredibly slow. It's also a waste of time if you're filtering anything out, because you still have to make the request and download the item just to discard it.
Which is why it would be preferable for the API itself to support these features.
It averages around $0.0005 per request according to the footer. Could this end up costing the author quite a bit due to HN traffic? Also, what's stopping bad actors from writing a script to continuously fetch the page?
I wonder if some sort of caching might help lower costs.
There's intentionally no caching, every batch is warm from the AI oven.
HN usually drives around 10k visits, so organically it's going to be well within my Saturday night budget. If someone decides to hammer it, well, the OpenAI account has a hard limit of $20/month. It will live until it's killed I guess.
Nothing. And in fact that's exactly what happened to me. Some fella from HN spawned something like 45 simultaneous wgets in a loop to cause maximum financial damage. All of a sudden we saw Firebase's cost graph go vertical.
It happened after I mentioned “just be kind, please! Theoretically this could cost a lot of money.”
So there’s at least one person who will do exactly this just for fun.
Firebase customer support was super cool about it, but it still knocked us off the paid tier.
On mobile, after clicking the link and getting distracted for a second, I could not tell that I was not on HN! I even tried to click multiple links, until I reached the bottom of the page and remembered!
Right. The only reason I didn't get confused for long is because I use a userstyle to render HN in dark mode, and this one displayed "plain". But my immediate reaction wasn't "hey, that's fake HN", but "hey, why is this HN tab rendering in default style?!".
Would love to see it go a step further and do the same thing for comment threads as well. Think I could possibly lose hours on a site like that rather than minutes. This was a lot of fun as well though.
"Ask HN: What's your strategy for winning the lottery?"
"Computer Science at MIT Does Not Exist."
Still others I would use as writing prompts for a blog. It would be about things an ML model produced because it thought this was what you would think would be popular. (I defy any ML model to craft a more painful sentence.)
"The first hour of a movie is often worth the whole movie"
"Transhumanist manifesto: explore my visions for the future"
"Fill up on technology. Technology freaks out. Tell your kids to build a crisis framework"
Super easy: I just took 10k titles/comments/points from the Algolia API, formatted them as JSON Lines like the following with jq, and fed them to the very well built and documented openai CLI.
{"prompt": "A plausible Hacker News title:", "completion": " The Feynman Lectures on Physics (1964) (280 points, 62 comments) END"}
The space at the beginning of the completion is for tokenizing, and the END token is for use as a stop token in the generations.
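For anyone not fluent in jq, the same formatting step can be sketched in JavaScript. The prompt/completion shape and the END stop token are taken from the line above; the input field names (title, points, num_comments) match what Algolia's HN search API returns for stories, but the function name is made up:

```javascript
// Turn one Algolia story object into one JSONL fine-tuning line.
function toJsonlLine(story) {
  return JSON.stringify({
    prompt: "A plausible Hacker News title:",
    // Leading space helps tokenization; END serves as the stop token.
    completion: ` ${story.title} (${story.points} points, ${story.num_comments} comments) END`,
  })
}
```

Mapping this over the hits and joining with newlines gives a file the openai CLI can take directly.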
Show HN: Daily XLSX to-do list with attached spreadsheet as material*
Banning JavaScript from web pages is bad for the user
Has Google become too social?
SQLite development from scratch from scratch
Armor-piercing lasers are not shooting lasers but missiles
We made a public blockchain off-chain
Apple sued for pricing user data against provider who did not provide refunds
100% Embarrassing Haskell Builds
Heroku Compose is not fit for purpose
Pain Enhancer
The distance between reality and fantasy is grows ever smaller.