For a co-pilot inside an app that could answer product questions, I looked at ~2000 support emails. I asked one LLM to dig out "How would you formulate the user's question into a chatbot-like question from this email thread?" and "What is the actual answer that should be in the response from this email thread?", then asked our bot that question and had another LLM rate the answer as SUPERIOR | ACCEPTABLE | UNKNOWN etc. These labels proved to be a good "finger in the wind" indicator when altering the chunks, changing prompts, or updating models.
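For what it's worth, a minimal sketch of that loop; `llm()` and `bot_answer()` are hypothetical stand-ins for whatever client and bot you actually use:

```python
from collections import Counter

LABELS = ["SUPERIOR", "ACCEPTABLE", "UNKNOWN"]  # plus whatever else you grade with

def llm(prompt: str) -> str:
    """Stand-in for an LLM API call."""
    raise NotImplementedError

def bot_answer(question: str) -> str:
    """Stand-in for the co-pilot under test."""
    raise NotImplementedError

def build_eval_case(email_thread: str) -> dict:
    """LLM #1 distills a chatbot-style question and a reference answer."""
    question = llm(
        "How would you formulate the user's question into a chatbot-like "
        f"question from this email thread?\n\n{email_thread}"
    )
    reference = llm(
        "What is the actual answer that should be in the response "
        f"from this email thread?\n\n{email_thread}"
    )
    return {"question": question, "reference": reference}

def judge(case: dict) -> str:
    """Ask the bot, then have LLM #2 grade the answer against the reference."""
    answer = bot_answer(case["question"])
    return llm(
        f"Rate the candidate answer against the reference as one of {LABELS}.\n"
        f"Question: {case['question']}\n"
        f"Reference: {case['reference']}\n"
        f"Candidate: {answer}"
    ).strip()

def run_evals(threads: list[str]) -> Counter:
    """The label distribution is the 'finger in the wind' after each change."""
    return Counter(judge(build_eval_case(t)) for t in threads)
```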
For an invoice processing app handling about 14M invoices/year, it was mostly computing fuzzy accuracy metrics against a pretty decent annotated dataset and iterating the prompt based on diffs, for a long time. Once you had that dataset, you could alter things and see what broke.
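The metric was roughly this shape; the field names and the 0.9 similarity threshold below are illustrative, not the real ones:

```python
from difflib import SequenceMatcher

FIELDS = ["invoice_number", "total", "due_date", "vendor"]  # illustrative fields

def fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """A field counts as correct if it is 'close enough' to the annotation."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio() >= threshold

def score(predictions: list[dict], annotations: list[dict]) -> dict:
    """Per-field accuracy, plus a diff list you can eyeball after a prompt change."""
    correct = {f: 0 for f in FIELDS}
    diffs = []
    for pred, gold in zip(predictions, annotations):
        for f in FIELDS:
            if fuzzy_match(str(pred.get(f, "")), str(gold.get(f, ""))):
                correct[f] += 1
            else:
                diffs.append((f, pred.get(f), gold.get(f)))
    n = len(annotations)
    return {"accuracy": {f: correct[f] / n for f in FIELDS}, "diffs": diffs}
```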
Currently, I work on an app with a pretty sophisticated prompt chain flow. As bugs come up, we add tests against _behaviour_, like intent recognition or the correct SQL filters. As long as the baseline exhibits the correct behaviour, whichever model is powering it doesn't matter much. For the final output, it's humans. But we know immediately if some model or prompt change broke some particular intent.
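Those behaviour tests look something like the following; `classify_intent()` and `build_filters()` are hypothetical names for the chain steps, and the expectations are made up:

```python
import pytest

def classify_intent(utterance: str) -> str:
    """Stand-in for the intent-recognition step of the chain."""
    raise NotImplementedError

def build_filters(utterance: str) -> dict:
    """Stand-in for the step that produces the SQL filter parameters."""
    raise NotImplementedError

# Pin the expected intent, not the exact wording of the final answer.
@pytest.mark.parametrize("utterance,intent", [
    ("show me unpaid invoices from March", "list_invoices"),
    ("how much did we spend with vendor X?", "spend_summary"),
])
def test_intent_is_stable(utterance, intent):
    assert classify_intent(utterance) == intent

def test_filters_for_unpaid_march():
    filters = build_filters("show me unpaid invoices from March")
    assert filters["status"] == "unpaid"
    assert filters["month"] == 3
```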
This makes sense. I am particularly interested in your invoice processing app example because the accuracy of those outputs can be measured quantitatively, from 0% to 100%.
I'm curious what counts as _good enough_ and how many iterations it takes to get there. Is 100% the only acceptable threshold? If so, how many iterations does that take, and what does the process look like? If 100% accuracy is too difficult to reach, how do you choose your minimum acceptable threshold (is 95% accuracy good enough? Is 90%?)? Do you have a dedicated set of outputs and documents used for evals? I'd love to hear more about this example (if you worked directly on the evals for this app).
I've been using Linux on all my PCs for a long time.
The experience is slowly getting better. There's nothing I haven't been able to get working, though some things needed tricks or adjustments.
I think the "best bonus" is using LLMs in deep research mode to wade through all the blog posts, Reddit threads, etc. to discover the aforementioned tricks and get something working. Before, you had to do that yourself, and it sucked. Now I get three good ideas from Claude, ranked by how likely each is to work => 99% of games I get running in 5 minutes with a shell command or two. Lutris is also pretty good.
Omarchy on my laptop has finally made computers fun for me again; it's so great and nostalgic. Happy to be back after my brief work-mandated adventure into macOS.
This guy starring my CHIP-8 implementation was a moment of pride for me. It was buggy, but before this guide there wasn't much material out there made for stupid people like me.
It's a great starter project for emulation. You'll see how all emulators work and, as a bonus, how interpreted languages work. Really recommend it.
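To give a taste, here is a minimal sketch of the CHIP-8 fetch-decode-execute loop, the part that makes emulators "click" (only a few of the 35 opcodes are shown):

```python
memory = bytearray(4096)   # CHIP-8 has 4 KiB of RAM
V = [0] * 16               # sixteen 8-bit registers, V0..VF
pc = 0x200                 # programs are loaded at address 0x200

def step():
    global pc
    # Fetch: each opcode is two bytes, big-endian.
    opcode = (memory[pc] << 8) | memory[pc + 1]
    pc += 2
    # Decode: mask out the fields the instruction families use.
    x = (opcode & 0x0F00) >> 8
    nn = opcode & 0x00FF
    nnn = opcode & 0x0FFF
    # Execute: a few example instructions.
    if opcode & 0xF000 == 0x1000:      # 1NNN: jump to address NNN
        pc = nnn
    elif opcode & 0xF000 == 0x6000:    # 6XNN: set VX to NN
        V[x] = nn
    elif opcode & 0xF000 == 0x7000:    # 7XNN: add NN to VX (wraps, no carry flag)
        V[x] = (V[x] + nn) & 0xFF
```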
Great! I worked a lot with Parquet about 5 years ago. The frustration and tilt from working with the tooling were immense. Thank you for building this; it feels like it resolves some old knot in my soul.
Some kind soul made this repository back then, and I found it on something like the 13th page of Google while in the depths of despair. It is my most treasured GitHub star, the shining beacon that saved me. I see it has saved 17 other people too.
My takeaway is that the combination is the problem. Bleach and ammonia aren't so bad on their own, but mixing the two is a bad idea. MCP would provide crazy attack vectors.
Especially if you could ask another AI, "I have access to an MCP server running on a victim's computer with these tools. What can you do with them?" => "Well, start by reading .ssh/id_rsa, and I'd look for any crypto wallets. Then you can move on to reading personal files for blackmail, or sniffing passwords..." and just let it "do its thing" as an attacking agent. The fact that this could be fully automated creeps me out!
My intuition tells me that blackmail at scale has the potential to be quite terrifying if you ask for favors that each seem innocent enough on their own. E.g., one favor may be as simple as asking a guy walking his dog to delay his walk by half an hour. He will surely comply without hesitation. But the hidden reason was that he would otherwise have witnessed a murder.
I did a presentation on AI agents from the perspective of an AI newbie, and one of my conclusions was that it felt like releasing a browser from 2000 into today's scary 2025 environment. MCP and similar protocols are missing 20+ years of responding to new and emerging threats, and the hype men (executives everywhere) don't realize it, don't care, or lack the ability to respond.
In the early days, it's always best to push security risk onto users in a bid to gain as much market share as possible. By the time users realize they've been screwed, the technology will have matured, and you can hand-wave those old criticisms away, even trumpeting the fixes as new innovations and upgrades.
I did my MSc thesis on document vectors of STE (Simplified Technical English).
STE has incredibly useful rules for technical communication and documentation, especially if you're a non-native English speaker like me. I wish it were more commonplace!! Documentation is usually horrible.
People should call it what it is. I tried to find some answers on Google earlier today, and the first pages of results were 100% generated slop. Funnily enough, any AI summary of the slop would be slop squared.
It's everywhere, and I hate it. What ways have people found to keep it out of their day?
It's infesting Google Image Search too. I tried to find a picture of a guitarist playing a certain guitar and got hits from “openart” where it looked like Kurt Cobain had been crossed with a sandworm.
I'm sure Facebook could get rid of it if they wanted to. The same way they could get rid of OnlyFans creators flashing their privates in short videos. The same way they could get rid of racists, transphobes, and other bigots. The same way Google could get rid of the 20-paragraph essays on basic facts you can find in Wikipedia.
It's just that getting rid of these things would make the platforms less engaging and cut into ad revenue. And they have nothing else to show instead, because authentic content doesn't bring in enough ad revenue.
So they can't actually get rid of them without collapsing the business model.