For a co-pilot inside an app that could answer product questions, I looked at ~2000 support emails. I asked one LLM to dig out "How would you formulate the user's question into a chatbot-like question from this email thread?" and "What is the actual answer that should be in the response from this email thread?", then asked our bot that question and had another LLM rate the answer as SUPERIOR | ACCEPTABLE | UNKNOWN etc. These labels proved to be a good "finger in the wind" indicator when altering the chunks, changing prompts, or updating models.
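For what it's worth, a minimal sketch of that loop; `llm()` and `bot_answer()` are hypothetical stand-ins for whatever client and bot you actually use:

```python
from collections import Counter

LABELS = ["SUPERIOR", "ACCEPTABLE", "UNKNOWN"]  # plus whatever else you grade with

def llm(prompt: str) -> str:
    """Stand-in for an LLM API call."""
    raise NotImplementedError

def bot_answer(question: str) -> str:
    """Stand-in for the co-pilot under test."""
    raise NotImplementedError

def build_eval_case(email_thread: str) -> dict:
    """LLM #1 distills a chatbot-style question and a reference answer."""
    question = llm(
        "How would you formulate the user's question into a chatbot-like "
        f"question from this email thread?\n\n{email_thread}"
    )
    reference = llm(
        "What is the actual answer that should be in the response "
        f"from this email thread?\n\n{email_thread}"
    )
    return {"question": question, "reference": reference}

def judge(case: dict) -> str:
    """Ask the bot, then have LLM #2 grade the answer against the reference."""
    answer = bot_answer(case["question"])
    return llm(
        f"Rate the candidate answer against the reference as one of {LABELS}.\n"
        f"Question: {case['question']}\n"
        f"Reference: {case['reference']}\n"
        f"Candidate: {answer}"
    ).strip()

def run_evals(threads: list[str]) -> Counter:
    """The label distribution is the 'finger in the wind' after each change."""
    return Counter(judge(build_eval_case(t)) for t in threads)
```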
For an invoice processing app handling about 14M invoices/year, it was mostly computing fuzzy accuracy metrics against a pretty decent annotated dataset and iterating the prompt based on diffs, for a long time. Once you had that dataset, you could alter things and see what broke.
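The metric was roughly this shape; the field names and the 0.9 similarity threshold below are illustrative, not the real ones:

```python
from difflib import SequenceMatcher

FIELDS = ["invoice_number", "total", "due_date", "vendor"]  # illustrative fields

def fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """A field counts as correct if it is 'close enough' to the annotation."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio() >= threshold

def score(predictions: list[dict], annotations: list[dict]) -> dict:
    """Per-field accuracy, plus a diff list you can eyeball after a prompt change."""
    correct = {f: 0 for f in FIELDS}
    diffs = []
    for pred, gold in zip(predictions, annotations):
        for f in FIELDS:
            if fuzzy_match(str(pred.get(f, "")), str(gold.get(f, ""))):
                correct[f] += 1
            else:
                diffs.append((f, pred.get(f), gold.get(f)))
    n = len(annotations)
    return {"accuracy": {f: correct[f] / n for f in FIELDS}, "diffs": diffs}
```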
Currently, I work on an app with a pretty sophisticated prompt chain flow. As bugs come up, we add tests against _behaviour_, like intent recognition or the correct SQL filters. As long as the baseline exhibits the correct behaviour, whichever model is powering it doesn't matter much. For the final output, it's humans. But we know immediately if some model or prompt change broke some particular intent.
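Those behaviour tests look something like the following; `classify_intent()` and `build_filters()` are hypothetical names for the chain steps, and the expectations are made up:

```python
import pytest

def classify_intent(utterance: str) -> str:
    """Stand-in for the intent-recognition step of the chain."""
    raise NotImplementedError

def build_filters(utterance: str) -> dict:
    """Stand-in for the step that produces the SQL filter parameters."""
    raise NotImplementedError

# Pin the expected intent, not the exact wording of the final answer.
@pytest.mark.parametrize("utterance,intent", [
    ("show me unpaid invoices from March", "list_invoices"),
    ("how much did we spend with vendor X?", "spend_summary"),
])
def test_intent_is_stable(utterance, intent):
    assert classify_intent(utterance) == intent

def test_filters_for_unpaid_march():
    filters = build_filters("show me unpaid invoices from March")
    assert filters["status"] == "unpaid"
    assert filters["month"] == 3
```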
This makes sense. I am particularly interested in your invoice processing app example because the accuracy of those outputs can be measured quantitatively, from 0% to 100%.
I'm curious what counts as _good enough_ and how many iterations it takes to get there. Is 100% the only acceptable threshold? If so, how many iterations does that take, and what does the process look like? If 100% accuracy is too difficult to reach, how do you choose your minimum acceptable threshold (is 95% accuracy good enough? Is 90%?)? Do you have a dedicated set of outputs and documents used for evals? I'd love to hear more about this example (if you worked directly on the evals for this app).
I've been using Linux on all my PCs for a long time.
The experience is slowly getting better. There's nothing I haven't been able to get working, though some things needed tricks or adjustments.
I think the "best bonus" is using LLMs in deep research mode to wade through all the blog posts, Reddit threads, etc. to discover the aforementioned tricks and get something working. Before, you had to do that yourself, and it sucked. Now I get three good ideas from Claude, ranked by how likely each is to work => 99% of games I get running in 5 minutes with a shell command or two. Lutris is also pretty good.
Omarchy on my laptop has finally made computers fun for me again; it's so great and nostalgic. Happy to be back after my brief work-mandated adventure into macOS.
This guy starring my CHIP-8 implementation was a moment of pride for me. It was buggy, but before this guide there wasn't much material out there made for stupid people like me.
It's a great starter project for emulation. You'll see how all emulators work and, as a bonus, how interpreted languages work. Really recommend it.
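To give a taste, here is a minimal sketch of the CHIP-8 fetch-decode-execute loop, the part that makes emulators "click" (only a few of the 35 opcodes are shown):

```python
memory = bytearray(4096)   # CHIP-8 has 4 KiB of RAM
V = [0] * 16               # sixteen 8-bit registers, V0..VF
pc = 0x200                 # programs are loaded at address 0x200

def step():
    global pc
    # Fetch: each opcode is two bytes, big-endian.
    opcode = (memory[pc] << 8) | memory[pc + 1]
    pc += 2
    # Decode: mask out the fields the instruction families use.
    x = (opcode & 0x0F00) >> 8
    nn = opcode & 0x00FF
    nnn = opcode & 0x0FFF
    # Execute: a few example instructions.
    if opcode & 0xF000 == 0x1000:      # 1NNN: jump to address NNN
        pc = nnn
    elif opcode & 0xF000 == 0x6000:    # 6XNN: set VX to NN
        V[x] = nn
    elif opcode & 0xF000 == 0x7000:    # 7XNN: add NN to VX (wraps, no carry flag)
        V[x] = (V[x] + nn) & 0xFF
```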
Great! I worked a lot with Parquet about 5 years ago. The frustration and tilt from working with the tooling were immense. Thank you for building this; it feels like it resolves some old knot in my soul.
Some kind soul made this repository back then, and I found it on something like the 13th page of Google while in the depths of despair. It is my most treasured GitHub star, the shining beacon that saved me. I see it has saved 17 other people too.
My takeaway is that the combination is the problem. Bleach and ammonia aren't so bad on their own, but mixing the two is a bad idea. MCP would provide crazy attack vectors.
Especially if you could ask another AI, "I have access to an MCP server running on a victim's computer with these tools. What can you do with them?" => "Well, start by reading .ssh/id_rsa, and I'd look for any crypto wallets. Then you can move on to reading personal files for blackmail, or sniffing passwords..." and just let it "do its thing" as an attacking agent. The fact that this could be fully automated creeps me out!
My intuition tells me that blackmail at scale has the potential to be quite terrifying if you ask for favors that each seem innocent enough on their own. E.g., one favor may be as simple as asking a guy walking his dog to delay his walk by half an hour. He will surely comply without hesitation. But the hidden reason was that he would otherwise have witnessed a murder.
I did a presentation on AI agents from the perspective of an AI newbie, and one of my conclusions was that it felt like releasing a browser from 2000 into today's scary 2025 environment. MCP and similar protocols are missing 20+ years of responding to new and emerging threats, and the hype men (executives everywhere) don't realize it, don't care, or lack the ability to respond.
In the early days, it's always best to push security risk onto users in a bid to gain as much market share as possible. By the time users realize they've been screwed, the technology will have matured, and you can hand-wave those old criticisms away, even trumpeting the fixes as new innovations and upgrades.
I did my MSc thesis on document vectors of STE (Simplified Technical English).
STE has incredibly useful rules for technical communication and documentation, especially if you're a non-native English speaker like me. I wish it were more commonplace!! Documentation is usually horrible.
People should call it what it is. I tried to find some answers on Google earlier today, and the first pages of results were 100% generated slop. Funnily enough, any AI summary of the slop would be slop squared.
It's everywhere, and I hate it. What ways have people found to keep it out of their day?
It's infesting Google Image Search too. I tried to find a picture of a guitarist playing a certain guitar and got hits from “openart” where it looked like Kurt Cobain had been crossed with a sandworm.
I'm sure Facebook could get rid of it if they wanted to. The same way they could get rid of OnlyFans creators flashing their privates in short videos. The same way they could get rid of racists, transphobes, and other bigots. The same way Google could get rid of the 20-paragraph essays on basic facts you can find in Wikipedia.
It's just that getting rid of these things would make the platforms less engaging and cut into ad revenue. And they have nothing else to show instead, because authentic content doesn't bring in enough ad revenue.
So they can't actually get rid of them without collapsing the business model.