Hacker News

I don't really know how these models work internally, but I had a theory: just as the models have limited attention, so do the safety layers. I simply populated enough of the context with 'malicious' text, without making the model trip, so that the internal attention budget was "wasted" on tokens early in the prompt, effectively ignoring all the tokens that came after.

