Does anyone have any solid patterns they can share around the “scenarios”/holdouts concept from the Dark Factory, where you create external system(s) to verify your main one?
I have thought about this a lot, and I have no idea. I work for an "AI-first" company, and we're kind of required to use AI stuff as often as we can, so I make very liberal use of Codex, but I've been shielded from the interview process thus far.
I think I would still kind of ask the same questions, though maybe a bit more conceptual. For example, I might see if I could get someone to explain how to build something, and then ask them about data structures that might be useful (e.g. removing a lock by using an append-only structure). I find that Codex will generally generate something that "works," but without an understanding of data structures and algorithms, its implementation will still be somewhat sub-optimal, meaning that understanding the fundamentals has value, at least for now.
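To illustrate the append-only idea in the parenthetical: here's a minimal, hypothetical sketch in Python, where writers append immutable events to a log instead of mutating shared state behind a lock, and readers fold over a snapshot. (The `AppendOnlyLog` class is my own illustration, not anything Codex produced; the lock-freedom here leans on CPython's GIL making `list.append` atomic, which wouldn't carry over to, say, free-threaded builds or other runtimes.)

```python
import threading

class AppendOnlyLog:
    """Writers append immutable events; entries are never mutated or removed."""

    def __init__(self):
        self._events = []  # append-only

    def record(self, delta):
        # list.append is atomic under CPython's GIL, so concurrent
        # writers need no explicit lock for this operation.
        self._events.append(delta)

    def total(self):
        # Take a length-bounded snapshot, then fold over it; late appends
        # simply aren't included in this read, which stays consistent.
        snapshot = self._events[:len(self._events)]
        return sum(snapshot)

log = AppendOnlyLog()
workers = [
    threading.Thread(target=lambda: [log.record(1) for _ in range(1000)])
    for _ in range(4)
]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(log.total())  # 4 writers x 1000 increments = 4000
```

The trade-off a candidate might articulate: you've swapped lock contention for unbounded log growth and O(n) reads, which you'd address with periodic compaction or running totals.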
For a longer and more biting critique of SF, one should read Private Citizens (2016) by Tony Tulathimutte:
“Capturing the anxious, self-aware mood of young college grads in the aughts, Private Citizens embraces the contradictions of our new century: call it a loving satire.”
I picked up Rejection; he has a keen sense of observation and understanding of people. Still, I found the variations-on-a-theme stories to be a downer, or at least repetitive. By the third story I was hoping for another direction.
Location: New York, NY (Brooklyn/Queens)
Remote: Yes, or hybrid
Willing to relocate: Yes
Technologies: Python, JavaScript, Go, PHP, React, Django, and more
Résumé/CV: https://www.linkedin.com/in/dacks-m-9451428a/
About: I'm a hands-on engineering leader who builds teams that ship products and drive revenue. At Quizlet, I led a 30-person org across Ads and Growth, scaling advertising revenue from $20M to $50M+. Before that, I was CTO at Slader, where I grew the engineering team from 3 to 14 and led the company through its acquisition by Quizlet. I'm happy to be hands-on as much as needed.
How do people evaluate creative writing and emotional intelligence in LLMs? Most benchmarks seem to focus on reasoning or correctness, which feels orthogonal. I've been playing with Kimi K2.5 and it feels much stronger on voice and emotional grounding, but I don't know how to measure that beyond human judgment.
https://openai.com/index/harness-engineering/