
> triple the training

From what I understand, not quite. The total training compute looks similar, but the work is less parallelisable within a given token sequence. They have to compute the KV for token T before they can use it at T+1, whereas in a regular training process you can compute the KV at each layer for every position in the sequence at once. You're right that the smallest model took 2.7x longer to train, but I wouldn't be surprised if GPU utilisation was proportionally lower too.
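
To make the dependency argument concrete, here is a toy sketch in plain Python. It is not the paper's method: the scalar "KV", the weights `w` and `u`, and the function names are all made up for illustration. It only shows that both variants do one KV computation per token, while the length of the dependency chain differs.

```python
def standard_kv(xs, w):
    """Regular training: each position's KV depends only on its own input,
    so all positions could be computed in one batched step."""
    kvs = [w * x for x in xs]
    kv_computations = len(xs)
    sequential_depth = 1  # no position waits on another
    return kvs, kv_computations, sequential_depth


def feedback_kv(xs, w, u):
    """Hypothetical feedback variant: the KV at T+1 mixes in the KV from T,
    so positions have to be processed in order."""
    kvs = []
    prev = 0.0
    for x in xs:
        prev = w * x + u * prev  # needs the previous token's KV first
        kvs.append(prev)
    kv_computations = len(xs)
    sequential_depth = len(xs)  # one dependent step per token
    return kvs, kv_computations, sequential_depth


xs = [1.0, 2.0, 3.0]
std_kvs, std_count, std_depth = standard_kv(xs, w=2.0)
fb_kvs, fb_count, fb_depth = feedback_kv(xs, w=2.0, u=0.5)
print(std_count, std_depth)  # same number of KV computations, depth 1
print(fb_count, fb_depth)    # same number of KV computations, depth 3
```

Same count of KV computations in both cases, but the second one has a chain as long as the sequence, which is the kind of thing that would show up as lower GPU utilisation rather than more FLOPs.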


