> triple the training

From what I understand, not quite. It looks like the compute cost of training might be similar, just less parallelisable within a single token sequence. They have to compute the KV for token T before they can use it at T+1, whereas in regular training you can compute the KV at each layer for every position in the sequence at once. You're right that the smallest model took 2.7x longer to train, but I wouldn't be surprised if GPU utilisation was proportionally lower too.
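To make the "same work, less parallelism" point concrete, here is a rough sketch that counts the longest chain of dependent (layer, position) cells. The `feedback` wiring is my own assumption for illustration (each position also reads the top-layer output of earlier positions), not necessarily what the paper does, and `critical_path` is a made-up helper name.

```python
def critical_path(num_layers, seq_len, feedback):
    """Longest chain of dependent (layer, position) cells.

    Both settings evaluate num_layers * seq_len cells in total; only the
    number of steps that must run one after another differs.
    """
    depth = [[0] * seq_len for _ in range(num_layers)]
    for t in range(seq_len):
        for layer in range(num_layers):
            deps = []
            if layer > 0:
                # Regular causal training: read the layer below at positions <= t.
                deps.extend(depth[layer - 1][: t + 1])
            if feedback and t > 0:
                # ASSUMED variant: also read the top-layer output of earlier
                # positions, so position t has to wait for position t-1.
                deps.extend(depth[num_layers - 1][:t])
            depth[layer][t] = 1 + max(deps, default=0)
    return max(max(row) for row in depth)


standard = critical_path(num_layers=4, seq_len=8, feedback=False)
sequential = critical_path(num_layers=4, seq_len=8, feedback=True)
print(standard, sequential)
```

Under that assumption the regular setup has a chain as long as the number of layers, independent of sequence length, while the feedback variant's chain grows with layers times sequence length. That would fit a longer wall-clock time with lower GPU utilisation even if the total FLOPs are close.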