I recently came across arxiv.org/abs/2004.08900, which "assumes 2-3 runs" of T5-11B. In fact, we trained T5-11B *once*. That's why we spent 35 pages figuring out how we should train before we started training. You don't want to mess up a training run that big.

12:43 PM · Oct 5, 2020

Replying to @colinraffel
If you had the option to train it again, what would you do differently?
Replying to @colinraffel
Are these 35 pages publicly available?
I was referring to the paper: arxiv.org/abs/1910.10683
Replying to @colinraffel
Just tag the authors then: @yshoham
Replying to @colinraffel
I genuinely think people should have to pay a carbon tax to run experiments at that scale. Whenever I mess up a hyperparameter for a 12-hour single-GPU run, I feel weirdly guilty.
GCP is carbon neutral
Replying to @colinraffel
Good on you, @colinraffel. We’ll add a note! In our experience you can’t settle the parameter settings with paper and pencil or prove them like a theorem, but some experimentation - with small models - is instructive. T5 is great; maybe it could be made even greater?
Replying to @colinraffel
Wow, computation is so expensive that we're back to punch-card levels of planning for runs.
Replying to @colinraffel
That's so true, couldn't agree more... Screwing up even the smallest detail while training such planet-sized NLP models will cost a huge amount!