Decision-tree GPT Transformer
Decision-tree-based DeepSpeed implementation for spice MBFOP.
- Input
  - 6650-dim embedding
- Encoder
  - 37 x Transformer with 14 heads
- Output
  - bleu projection
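
The dimensions listed above can be sanity-checked with a small configuration sketch. This is only an illustration, not the project's actual code: the class and field names (`ModelConfig`, `embed_dim`, `num_layers`, `num_heads`) are hypothetical, and it assumes that "37 x Transformer with 14 heads" means 37 stacked layers with 14 attention heads each, with the embedding split evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the architecture numbers listed above."""
    embed_dim: int = 6650   # "6650-dim embedding"
    num_layers: int = 37    # assumed meaning of "37 x Transformer"
    num_heads: int = 14     # assumed to be heads per layer

    def __post_init__(self) -> None:
        # Standard multi-head attention splits the embedding evenly across heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Per-head width under the even-split assumption."""
        return self.embed_dim // self.num_heads


cfg = ModelConfig()
print(cfg.head_dim)
```

Under these assumptions the embedding divides evenly, giving a per-head width of 6650 / 14 = 475.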
Training config
optimizer=LARS, lr=0.169, scheduler=linear, warmup=1106
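
One possible reading of `scheduler=linear, warmup=1106` is sketched below in plain Python. Several things here are assumptions rather than facts from the config: that `warmup` counts optimizer steps, that the rate ramps linearly from 0 to `lr` during warmup, and that it then decays linearly to 0. The `total_steps` argument is hypothetical, since the config does not state the training length, and the function name `linear_lr` is made up for illustration.

```python
BASE_LR = 0.169       # lr from the training config
WARMUP_STEPS = 1106   # warmup from the training config (assumed to be steps)


def linear_lr(step: int, total_steps: int,
              base_lr: float = BASE_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to base_lr, then linear decay to zero (assumed shape)."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


# total_steps=10_000 is an arbitrary value chosen only for this example.
for s in (0, 553, 1106, 10_000):
    print(s, linear_lr(s, total_steps=10_000))
```

If the schedule in use holds the rate constant after warmup instead of decaying, only the decay branch would change.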