tinygrad 0.9.0
The codebase is close to the new line limit of 8000 lines, sitting at 7958. tinygrad is much more usable now.
Just over 1200 commits since 0.8.0.
Release Highlights
- New documentation: https://docs.tinygrad.org
- `gpuctypes` has been brought in tree and is no longer an external dependency. [#3253]
- `AMD=1` and `NV=1` experimental backends that do not require any userspace runtime components like ROCm or CUDA.
  - These backends should reduce the amount of Python time, especially in multi-GPU use cases.
- `PTX=1` for rendering directly to PTX instead of CUDA. [#3139] [#3623] [#3775]
- Nvidia tensor core support. [#3544]
- `THREEFRY=1` for numpy-less random number generation using threefry2x32. [#2601] [#3785]
- More stabilized multi-tensor API.
- With ring all-reduce: [#3000] [#3852]
- Core tinygrad has been refactored into 4 pieces, read more about it here.
- Linearizer and codegen now support generating kernels with multiple outputs.
- Lots of progress towards greater kernel fusion in the scheduler.
- Fusing of ReduceOps with their elementwise children. This trains mnist and gpt2 with ~20% fewer kernels and makes llama inference faster.
- New `LoadOps.ASSIGN` allows fusing optimizer updates with grad.
- Schedule kernels in BFS order. This improves resnet and llama speed.
- W.I.P. for fusing multiple reduces: [#4259] [#4208]
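Scheduling kernels in BFS order means emitting a breadth-first topological order of the kernel dependency graph, which Kahn's algorithm produces naturally. The sketch below illustrates the idea on a toy dependency dict; it is not the scheduler's actual code, and the kernel names are hypothetical.

```python
from collections import deque

def bfs_schedule(deps):
    """Return a breadth-first topological order of a kernel DAG.

    deps maps each kernel to the list of kernels it depends on
    (hypothetical example data, not tinygrad's schedule items).
    """
    indegree = {k: len(v) for k, v in deps.items()}
    children = {k: [] for k in deps}
    for k, vs in deps.items():
        for v in vs:
            children[v].append(k)
    # Start from kernels with no unscheduled dependencies; popping from the
    # left of the queue gives the breadth-first (level-by-level) order.
    queue = deque(k for k, d in indegree.items() if d == 0)
    order = []
    while queue:
        k = queue.popleft()
        order.append(k)
        for child in children[k]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order
```

Compared to a depth-first order, the breadth-first order keeps independent kernels adjacent, which gives the runtime more opportunity to overlap them.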
- MLPerf ResNet and BERT with a W.I.P. UNet3D
- Llama 3 support with a new `llama3.py` that provides an OpenAI-compatible API. [#4576]
- NF4 quantization support in Llama examples. [#4540]
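NF4 (4-bit NormalFloat, from the QLoRA paper) quantizes each block of weights to one of 16 fixed code values derived from normal-distribution quantiles, plus a per-block absmax scale. The sketch below is illustrative, not tinygrad's implementation; the code values are listed to only 4 decimals as an approximation of the published codebook.

```python
# Approximate NF4 codebook (16 values; quantiles of a standard normal per the
# QLoRA paper, truncated to 4 decimals here for illustration).
NF4_CODES = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
             0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def nf4_quantize(block):
    """Quantize one block of floats to 4-bit code indices plus an absmax scale."""
    absmax = max(abs(x) for x in block) or 1.0
    # Normalize into [-1, 1], then snap each value to the nearest code.
    idx = [min(range(16), key=lambda i: abs(x / absmax - NF4_CODES[i])) for x in block]
    return idx, absmax

def nf4_dequantize(idx, absmax):
    """Recover approximate floats from code indices and the block scale."""
    return [NF4_CODES[i] * absmax for i in idx]
```

Storing 4-bit indices plus one scale per block cuts weight memory roughly 4x versus fp16, at the cost of the rounding error visible in the round trip.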
- `label_smoothing` has been added to `sparse_categorical_crossentropy`. [#3568]
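Label smoothing replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution over the classes, which discourages overconfident logits. A minimal single-sample sketch of this common formulation (not tinygrad's actual `sparse_categorical_crossentropy` code):

```python
import math

def smoothed_cross_entropy(logits, label, smoothing=0.1):
    """Cross-entropy for one sample with label smoothing.

    The target distribution is (1 - smoothing) * one_hot(label)
    plus smoothing / num_classes on every class.
    """
    c = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    target = [smoothing / c + (1.0 - smoothing) * (i == label) for i in range(c)]
    return -sum(t * lp for t, lp in zip(target, log_probs))
```

With `smoothing=0.0` this reduces to ordinary cross-entropy; with `smoothing > 0` a confidently correct prediction is penalized slightly, since some target mass sits on the other classes.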
Known Issues
- Using tinygrad in a conda env on macOS is known to cause problems with the `METAL` backend. See #2226.