tinygrad 0.9.0
The codebase is close to the new line limit of 8000 lines, sitting at 7958. tinygrad is much more usable now.
Just over 1200 commits since 0.8.0.
Release Highlights
- New documentation: https://docs.tinygrad.org
- `gpuctypes` has been brought in tree and is no longer an external dependency. [#3253]
- `AMD=1` and `NV=1` experimental backends that do not require any userspace runtime components like ROCm or CUDA.
  - These backends should reduce the amount of Python time, especially in multi-GPU use cases.
- `PTX=1` for rendering directly to PTX instead of CUDA. [#3139] [#3623] [#3775]
- Nvidia tensor core support. [#3544]
- `THREEFRY=1` for numpy-less random number generation using threefry2x32. [#2601] [#3785]
- More stabilized multi-tensor API.
- With ring all-reduce: [#3000] [#3852]
- Core tinygrad has been refactored into 4 pieces, read more about it here.
- Linearizer and codegen now support generating kernels with multiple outputs.
- Lots of progress towards greater kernel fusion in the scheduler.
- Fusing of ReduceOps with their elementwise children. This trains mnist and gpt2 with ~20% fewer kernels and makes llama inference faster.
- New `LoadOps.ASSIGN` allows fusing optimizer updates with grad.
- Schedule kernels in BFS order. This improves resnet and llama speed.
- W.I.P. for fusing multiple reduces: [#4259] [#4208]
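Scheduling kernels in BFS order means emitting a breadth-first topological order of the kernel dependency graph, which Kahn's algorithm produces naturally. The sketch below illustrates the idea on a toy dependency dict; it is not the scheduler's actual code, and the kernel names are hypothetical.

```python
from collections import deque

def bfs_schedule(deps):
    """Return a breadth-first topological order of a kernel DAG.

    deps maps each kernel to the list of kernels it depends on
    (hypothetical example data, not tinygrad's schedule items).
    """
    indegree = {k: len(v) for k, v in deps.items()}
    children = {k: [] for k in deps}
    for k, vs in deps.items():
        for v in vs:
            children[v].append(k)
    # Start from kernels with no unscheduled dependencies; popping from the
    # left of the queue gives the breadth-first (level-by-level) order.
    queue = deque(k for k, d in indegree.items() if d == 0)
    order = []
    while queue:
        k = queue.popleft()
        order.append(k)
        for child in children[k]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order
```

Compared to a depth-first order, the breadth-first order keeps independent kernels adjacent, which gives the runtime more opportunity to overlap them.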
- MLPerf ResNet and BERT with a W.I.P. UNet3D
- Llama 3 support with a new `llama3.py` that provides an OpenAI-compatible API. [#4576]
- NF4 quantization support in Llama examples. [#4540]
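NF4 (4-bit NormalFloat, from the QLoRA paper) quantizes each block of weights to one of 16 fixed code values derived from normal-distribution quantiles, plus a per-block absmax scale. The sketch below is illustrative, not tinygrad's implementation; the code values are listed to only 4 decimals as an approximation of the published codebook.

```python
# Approximate NF4 codebook (16 values; quantiles of a standard normal per the
# QLoRA paper, truncated to 4 decimals here for illustration).
NF4_CODES = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
             0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def nf4_quantize(block):
    """Quantize one block of floats to 4-bit code indices plus an absmax scale."""
    absmax = max(abs(x) for x in block) or 1.0
    # Normalize into [-1, 1], then snap each value to the nearest code.
    idx = [min(range(16), key=lambda i: abs(x / absmax - NF4_CODES[i])) for x in block]
    return idx, absmax

def nf4_dequantize(idx, absmax):
    """Recover approximate floats from code indices and the block scale."""
    return [NF4_CODES[i] * absmax for i in idx]
```

Storing 4-bit indices plus one scale per block cuts weight memory roughly 4x versus fp16, at the cost of the rounding error visible in the round trip.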
- `label_smoothing` has been added to `sparse_categorical_crossentropy`. [#3568]
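Label smoothing replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution over the classes, which discourages overconfident logits. A minimal single-sample sketch of this common formulation (not tinygrad's actual `sparse_categorical_crossentropy` code):

```python
import math

def smoothed_cross_entropy(logits, label, smoothing=0.1):
    """Cross-entropy for one sample with label smoothing.

    The target distribution is (1 - smoothing) * one_hot(label)
    plus smoothing / num_classes on every class.
    """
    c = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    target = [smoothing / c + (1.0 - smoothing) * (i == label) for i in range(c)]
    return -sum(t * lp for t, lp in zip(target, log_probs))
```

With `smoothing=0.0` this reduces to ordinary cross-entropy; with `smoothing > 0` a confidently correct prediction is penalized slightly, since some target mass sits on the other classes.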
Known Issues
- Using tinygrad in a conda env on macOS is known to cause problems with the `METAL` backend. See #2226.