[PyTorch] Documentation for op fuser API #2447
Conversation
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Greptile Overview

Greptile Summary: Added comprehensive documentation for the operation fuser API, including a detailed usage guide with code examples, diagrams illustrating operation-fusion patterns, and improved docstring formatting across Python modules.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Dev as Developer
    participant Docs as Documentation System
    participant API as API Docs Generator
    participant Code as Python Modules
    Dev->>Docs: Create op_fuser.rst guide
    Note over Docs: Basic usage examples<br/>Quantization guide<br/>Branching operations<br/>Implementation details
    Dev->>Docs: Add example diagrams
    Note over Docs: layernorm_mlp.png<br/>fp8_layernorm_linear.png<br/>residual_layernorm_mlp.png
    Dev->>Code: Update docstrings
    Note over Code: Fix backtick formatting<br/>Fix hyperlink spacing<br/>Standardize None/True/False
    Code->>Code: activation.py
    Code->>Code: basic_linear.py
    Code->>Code: linear.py
    Code->>Code: op.py
    Code->>Code: other ops modules
    Dev->>API: Add op fuser classes
    Note over API: Sequential<br/>FusibleOperation<br/>Linear<br/>All operation classes
    API->>Docs: Generate API reference
    Dev->>Docs: Update index.rst
    Docs->>Docs: Include op_fuser guide in TOC
    Note over Dev,Docs: Complete documentation<br/>for op fuser API
```
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Review suggestion from @greptile-apps Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Additional Comments (1)
- transformer_engine/pytorch/ops/basic/activation.py, line 387 (link): syntax: Extra space before period.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
19 files reviewed, 1 comment
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
/te-ci core pytorch
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
No files reviewed, no comments
At the most basic level, the operation fuser API involves two classes in the ``transformer_engine.pytorch.ops`` submodule:

- ``FusibleOperation``: An abstract base class for tensor operations. Examples include ``Linear``, ``LayerNorm``, and ``AllReduce``. It is a subclass of ``torch.nn.Module``, so it can hold trainable parameters and can be called to perform the operation's forward pass.
- ``Sequential``: A container of modules in sequential order. It has an interface very similar to ``torch.nn.Sequential``. If it contains any ``FusibleOperation`` s, then it may attempt to fuse them in the forward and backward passes.

Thus, using the operation fuser simply involves constructing ``FusibleOperation`` s and passing them into a ``Sequential``, as in the sketch below.
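A minimal sketch of that pattern, assuming the op names shown in this PR (``LayerNorm``, ``Linear``, ``GELU``) and plausible constructor arguments; the exact signatures may differ from the final API:

```python
import torch
import transformer_engine.pytorch as te

# Build a small MLP out of FusibleOperations. Wrapping them in
# te.ops.Sequential lets the fuser try to fuse adjacent ops
# (e.g. LayerNorm + Linear) in the forward and backward passes.
model = te.ops.Sequential(
    te.ops.LayerNorm(1024),
    te.ops.Linear(1024, 4096),
    te.ops.GELU(),
    te.ops.Linear(4096, 1024),
)

x = torch.randn(32, 1024, device="cuda", requires_grad=True)
y = model(x)        # forward pass, possibly with fused kernels
y.sum().backward()  # backward pass may also use fused kernels
```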
Who is the intended audience of this documentation? On one hand it seems to be the user (since you show examples of how things could be written); on the other hand, you also include implementation details.
This is an expert technique. Quantizer configurations can be quite complicated, so the ``Quantize`` operation's quantizers may be suboptimal.
Not sure what that means - any examples?
For MXFP8, it's not safe for the quantize op to produce an MXFP8Tensor with swizzled scales. There's no way to know if it will be consumed by a GEMM or by something else.
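For illustration, a hedged sketch of the expert technique in question: placing an explicit ``Quantize`` op ahead of a linear so the linear can consume an already-quantized input. The constructor arguments here are assumptions, not the documented API:

```python
import torch
import transformer_engine.pytorch as te

# An explicit Quantize op makes downstream ops see a quantized tensor.
# Whether its quantizer is what the following GEMM actually wants
# (e.g. MXFP8 with or without swizzled scales) is exactly the subtlety
# raised above, since the Quantize op cannot know its consumer.
model = te.ops.Sequential(
    te.ops.Quantize(),
    te.ops.Linear(1024, 1024),
)

with te.fp8_autocast():
    y = model(torch.randn(32, 1024, device="cuda"))
```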
the block has been split into two sections, each with one branching operation.

Implementation details
Yeah, I think this file should be split into 2 (maybe 3) separate sections - one primarily user facing with the sections describing how to use sequential, maybe second one showing how to define your own fusion with a user-provided kernel, and then the third one showing those internal implementation details.
- **The op fuser is not interchangeable with the monolithic TE modules**: Modules like ``Linear``, ``LayerNormLinear``, and ``TransformerLayer`` support a wide range of features and advanced workflows, which makes them challenging to decompose into simple operations that work with the fuser. They are also carefully hand-tuned to achieve maximum performance.
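To make the distinction concrete, a hedged comparison of a monolithic module with a rough fuser-based decomposition; it assumes ``te.LayerNormLinear`` and the op constructors shown, and the two paths are not feature-equivalent:

```python
import torch
import transformer_engine.pytorch as te

# Monolithic, hand-tuned module:
mono = te.LayerNormLinear(1024, 4096)

# A roughly equivalent op-fuser decomposition. It is not a drop-in
# replacement: the monolithic module supports features and tuning
# that the decomposed form may not.
fused = te.ops.Sequential(
    te.ops.LayerNorm(1024),
    te.ops.Linear(1024, 4096),
)

x = torch.randn(32, 1024, device="cuda")
y_mono = mono(x)
y_fused = fused(x)
```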
We would like to get to the point where the sequential is the default, right? So while right now this is true, it may not be in the future.
2 files reviewed, 2 comments
```diff
- `GLU Variants Improve Transformer<https://arxiv.org/abs/2002.05202>`__
- and `Gaussian Error Linear Units (GELUs)<https://arxiv.org/abs/1606.08415>`__.
+ `GLU Variants Improve Transformer <https://arxiv.org/abs/2002.05202>`__
+ and `Gaussian Error Linear Units (GELUs) <https://arxiv.org/abs/1606.08415>`__ .
```
Extra space before period
```diff
- and `Gaussian Error Linear Units (GELUs) <https://arxiv.org/abs/1606.08415>`__ .
+ and `Gaussian Error Linear Units (GELUs) <https://arxiv.org/abs/1606.08415>`__.
```
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
.. warning::
    The input tensor is chunked along the last dimension to get
    gates/pre-activations which is differnt from GPT OSS
Typo: "differnt" should be "different"
```diff
- gates/pre-activations which is differnt from GPT OSS
+ gates/pre-activations which is different from GPT OSS
```
Description
This PR adds a basic usage guide for the op fuser and includes it in the autogenerated API docs.
It is ready as-is, but if reviews take a while I may expand it with a guide on creating custom fused ops.
Type of change
Changes
Checklist: