    How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp

    June 2, 2026 | 3 Mins Read
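The listing that follows uses `DEV`, `APEX_OK`, `HAS_AMP_C`, `HAS_FLN`, `FusedAdam`, and `FusedLayerNorm` without defining them. Their original definitions are not shown, so the block below is only a hypothetical reconstruction of what such a setup might look like. It assumes that `APEX_OK` means "the `apex` package imports", and that `HAS_AMP_C` and `HAS_FLN` mean the compiled extensions `amp_C` and `fused_layer_norm_cuda` are present. The helper `_can_import` is an illustrative name, not part of Apex or PyTorch.

```python
# Hypothetical reconstruction of the setup flags; the article's own definitions are not shown.
try:
    import torch
except ImportError:  # lets the sketch run even where PyTorch is absent
    torch = None

# Assumed meaning: run on the GPU when one is visible, otherwise fall back to the CPU.
DEV = "cuda" if (torch is not None and torch.cuda.is_available()) else "cpu"


def _can_import(name):
    """Return True if module `name` imports cleanly in this environment."""
    try:
        __import__(name)
        return True
    except Exception:
        return False


# Assumed meaning: the apex package is installed at all.
APEX_OK = torch is not None and _can_import("apex")
# Assumed meaning: the compiled extensions behind FusedAdam / FusedLayerNorm were built.
HAS_AMP_C = APEX_OK and _can_import("amp_C")
HAS_FLN = APEX_OK and _can_import("fused_layer_norm_cuda")

# Only bind the fused classes when their extension is available.
FusedAdam = FusedLayerNorm = None
if HAS_AMP_C:
    from apex.optimizers import FusedAdam
if HAS_FLN:
    from apex.normalization import FusedLayerNorm

print(f"DEV={DEV} APEX_OK={APEX_OK} HAS_AMP_C={HAS_AMP_C} HAS_FLN={HAS_FLN}")
```

With flags like these, the listing falls back to `torch.nn.LayerNorm` and `torch.optim.AdamW` whenever a fused kernel is missing. Note that the benchmark itself still calls `torch.cuda.synchronize()`, so it expects a CUDA device either way.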

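    import time

    import torch

    # DEV, APEX_OK, HAS_FLN, HAS_AMP_C, FusedAdam and FusedLayerNorm are expected to be
    # defined before this point (the listing starts at Section D).
    print("\n### SECTION D: end-to-end Transformer (vanilla fp32 vs Apex fused + AMP) ###")
    # Vocabulary size, model width, attention heads, layers, sequence length, batch size, timed steps.
    VOCAB, D, NHEAD, LAYERS, SEQ, BATCH, STEPS = 2000, 256, 4, 4, 128, 32, 60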
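    class Block(torch.nn.Module):
        """Pre-norm Transformer block: self-attention, then a 4x-wide GELU feed-forward."""

        def __init__(self, d, nhead, norm_cls):
            super().__init__()
            self.attn = torch.nn.MultiheadAttention(d, nhead, batch_first=True)
            self.ff = torch.nn.Sequential(
                torch.nn.Linear(d, 4 * d),
                torch.nn.GELU(),
                torch.nn.Linear(4 * d, d),
            )
            self.n1, self.n2 = norm_cls(d), norm_cls(d)

        def forward(self, x):
            # Residual connection around attention, then around the feed-forward.
            # No attention mask is passed: this is a throughput benchmark.
            h = self.n1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            return x + self.ff(self.n2(x))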
    class TinyTransformer(torch.nn.Module):
        """Embedding -> LAYERS blocks -> final norm -> vocabulary logits."""

        def __init__(self, norm_cls):
            super().__init__()
            self.emb = torch.nn.Embedding(VOCAB, D)
            self.blocks = torch.nn.ModuleList([Block(D, NHEAD, norm_cls) for _ in range(LAYERS)])
            self.norm = norm_cls(D)
            self.head = torch.nn.Linear(D, VOCAB)

        def forward(self, idx):
            x = self.emb(idx)
            for b in self.blocks:
                x = b(x)
            return self.head(self.norm(x))
    # Fixed random token batch; targets are the inputs shifted left by one position.
    g = torch.Generator(device="cpu").manual_seed(0)
    data = torch.randint(0, VOCAB, (BATCH, SEQ + 1), generator=g).to(DEV)
    inp, tgt = data[:, :-1], data[:, 1:]
    lossfn = torch.nn.CrossEntropyLoss()
    def run_training(use_apex):
        """Run 5 warm-up and STEPS timed steps; return (final loss, tokens/s, seconds)."""
        torch.manual_seed(0)
        norm_cls = FusedLayerNorm if (use_apex and HAS_FLN and APEX_OK) else torch.nn.LayerNorm
        model = TinyTransformer(norm_cls).to(DEV)
        if use_apex and HAS_AMP_C and APEX_OK:
            optimizer = FusedAdam(model.parameters(), lr=3e-4)
        else:
            optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
        # With enabled=False both the scaler and autocast are no-ops, so the vanilla run stays in fp32.
        scaler = torch.amp.GradScaler("cuda", enabled=use_apex)

        def one_step():
            optimizer.zero_grad(set_to_none=True)
            with torch.amp.autocast("cuda", dtype=torch.float16, enabled=use_apex):
                logits = model(inp)
                loss = lossfn(logits.reshape(-1, VOCAB), tgt.reshape(-1))
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            return loss

        # Warm-up steps are excluded from the timing.
        for _ in range(5):
            one_step()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(STEPS):
            loss = one_step()
        # Wait for queued GPU work to finish before reading the clock.
        torch.cuda.synchronize()
        dt = time.perf_counter() - t0
        return loss.item(), (STEPS * BATCH * SEQ) / dt, dt
    loss_v, tps_v, dt_v = run_training(use_apex=False)
    print(f" vanilla (fp32, nn.LayerNorm, AdamW) : "
          f"{dt_v:5.2f}s | {tps_v:9.0f} tok/s | final loss {loss_v:.3f}")
    if APEX_OK and (HAS_AMP_C or HAS_FLN):
        loss_a, tps_a, dt_a = run_training(use_apex=True)
        print(f" apex (fp16, FusedLayerNorm, FusedAdam) : "
              f"{dt_a:5.2f}s | {tps_a:9.0f} tok/s | final loss {loss_a:.3f}")
        print(f" ---> speedup: {tps_a / tps_v:0.2f}x throughput")
    else:
        print(" apex path SKIPPED (no fused kernels built)")
    print("\n" + "=" * 78)
    print("DONE. Key takeaways:")
    print(" - FusedAdam/FusedLayerNorm/FusedRMSNorm are the still-relevant Apex pieces;")
    print("   speedups grow with model size & parameter count (tiny demo understates it).")
    print(" - apex.amp is deprecated -> prefer torch.amp.autocast + torch.amp.GradScaler.")
    print(" - FusedAdam composes cleanly with native torch.amp (Section D).")
    print(" - On real workloads, also try a larger model and bf16 autocast (no scaler needed).")
    print("=" * 78)
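The last takeaway mentions bf16 autocast without a scaler, which the listing above does not demonstrate. The following is a minimal sketch of that variant, not the article's own code. The two-layer model, the random batch, and the names `model`, `x`, `y`, and `losses` are illustrative stand-ins, and the CPU fallback is an assumption made so the sketch can run without a bf16-capable GPU.

```python
import torch

# Use CUDA only where bf16 is supported; otherwise CPU autocast also accepts bfloat16.
use_cuda = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
device = "cuda" if use_cuda else "cpu"

torch.manual_seed(0)
# Hypothetical stand-in model and data, much smaller than TinyTransformer.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.GELU(), torch.nn.Linear(32, 4)
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
x = torch.randn(8, 16, device=device)
y = torch.randint(0, 4, (8,), device=device)
loss_fn = torch.nn.CrossEntropyLoss()

before = [p.detach().clone() for p in model.parameters()]
losses = []
for _ in range(3):
    optimizer.zero_grad(set_to_none=True)
    # bf16 keeps the fp32 exponent range, so no GradScaler is used here.
    with torch.amp.autocast(device, dtype=torch.bfloat16):
        logits = model(x)
        # Compute the loss on fp32 logits to keep the reduction in full precision.
        loss = loss_fn(logits.float(), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(logits.dtype, [round(v, 4) for v in losses])
```

Compared with the fp16 path in `run_training`, the `scaler.scale(...)`, `scaler.step(...)`, and `scaler.update()` calls disappear and the optimizer is stepped directly. The parameters themselves stay in fp32; only the operations autocast selects run in bf16.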
    Fintech Fetch Editorial Team