pytorch-optimizer

optimizer & lr scheduler & objective function collections in PyTorch
The reasons why you should use pytorch-optimizer:

  • Wide range of supported optimizers. Currently, 106 optimizers (+ bitsandbytes, q-galore, torchao), 16 lr schedulers, and 13 loss functions are supported!
  • Includes many variants such as ADOPT, Cautious, AdamD, StableAdamW, and Gradient Centralization
  • Easy-to-use, clean, and tested code
  • Active maintenance
  • Somewhat more optimized than the original implementations

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the documentation.

Most optimizers are under the MIT or Apache 2.0 license, but a few, such as Fromage and Nero, are under the CC BY-NC-SA 4.0 license, which is non-commercial. Please double-check the license before using them in your work.

Installation

$ pip3 install pytorch-optimizer

Starting from v2.12.0 and v3.1.0, you can use the bitsandbytes, q-galore-torch, and torchao optimizers, respectively. Please check the bnb requirements, q-galore-torch installation, and torchao installation guides before installing them.

From v3.0.0, Python 3.7 support is dropped. However, you can still install this package on Python 3.7 with the --ignore-requires-python option.
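
For example:

$ pip3 install --ignore-requires-python pytorch-optimizer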

Simple Usage

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# or use the optimizer loader by simply passing the name of the optimizer.

from pytorch_optimizer import load_optimizer

optimizer = load_optimizer(optimizer='adamp')(model.parameters())

# if you install `bitsandbytes`, you can use its `8-bit` optimizers through `pytorch-optimizer`.

optimizer = load_optimizer(optimizer='bnb_adamw8bit')(model.parameters())

Also, you can load the optimizer via torch.hub.

import torch

model = YourModel()

opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())

If you want to build an optimizer with parameters & configs, there's the create_optimizer() API.

from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
    use_orthograd=False,
)
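
Optimizers built this way follow the standard torch.optim.Optimizer interface, so a typical training step looks like the sketch below (YourModel, your_dataloader, and the hyperparameters are placeholders).

import torch

from pytorch_optimizer import create_optimizer

model = YourModel()
criterion = torch.nn.CrossEntropyLoss()
optimizer = create_optimizer(model, 'adamp', lr=1e-3, weight_decay=1e-3)

for inputs, targets in your_dataloader:  # placeholder data loader
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()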

Supported Optimizers

You can check the supported optimizers with the code below,

from pytorch_optimizer import get_supported_optimizers

supported_optimizers = get_supported_optimizers()

or search them with one or more filters.

from pytorch_optimizer import get_supported_optimizers

get_supported_optimizers('adam*')
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw']

get_supported_optimizers(['adam*', 'ranger*'])
# ['adamax', 'adamg', 'adammini', 'adamod', 'adamp', 'adams', 'adamw', 'ranger', 'ranger21']
Optimizer | Description | Official Code | Paper | Citation
AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | https://arxiv.org/abs/2010.07468 | cite
AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | https://openreview.net/forum?id=Bkg3g2R9FX | cite
AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | https://arxiv.org/abs/2006.00719 | cite
AdamD | Improved bias-correction in Adam | | https://arxiv.org/abs/2110.10828 | cite
AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | https://arxiv.org/abs/2006.08217 | cite
diffGrad | An Optimization Method for Convolutional Neural Networks | github | https://arxiv.org/abs/1909.11015v3 | cite
MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic | github | https://arxiv.org/abs/2101.11075 | cite
RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | https://arxiv.org/abs/1908.03265 | cite
Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | https://bit.ly/3zyspC3 | cite
Ranger21 | a synergistic deep learning optimizer | github | https://arxiv.org/abs/2106.13731 | cite
Lamb | Large Batch Optimization for Deep Learning | github | https://arxiv.org/abs/1904.00962 | cite
Shampoo | Preconditioned Stochastic Tensor Optimization | github | https://arxiv.org/abs/1802.09568 | cite
Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | https://arxiv.org/abs/2102.07227 | cite
Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | https://arxiv.org/abs/2208.06677 | cite
Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | https://arxiv.org/abs/2006.15815 | cite
SAM | Sharpness-Aware Minimization | github | https://arxiv.org/abs/2010.01412 | cite
ASAM | Adaptive Sharpness-Aware Minimization | github | https://arxiv.org/abs/2102.11600 | cite
GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | https://openreview.net/pdf?id=edONMAnhLu- | cite
D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | https://arxiv.org/abs/2301.07733 | cite
AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | https://arxiv.org/abs/1804.04235 | cite
Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | https://arxiv.org/abs/2009.13586 | cite
NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | https://arxiv.org/abs/1905.11286 | cite
Lion | Symbolic Discovery of Optimization Algorithms | github | https://arxiv.org/abs/2302.06675 | cite
Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | https://arxiv.org/abs/1906.05661 | cite
SM3 | Memory-Efficient Adaptive Optimization | github | https://arxiv.org/abs/1901.11150 | cite
AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | https://arxiv.org/abs/2210.06364 | cite
RotoGrad | Gradient Homogenization in Multitask Learning | github | https://openreview.net/pdf?id=T8wHz4rnuGL | cite
A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | https://arxiv.org/abs/1810.00553 | cite
AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | https://arxiv.org/abs/1704.08227 | cite
SGDW | Decoupled Weight Decay Regularization | github | https://arxiv.org/abs/1711.05101 | cite
ASGD | Adaptive Gradient Descent without Descent | github | https://arxiv.org/abs/1910.09529 | cite
Yogi | Adaptive Methods for Nonconvex Optimization | | NIPS 2018 | cite
SWATS | Improving Generalization Performance by Switching from Adam to SGD | | https://arxiv.org/abs/1712.07628 | cite
Fromage | On the distance between two neural networks and the stability of learning | github | https://arxiv.org/abs/2002.03432 | cite
MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | https://arxiv.org/abs/1705.07774 | cite
AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | https://arxiv.org/abs/1910.12249 | cite
AggMo | Aggregated Momentum: Stability Through Passive Damping | github | https://arxiv.org/abs/1804.00325 | cite
QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | https://arxiv.org/abs/1810.06801 | cite
PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | CVPR 18 | cite
Gravity | a Kinematic Approach on Optimization in Deep Learning | github | https://arxiv.org/abs/2101.09192 | cite
AdaSmooth | An Adaptive Learning Rate Method based on Effective Ratio | | https://arxiv.org/abs/2204.00825v1 | cite
SRMM | Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates | github | https://arxiv.org/abs/2201.01652 | cite
AvaGrad | Domain-independent Dominance of Adaptive Methods | github | https://arxiv.org/abs/1912.01823 | cite
PCGrad | Gradient Surgery for Multi-Task Learning | github | https://arxiv.org/abs/2001.06782 | cite
AMSGrad | On the Convergence of Adam and Beyond | | https://openreview.net/pdf?id=ryQu7f-RZ | cite
Lookahead | k steps forward, 1 step back | github | https://arxiv.org/abs/1907.08610 | cite
PNM | Manipulating Stochastic Gradient Noise to Improve Generalization | github | https://arxiv.org/abs/2103.17182 | cite
GC | Gradient Centralization | github | https://arxiv.org/abs/2004.01461 | cite
AGC | Adaptive Gradient Clipping | github | https://arxiv.org/abs/2102.06171 | cite
Stable WD | Understanding and Scheduling Weight Decay | github | https://arxiv.org/abs/2011.11152 | cite
Softplus T | Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | | https://arxiv.org/abs/1908.00700 | cite
Un-tuned w/u | On the adequacy of untuned warmup for adaptive optimization | | https://arxiv.org/abs/1910.04209 | cite
Norm Loss | An efficient yet effective regularization method for deep neural networks | | https://arxiv.org/abs/2103.06583 | cite
AdaShift | Decorrelation and Convergence of Adaptive Learning Rate Methods | github | https://arxiv.org/abs/1810.00143v4 | cite
AdaDelta | An Adaptive Learning Rate Method | | https://arxiv.org/abs/1212.5701v1 | cite
Amos | An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | github | https://arxiv.org/abs/2210.11693 | cite
SignSGD | Compressed Optimisation for Non-Convex Problems | github | https://arxiv.org/abs/1802.04434 | cite
Sophia | A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | github | https://arxiv.org/abs/2305.14342 | cite
Prodigy | An Expeditiously Adaptive Parameter-Free Learner | github | https://arxiv.org/abs/2306.06101 | cite
PAdam | Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | github | https://arxiv.org/abs/1806.06763 | cite
LOMO | Full Parameter Fine-tuning for Large Language Models with Limited Resources | github | https://arxiv.org/abs/2306.09782 | cite
AdaLOMO | Low-memory Optimization with Adaptive Learning Rate | github | https://arxiv.org/abs/2310.10195 | cite
Tiger | A Tight-fisted Optimizer, an optimizer that is extremely budget-conscious | github | | cite
CAME | Confidence-guided Adaptive Memory Efficient Optimization | github | https://aclanthology.org/2023.acl-long.243/ | cite
WSAM | Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | github | https://arxiv.org/abs/2305.15817 | cite
Aida | A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range | github | https://arxiv.org/abs/2203.13273 | cite
GaLore | Memory-Efficient LLM Training by Gradient Low-Rank Projection | github | https://arxiv.org/abs/2403.03507 | cite
Adalite | Adalite optimizer | github | https://github.com/VatsaDev/adalite | cite
bSAM | SAM as an Optimal Relaxation of Bayes | github | https://arxiv.org/abs/2210.01620 | cite
Schedule-Free | Schedule-Free Optimizers | github | https://github.com/facebookresearch/schedule_free | cite
FAdam | Adam is a natural gradient optimizer using diagonal empirical Fisher information | github | https://arxiv.org/abs/2405.12807 | cite
Grokfast | Accelerated Grokking by Amplifying Slow Gradients | github | https://arxiv.org/abs/2405.20233 | cite
Kate | Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad | github | https://arxiv.org/abs/2403.02648 | cite
StableAdamW | Stable and low-precision training for large-scale vision-language models | | https://arxiv.org/abs/2304.13013 | cite
AdamMini | Use Fewer Learning Rates To Gain More | github | https://arxiv.org/abs/2406.16793 | cite
TRAC | Adaptive Parameter-free Optimization | github | https://arxiv.org/abs/2405.16642 | cite
AdamG | Towards Stability of Parameter-free Optimization | | https://arxiv.org/abs/2405.04376 | cite
AdEMAMix | Better, Faster, Older | github | https://arxiv.org/abs/2409.03137 | cite
SOAP | Improving and Stabilizing Shampoo using Adam | github | https://arxiv.org/abs/2409.11321 | cite
ADOPT | Modified Adam Can Converge with Any β2 with the Optimal Rate | github | https://arxiv.org/abs/2411.02853 | cite
FTRL | Follow The Regularized Leader | | https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf |
Cautious | Improving Training with One Line of Code | github | https://arxiv.org/pdf/2411.16085v1 | cite
DeMo | Decoupled Momentum Optimization | github | https://arxiv.org/abs/2411.19870 | cite
MicroAdam | Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence | github | https://arxiv.org/abs/2405.15593 | cite
Muon | MomentUm Orthogonalized by Newton-schulz | github | https://x.com/kellerjordan0/status/1842300916864844014 | cite
LaProp | Separating Momentum and Adaptivity in Adam | github | https://arxiv.org/abs/2002.04839 | cite
APOLLO | SGD-like Memory, AdamW-level Performance | github | https://arxiv.org/abs/2412.05270 | cite
MARS | Unleashing the Power of Variance Reduction for Training Large Models | github | https://arxiv.org/abs/2411.10438 | cite
SGDSaI | No More Adam: Learning Rate Scaling at Initialization is All You Need | github | https://arxiv.org/abs/2411.10438 | cite
Grams | Gradient Descent with Adaptive Momentum Scaling | | https://arxiv.org/abs/2412.17107 | cite
OrthoGrad | Grokking at the Edge of Numerical Stability | github | https://arxiv.org/abs/2501.04697 | cite
Adam-ATAN2 | Scaling Exponents Across Parameterizations and Optimizers | | https://arxiv.org/abs/2407.05872 | cite
SPAM | Spike-Aware Adam with Momentum Reset for Stable LLM Training | github | https://arxiv.org/abs/2501.06842 | cite
TAM | Torque-Aware Momentum | | https://arxiv.org/abs/2412.18790 | cite
FOCUS | First Order Concentrated Updating Scheme | github | https://arxiv.org/abs/2501.12243 | cite
PSGD | Preconditioned Stochastic Gradient Descent | github | https://arxiv.org/abs/1512.04202 | cite
EXAdam | The Power of Adaptive Cross-Moments | github | https://arxiv.org/abs/2412.20302 | cite
GCSAM | Gradient Centralized Sharpness Aware Minimization | github | https://arxiv.org/abs/2501.11584 | cite
LookSAM | Towards Efficient and Scalable Sharpness-Aware Minimization | github | https://arxiv.org/abs/2203.02714 | cite
SCION | Training Deep Learning Models with Norm-Constrained LMOs | github | https://arxiv.org/abs/2502.07529 | cite
COSMOS | SOAP with Muon | github | |
StableSPAM | How to Train in 4-Bit More Stably than 16-Bit Adam | github | https://arxiv.org/abs/2502.17055 |
AdaGC | Improving Training Stability for Large Language Model Pretraining | | https://arxiv.org/abs/2502.11034 | cite
Simplified-Ademamix | Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants | github | https://arxiv.org/abs/2502.02431 | cite
Fira | Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? | github | https://arxiv.org/abs/2410.01623 | cite
RACS & Alice | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | | https://arxiv.org/pdf/2502.07752 | cite
VSGD | Variational Stochastic Gradient Descent for Deep Neural Networks | github | https://openreview.net/forum?id=xu4ATNjcdy | cite

Supported LR Scheduler

You can check the supported learning rate schedulers with the code below,

from pytorch_optimizer import get_supported_lr_schedulers

supported_lr_schedulers = get_supported_lr_schedulers()

or search them with one or more filters.

from pytorch_optimizer import get_supported_lr_schedulers

get_supported_lr_schedulers('cosine*')
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup']

get_supported_lr_schedulers(['cosine*', '*warm*'])
# ['cosine', 'cosine_annealing', 'cosine_annealing_with_warm_restart', 'cosine_annealing_with_warmup', 'warmup_stable_decay']
LR Scheduler | Description | Official Code | Paper | Citation
Explore-Exploit | Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule | | https://arxiv.org/abs/2003.03977 | cite
Chebyshev | Acceleration via Fractal Learning Rate Schedules | | https://arxiv.org/abs/2103.01338 | cite
REX | Revisiting Budgeted Training with an Improved Schedule | github | https://arxiv.org/abs/2107.04197 | cite
WSD | Warmup-Stable-Decay learning rate scheduler | github | https://arxiv.org/abs/2404.06395 | cite
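
Because the optimizers are standard torch.optim.Optimizer subclasses, they also work with any scheduler that exposes the usual step() interface. Below is a minimal sketch using PyTorch's built-in cosine annealing as a stand-in (YourModel and train_one_epoch are placeholders); the package's own schedulers are used the same way.

import torch

from pytorch_optimizer import load_optimizer

model = YourModel()
optimizer = load_optimizer(optimizer='adamp')(model.parameters())
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    train_one_epoch(model, optimizer)  # placeholder training function
    lr_scheduler.step()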

Supported Loss Function

You can check the supported loss functions with the code below,

from pytorch_optimizer import get_supported_loss_functions

supported_loss_functions = get_supported_loss_functions()

or search them with one or more filters.

from pytorch_optimizer import get_supported_loss_functions

get_supported_loss_functions('*focal*')
# ['bcefocalloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']

get_supported_loss_functions(['*focal*', 'bce*'])
# ['bcefocalloss', 'bceloss', 'focalcosineloss', 'focalloss', 'focaltverskyloss']
Loss Functions | Description | Official Code | Paper | Citation
Label Smoothing | Rethinking the Inception Architecture for Computer Vision | | https://arxiv.org/abs/1512.00567 | cite
Focal | Focal Loss for Dense Object Detection | | https://arxiv.org/abs/1708.02002 | cite
Focal Cosine | Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble | | https://arxiv.org/abs/2007.07805 | cite
LDAM | Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss | github | https://arxiv.org/abs/1906.07413 | cite
Jaccard (IOU) | IoU Loss for 2D/3D Object Detection | | https://arxiv.org/abs/1908.03851 | cite
Bi-Tempered | The Principle of Unchanged Optimality in Reinforcement Learning Generalization | | https://arxiv.org/abs/1906.03361 | cite
Tversky | Tversky loss function for image segmentation using 3D fully convolutional deep networks | | https://arxiv.org/abs/1706.05721 | cite
Lovasz Hinge | A tractable surrogate for the optimization of the intersection-over-union measure in neural networks | github | https://arxiv.org/abs/1705.08790 | cite
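
The loss functions are regular torch.nn.Module classes, so they drop into a training step like any other criterion. Here is a minimal sketch; the FocalLoss class name, its default arguments, and the binary-logits call pattern are assumptions, so check get_supported_loss_functions() and the documentation for the exact signatures.

import torch

from pytorch_optimizer import FocalLoss  # class name assumed; see the documentation

criterion = FocalLoss()  # default arguments assumed

y_pred = torch.randn(8, 1, requires_grad=True)  # logits; shapes are illustrative only
y_true = torch.randint(0, 2, (8, 1)).float()

loss = criterion(y_pred, y_true)
loss.backward()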

Useful Resources

Several optimization ideas to regularize & stabilize training. Most of these ideas are applied in the Ranger21 optimizer.

Also, most of the figures are taken from the Ranger21 paper.

  • Adaptive Gradient Clipping
  • Gradient Centralization
  • Softplus Transformation
  • Gradient Normalization
  • Norm Loss
  • Positive-Negative Momentum
  • Linear learning rate warmup
  • Stable weight decay
  • Explore-exploit learning rate schedule
  • Lookahead
  • Chebyshev learning rate schedule
  • (Adaptive) Sharpness-Aware Minimization
  • On the Convergence of Adam and Beyond
  • Improved bias-correction in Adam
  • Adaptive Gradient Norm Correction

Adaptive Gradient Clipping

This idea was originally proposed in the NFNet (Normalizer-Free Networks) paper. AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
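
A minimal, tensor-wise sketch of the idea (the paper and the library compute unit-wise norms, e.g. per output row; this simplified version is for illustration only).

def adaptive_gradient_clipping(parameters, clip_factor: float = 1e-2, eps: float = 1e-3) -> None:
    # rescale each gradient whose norm exceeds clip_factor times the parameter norm
    for p in parameters:
        if p.grad is None:
            continue
        p_norm = p.detach().norm().clamp_min(eps)
        g_norm = p.grad.detach().norm()
        max_norm = clip_factor * p_norm
        if g_norm > max_norm:
            p.grad.mul_(max_norm / g_norm)

# usage (after loss.backward(), before optimizer.step()):
# adaptive_gradient_clipping(model.parameters())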

Gradient Centralization


Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.
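
A minimal sketch of the operation (illustration only, not the library's implementation):

def centralize_gradient(grad):
    # remove the mean over all dimensions except the first (output) dimension
    if grad.dim() > 1:
        grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    return grad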

Softplus Transformation

By running the final variance denominator through the softplus function, extremely tiny values are lifted so they stay viable.
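
A minimal sketch, applied to an Adam-style second-moment denominator (the beta value is illustrative, not a recommendation from this library):

import torch.nn.functional as F

def softplus_denominator(exp_avg_sq, beta: float = 50.0):
    # lift tiny values of sqrt(v) instead of adding a fixed epsilon
    return F.softplus(exp_avg_sq.sqrt(), beta=beta)

# the update then divides by this denominator instead of exp_avg_sq.sqrt() + eps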

Gradient Normalization

Norm Loss


Positive-Negative Momentum


Linear learning rate warmup


Stable weight decay


Explore-exploit learning rate schedule


Lookahead

k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights, which is updated and swapped in for the current weights every k lookahead steps (5 by default).
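
In pytorch-optimizer, Lookahead wraps a base optimizer (it can also be enabled via create_optimizer(..., use_lookahead=True)). A minimal sketch; the k and alpha argument names are assumptions, so check the documentation.

from pytorch_optimizer import AdamP, Lookahead

model = YourModel()
base_optimizer = AdamP(model.parameters(), lr=1e-3)
optimizer = Lookahead(base_optimizer, k=5, alpha=0.5)  # argument names assumed

# afterwards it is used like any other optimizer:
# loss.backward(); optimizer.step(); optimizer.zero_grad()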

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules.

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
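
SAM takes two forward/backward passes per update. Below is a minimal sketch following the common two-step API; the first_step / second_step method names and the constructor signature are assumptions based on typical SAM implementations, so check the documentation.

import torch

from pytorch_optimizer import SAM

model = YourModel()
criterion = torch.nn.CrossEntropyLoss()
optimizer = SAM(model.parameters(), torch.optim.SGD, lr=0.1, momentum=0.9)  # signature assumed

for inputs, targets in your_dataloader:  # placeholder data loader
    # first pass: perturb the weights toward the worst case in the neighborhood
    criterion(model(inputs), targets).backward()
    optimizer.first_step(zero_grad=True)

    # second pass: compute gradients at the perturbed weights, then update and restore
    criterion(model(inputs), targets).backward()
    optimizer.second_step(zero_grad=True)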

On the Convergence of Adam and Beyond

Convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients.

Improved bias-correction in Adam

With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.

Adaptive Gradient Norm Correction

Corrects the norm of the gradient in each iteration based on the adaptive training history of the gradient norm.

Cautious optimizer

Updates only occur when the proposed update direction aligns with the current gradient.
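
A minimal sketch of the masking rule (illustration only, not the library's implementation):

def cautious(update, grad, eps: float = 1e-8):
    # keep only the update components whose sign agrees with the gradient,
    # rescaled so the overall update magnitude is preserved on average
    mask = (update * grad > 0).to(update.dtype)
    return update * mask * (mask.numel() / (mask.sum() + eps))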

Adam-ATAN2

Adam-atan2 is a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter.
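
A minimal sketch of the replaced ratio (the paper's scaling constants are omitted; illustration only, not the library's implementation):

import torch

def atan2_ratio(exp_avg_hat, exp_avg_sq_hat):
    # bounded, epsilon-free replacement for exp_avg_hat / (exp_avg_sq_hat.sqrt() + eps)
    return torch.atan2(exp_avg_hat, exp_avg_sq_hat.sqrt())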

Frequently asked questions

here

Visualization

here

Citation

Please cite the original authors of the optimization algorithms; you can easily find them in the tables above. If you use this software, please cite it as below, or use the "Cite this repository" button on GitHub.

@software{Kim_pytorch_optimizer_optimizer_2021,
    author = {Kim, Hyeongchan},
    month = jan,
    title = {{pytorch_optimizer: optimizer & lr scheduler & loss function collections in PyTorch}},
    url = {https://github.com/kozistr/pytorch_optimizer},
    version = {3.1.0},
    year = {2021}
}

Maintainer

Hyeongchan Kim / @kozistr