Elnur Gasanov

I am a fourth-year PhD student in the Optimization and Machine Learning lab at the Artificial Intelligence Initiative at King Abdullah University of Science and Technology, advised by Professor Peter Richtárik. My research interests include distributed machine learning, stochastic optimization, and randomized linear algebra.

I earned my bachelor's degree in Applied Mathematics and Physics from the Moscow Institute of Physics and Technology.

Email  /  CV  /  LinkedIn  /  GitHub


Talks and poster presentations

  • In July 2023, at the Megadata Federated Learning course, I gave a talk on "FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning".
  • In December 2021, at the NFFL 2021 workshop, I presented a poster on "FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning" (poster).
  • In July 2021, at the "Optimization without Borders" conference, I presented our poster "FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning" (poster, photo).
  • In November 2019, at the KAUST-Tsinghua-Industry workshop, I presented a poster based on our NeurIPS 2018 paper (poster here).
  • In June 2019, at DS3, I presented our poster "A New Randomized Method for Solving Large Linear Systems".
  • In June 2019, at Laboratoire Jean Kuntzmann, I gave a talk on our new asynchronous delay-tolerant distributed algorithm.
  • In February 2018, at the Optimization and Big Data workshop, our team presented a poster on randomized linear algebra (poster).


Error Feedback Shines when Features are Rare
We provide the first proof that gradient descent (GD) with greedy sparsification (TopK) and error feedback (EF) can obtain better communication complexity than vanilla GD when solving the distributed optimization problem $\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$, where $n$ = # of clients, $d$ = # of features, and $f_1, \dots, f_n$ are smooth nonconvex functions. Despite intensive research since 2014 when EF was first proposed by Seide et al., this problem remained open until now. Perhaps surprisingly, we show that EF shines in the regime when features are rare, i.e., when each feature is present in the data owned by a small number of clients only. To illustrate our main result, we show that in order to find a random vector $\hat{x}$ such that $\|\nabla f(\hat{x})\|^2 \le \varepsilon$ in expectation, GD with the Top1 sparsifier and EF requires $O\left(\left(L + r \sqrt{\frac{c}{n} \min\left(\frac{c}{n} \max_i L_i^2, \frac{1}{n} \sum_{i=1}^n L_i^2\right)}\right) \frac{1}{\varepsilon}\right)$ bits to be communicated by each worker to the server only, where $L$ is the smoothness constant of $f$, $L_i$ is the smoothness constant of $f_i$, $c$ is the maximal number of clients owning any feature ($1 \le c \le n$), and $r$ is the maximal number of features owned by any client ($1 \le r \le d$). Clearly, the communication complexity improves as $c$ decreases (i.e., as features become more rare), and can be much better than the $O(rL/\varepsilon)$ communication complexity of GD in the same regime.
Peter Richtárik, Elnur Gasanov, Konstantin Burlachenko
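
As a rough illustration of the mechanism this abstract refers to, here is a minimal sketch of GD with a Top-K sparsifier and an error-feedback update. It assumes the EF21 form of error feedback (each client keeps a gradient estimate and sends only a compressed correction). The toy quadratic objectives, the step size, and the names `top_k` and `ef21_gd` are hypothetical choices made for the example and are not taken from the paper.

```python
def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]
    out = [0.0] * len(v)
    for j in keep:
        out[j] = v[j]
    return out


def ef21_gd(grads, x0, step, k, iters):
    """EF21-style GD (assumed variant): clients send Top-k-compressed corrections."""
    n, d = len(grads), len(x0)
    x = list(x0)
    # Each client's gradient estimate g_i, initialised with its exact gradient.
    g = [grad(x) for grad in grads]
    for _ in range(iters):
        # Server step: move along the average of the clients' estimates.
        g_avg = [sum(g_i[j] for g_i in g) / n for j in range(d)]
        x = [x[j] - step * g_avg[j] for j in range(d)]
        for i, grad in enumerate(grads):
            fresh = grad(x)
            # Only this sparse message would travel from client i to the server.
            msg = top_k([fresh[j] - g[i][j] for j in range(d)], k)
            g[i] = [g[i][j] + msg[j] for j in range(d)]
    return x


# Hypothetical toy problem: f_i(x) = 0.5 * ||x - b_i||^2 with sparse targets b_i.
targets = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
grads = [lambda x, b=b: [x[j] - b[j] for j in range(len(x))] for b in targets]
x_final = ef21_gd(grads, x0=[0.0, 0.0, 0.0], step=0.05, k=1, iters=2000)
```

With `k=1`, each client transmits a single coordinate per round, and on this toy problem the iterates still approach the average of the targets, which is the minimizer of the averaged objective.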

Understanding Progressive Training Through the Framework of Randomized Coordinate Descent
We propose a Randomized Progressive Training algorithm (RPT), a stochastic proxy for the well-known Progressive Training method (PT) (Karras et al., 2017). Originally designed to train GANs (Goodfellow et al., 2014), PT was proposed as a heuristic, with no convergence analysis even for the simplest objective functions. In contrast, to the best of our knowledge, RPT is the first PT-type algorithm with rigorous and sound theoretical guarantees for general smooth objective functions. We cast our method into the established framework of Randomized Coordinate Descent (RCD) (Nesterov, 2012; Richtárik & Takáč, 2014), for which (as a by-product of our investigations) we also propose a novel, simple and general convergence analysis encapsulating strongly-convex, convex and nonconvex objectives. We then use this framework to establish a convergence theory for RPT. Finally, we validate the effectiveness of our method through extensive computational experiments.
Rafal Szlendak, Elnur Gasanov, Peter Richtárik

Adaptive Compression for Communication-Efficient Distributed Training
We propose Adaptive Compressed Gradient Descent (AdaCGD), a novel optimization algorithm for communication-efficient training of supervised machine learning models with adaptive compression level. Our approach is inspired by the recently proposed three point compressor (3PC) framework of Richtárik et al. (2022), which includes error feedback (EF21), lazily aggregated gradient (LAG), and their combination as special cases, and offers the current state-of-the-art rates for these methods under weak assumptions. While the above mechanisms offer a fixed compression level, or adapt between two extremes only, our proposal is to perform a much finer adaptation. In particular, we allow the user to choose any number of arbitrarily chosen contractive compression mechanisms, such as Top-K sparsification with a user-defined selection of sparsification levels K, or quantization with a user-defined selection of quantization levels, or their combination. AdaCGD chooses the appropriate compressor and compression level adaptively during the optimization process. Besides i) proposing a theoretically-grounded multi-adaptive communication compression mechanism, we further ii) extend the 3PC framework to bidirectional compression, i.e., we allow the server to compress as well, and iii) provide sharp convergence bounds in the strongly convex, convex and nonconvex settings. The convex regime results are new even for several key special cases of our general mechanism, including 3PC and EF21. In all regimes, our rates are superior compared to all existing adaptive compression methods.
Maksim Makarenko, Elnur Gasanov, Rustem Islamov, Abdurakhmon Sadiev, Peter Richtárik

3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation
We propose and study a new class of gradient communication mechanisms for communication-efficient training, three point compressors (3PC), as well as efficient distributed nonconvex optimization algorithms that can take advantage of them. Unlike most established approaches, which rely on a static compressor choice (e.g., Top-K), our class allows the compressors to evolve throughout the training process, with the aim of improving the theoretical communication complexity and practical efficiency of the underlying methods. We show that our general approach can recover the recently proposed state-of-the-art error feedback mechanism EF21 (Richtárik et al., 2021) and its theoretical properties as a special case, but also leads to a number of new efficient methods. Notably, our approach allows us to improve upon the state-of-the-art in the algorithmic and theoretical foundations of the lazy aggregation literature (Chen et al., 2018). As a by-product that may be of independent interest, we provide a new and fundamental link between the lazy aggregation and error feedback literature. A special feature of our work is that we do not require the compressors to be unbiased.
Peter Richtárik, Igor Sokolov, Ilyas Fatkhullin, Elnur Gasanov, Zhize Li, Eduard Gorbunov
ICML 2022

FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning
Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.
Elnur Gasanov, Ahmed Khaled, Samuel Horváth, Peter Richtárik

Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex Decentralized Optimization over Time-Varying Networks
We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network whose links are allowed to change in time. We solve two fundamental problems for this task. First, we establish the first lower bounds on the number of decentralized communication rounds and the number of local computations required to find an ε-accurate solution. Second, we design two optimal algorithms that attain these lower bounds: (i) a variant of the recently proposed algorithm ADOM (Kovalev et al., 2021) enhanced via a multi-consensus subroutine, which is optimal in the case when access to the dual gradients is assumed, and (ii) a novel algorithm, called ADOM+, which is optimal in the case when access to the primal gradients is assumed. We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.
Dmitry Kovalev, Elnur Gasanov, Peter Richtárik, Alexander Gasnikov
NeurIPS 2021

From Local SGD to Local Fixed-Point Methods for Federated Learning
Most algorithms for solving optimization problems or finding saddle points of convex-concave functions are fixed-point algorithms. In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. Our work is motivated by the needs of federated learning. In this context, each local operator models the computations done locally on a mobile device. We investigate two strategies to achieve such a consensus: one based on a fixed number of local steps, and the other based on randomized computations. In both cases, the goal is to limit communication of the locally-computed variables, which is often the bottleneck in distributed frameworks. We perform convergence analysis of both methods and conduct a number of experiments highlighting the benefits of our approach.
Grigory Malinovsky, Dmitry Kovalev, Elnur Gasanov, Laurent Condat, Peter Richtárik
ICML 2020

Stochastic Spectral and Conjugate Descent Methods
The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra directions to be sampled from, the rate of the method improves, and interpolates between the linear rate of RCD and a linear rate independent of the condition number. We also develop and analyze inexact variants of these methods where the spectral and conjugate directions are allowed to be approximate only. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.
Dmitry Kovalev, Eduard Gorbunov, Elnur Gasanov, Peter Richtárik
NeurIPS 2018
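
For readers unfamiliar with the baseline that this paper builds on, below is a minimal sketch of plain randomized coordinate descent on a small quadratic. It shows only standard RCD with uniform sampling and exact coordinate-wise minimization, not the spectral or conjugate variants introduced in the paper. The function name `rcd_quadratic`, the matrix `A`, and the vector `b` are hypothetical and chosen for illustration.

```python
import random


def rcd_quadratic(A, b, iters, seed=0):
    """Minimise 0.5 * x^T A x - b^T x (A symmetric positive definite) by RCD."""
    rng = random.Random(seed)
    d = len(b)
    x = [0.0] * d
    for _ in range(iters):
        i = rng.randrange(d)  # sample a coordinate uniformly at random
        # Partial derivative along coordinate i: (A x - b)_i
        grad_i = sum(A[i][j] * x[j] for j in range(d)) - b[i]
        # Step size 1 / A[i][i] minimises the objective exactly along e_i
        x[i] -= grad_i / A[i][i]
    return x


# Hypothetical 2x2 system; the minimiser solves A x = b, i.e. x = (0.2, 0.6).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_rcd = rcd_quadratic(A, b, iters=500)
```

Each iteration touches a single row of `A`, which is what makes coordinate methods attractive in high dimensions.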

Creation of approximating scalogram description in a problem of movement prediction [in Russian]
The paper addresses the problem of thumb movement prediction using electrocorticographic (ECoG) activity. The task is to predict thumb positions from the voltage time series of cortical activity. Scalograms are used as input features to this regression problem. They are generated by the spatio-spectro-temporal integration of voltage time series across multiple cortical areas. To reduce the dimension of the feature space, local approximation is used: every scalogram is approximated by a parametric model. The predictions are obtained with partial least squares regression applied to the local approximation parameters. Local approximation of scalograms does not significantly lower the quality of prediction, while it efficiently reduces the dimension of the feature space.
Elnur Gasanov, Anastasia Motrenko
Journal of Machine Learning and Data Analysis (in Russian)
