# Are Loss Functions All the Same?

@article{Rosasco2004AreLF,
  title   = {Are Loss Functions All the Same?},
  author  = {Lorenzo Rosasco and Ernesto de Vito and Andrea Caponnetto and Michele Piana and Alessandro Verri},
  journal = {Neural Computation},
  year    = {2004},
  volume  = {16},
  pages   = {1063-1076}
}

In this letter, we investigate the impact of choosing different loss functions from the viewpoint of statistical learning theory. We introduce a convexity assumption, which is met by all loss functions commonly used in the literature, and study how the bound on the estimation error changes with the loss. We also derive a general result on the minimizer of the expected risk for a convex loss function in the case of classification. The main outcome of our analysis is that for classification, the…
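The convex losses the letter compares can be illustrated with a minimal sketch (the function names and the margin parameterization m = y·f(x) are illustrative, not the paper's notation): each common convex loss upper-bounds the 0-1 misclassification loss, which is what makes risk bounds for them transfer to classification error.

```python
import math

# Common margin-based losses from the literature, each evaluated at
# margin m = y * f(x). All three convex losses upper-bound the 0-1 loss.
def zero_one(m):  return 1.0 if m <= 0 else 0.0
def hinge(m):     return max(0.0, 1.0 - m)                                # SVM
def logistic(m):  return math.log(1.0 + math.exp(-m)) / math.log(2.0)    # scaled so it passes (0, 1)
def square(m):    return (1.0 - m) ** 2                                   # least squares

for m in (-2.0, -1.0, 0.0, 0.5, 2.0):
    assert hinge(m) >= zero_one(m)
    assert logistic(m) >= zero_one(m)
    assert square(m) >= zero_one(m)
```

The scaling of the logistic loss by 1/log 2 is a standard normalization so that all three surrogates agree with the 0-1 loss at m = 0.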

#### 348 Citations

Making Convex Loss Functions Robust to Outliers using $e$-Exponentiated Transformation

- Mathematics, Computer Science
- ArXiv
- 2019

A generalization error bound is derived showing that the transformed loss function has a tighter bound on datasets corrupted by outliers; empirically, the resulting accuracy can be significantly better than that obtained with the original loss function, and comparable to other state-of-the-art methods in the presence of label noise.

From Convex to Nonconvex: A Loss Function Analysis for Binary Classification

- Mathematics, Computer Science
- 2010 IEEE International Conference on Data Mining Workshops
- 2010

A new differentiable non-convex loss function is proposed, called the smoothed 0-1 loss function, which is a natural approximation of the 0-1 loss function and is robust on noisy data sets with many outliers.
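The smoothing idea can be sketched with a sigmoid approximation of the 0-1 step (a minimal sketch, not the paper's exact formulation; the steepness parameter `k` is hypothetical): the loss is differentiable everywhere, and an outlier on the wrong side of the margin contributes at most about 1, unlike convex losses that grow without bound.

```python
import math

# Sigmoid-based approximation of the 0-1 loss at margin m = y * f(x).
# As the steepness k grows, the smoothed loss approaches the 0-1 step.
def smoothed_zero_one(m, k=10.0):
    return 1.0 / (1.0 + math.exp(k * m))

# Far on the correct side, the loss is near 0; far on the wrong side,
# near 1 -- so a single large-error outlier has bounded influence.
assert smoothed_zero_one(3.0) < 0.01
assert smoothed_zero_one(-3.0) > 0.99
```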

On the α-loss Landscape in the Logistic Model

- Computer Science, Mathematics
- 2020 IEEE International Symposium on Information Theory (ISIT)
- 2020

This work studies the evolution of the optimization landscape of α-loss with respect to α, using tools drawn from the study of strictly locally quasi-convex functions in addition to geometric techniques, and interprets the results in terms of optimization complexity via normalized gradient descent.

High Dimensional Classification via Empirical Risk Minimization: Improvements and Optimality

- Mathematics, Computer Science
- ArXiv
- 2019

A family of classification algorithms defined by the principle of empirical risk minimization is investigated in the high-dimensional regime where the feature dimension $p$ and the number of data points $n$ are both large and comparable.

Cost-sensitive Multiclass Classification Risk Bounds

- Mathematics, Computer Science
- ICML
- 2013

A risk bound is developed for cost-sensitive multiclass classification, together with a convex surrogate loss that goes back to the work of Lee, Lin, and Wahba and is as easy to compute as in binary classification.

A new loss function for robust classification

- Mathematics, Computer Science
- Intell. Data Anal.
- 2014

The experimental results show that the proposed smoothed 0-1 loss function outperforms several existing loss functions when classifying data sets with noisy labels, noisy features, and outliers.

Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

- Computer Science, Mathematics
- AISTATS
- 2019

An exponential convergence rate of the expected classification error in the final phase of stochastic gradient descent is shown for a wide class of differentiable convex loss functions under similar assumptions.

Learning with Non-Convex Truncated Losses by SGD

- Computer Science, Mathematics
- UAI
- 2019

A family of objective functions formed by truncating traditional loss functions, applicable to both shallow and deep learning, is studied; excess risk bounds of empirical risk minimization with truncated losses are established for heavy-tailed outputs, along with the statistical error of an approximate stationary point found by the stochastic gradient descent (SGD) method.
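The truncation idea above can be sketched in a few lines (a minimal sketch under assumed notation; the threshold `tau` is a hypothetical hyperparameter, and the UAI paper studies a whole family of such constructions): capping a traditional convex loss bounds the contribution of any single heavy-tailed outlier to the empirical risk.

```python
# Cap a traditional convex loss at a threshold tau, turning it into a
# bounded (hence non-convex) loss that is robust to extreme outliers.
def hinge(m):
    return max(0.0, 1.0 - m)

def truncated(loss, tau):
    return lambda m: min(loss(m), tau)

trunc_hinge = truncated(hinge, tau=2.0)
assert trunc_hinge(-100.0) == 2.0       # outlier's contribution is capped at tau
assert trunc_hinge(0.5) == hinge(0.5)   # small losses are unchanged
```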

Model selection and error estimation without the agonizing pain

- Computer Science
- Wiley Interdiscip. Rev. Data Min. Knowl. Discov.
- 2018

The purpose of this review is to give an intelligible overview of the problems of model selection and error estimation, by focusing on the ideas behind the different SLT-based approaches and simplifying most of the technical aspects with the purpose of making them more accessible and usable in practice.

#### References

Showing 1–10 of 26 references

Statistical behavior and consistency of classification methods based on convex risk minimization

- Mathematics
- 2003

We study how closely the optimal Bayes error rate can be approached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification…

Statistical Properties and Adaptive Tuning of Support Vector Machines

- Mathematics, Computer Science
- Machine Learning
- 2004

An approach to adaptively tuning the smoothing parameter(s) in the SVMs is described, based on the generalized approximate cross validation (GACV), which is an easily computable proxy of the GCKL.

Scale-sensitive dimensions, uniform convergence, and learnability

- Mathematics, Computer Science
- JACM
- 1997

A characterization of learnability in the probabilistic concept model is given, solving an open problem posed by Kearns and Schapire; the accuracy parameter is shown to play a crucial role in determining the effective complexity of the learner's hypothesis class.

The covering number in learning theory

- Computer Science, Mathematics
- J. Complex.
- 2002

This work gives estimates for the covering number of a ball of a reproducing kernel Hilbert space, as a subset of the space of continuous functions, in terms of the regularity of the Mercer kernel K, and provides an example of a Mercer kernel showing that $L_K^{1/2}$ may not be generated by a Mercer kernel.

On the mathematical foundations of learning

- Computer Science
- 2001

A main theme of this report is the relationship of approximation to learning and the primary role of sampling (inductive inference). We try to emphasize relations of the theory of learning to the…

A note on different covering numbers in learning theory

- Computer Science, Mathematics
- J. Complex.
- 2003

It is formally shown that when F is made of smooth functions, the discrete covering number is close to its continuous counterpart, and illustrated in the case that F is a ball in a reproducing kernel Hilbert space.

Regularization Theory and Neural Networks Architectures

- Computer Science, Mathematics
- Neural Computation
- 1995

This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks, and introduces new classes of smoothness functionals that lead to different classes of basis functions.

On the Bayes-risk consistency of regularized boosting methods

- Mathematics
- 2003

The probability of error of classification methods based on convex combinations of simple base classifiers by boosting algorithms is investigated. The main result of the paper is that certain…

The Elements of Statistical Learning

- Computer Science, Mathematics
- Technometrics
- 2003

Chapter 11 includes more case studies in other areas, ranging from manufacturing to marketing research, and a detailed comparison with other diagnostic tools, such as logistic regression and tree-based methods.

SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming

- Mathematics, Computer Science
- Neural Computation
- 2005

This article shows that the convergence behavior of the linear programming SVM is almost the same as that of the quadratic programming SVM, and proposes an upper bound on the misclassification error for general probability distributions.