Evaluation metrics tell us whether a model is useful in the real world. Loss functions tell the optimizer how to change the model parameters. The two are inextricably linked, but they are not identical.
Evaluation metrics such as accuracy and F1, and threshold-sweeping metrics such as ROC AUC or PR AUC, define our ultimate goals, but they are discrete and piecewise constant: their gradient with respect to the model parameters is zero almost everywhere and undefined at the jumps. Backpropagation gets no useful signal from a sorting operation or a boolean match. A loss function is, therefore, a differentiable proxy: it converts the desired system behavior into a continuous objective through which gradients can flow.
Understanding the behavior, limits, and mathematical assumptions of these proxy interfaces is critical for building robust ML systems.
1. Logits, Probabilities, and Numerical Stability
Before discussing specific loss functions, we must clarify the data types that flow into them. Many practitioners casually conflate logits with probabilities, leading to silent numerical failures in production.
Logits
Logits are the raw, unnormalized scores produced by the final linear layer of a neural network. They are not probabilities; they can take any real value in $(-\infty, \infty)$.
For example:
logits = [2.1, -0.3, 5.7]
Probabilities
To convert these unbounded scores into a valid probability distribution, we apply a transformation function:
- Binary classification: logit → sigmoid → probability
- Multi-class classification: logits vector → softmax → probability distribution
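The two conversions above can be sketched in plain Python. This is a minimal illustration, not a framework implementation: the function names are hypothetical, and the softmax subtracts the maximum logit before exponentiating, a standard guard against overflow.

```python
import math

def sigmoid(x: float) -> float:
    """Map a single logit to a probability in [0, 1]."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(logits: list[float]) -> list[float]:
    """Map a vector of logits to a probability distribution."""
    # Subtracting the maximum leaves the result unchanged
    # but keeps every exponent at or below zero.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, -0.3, 5.7])
print(probs)         # three positive values that sum to 1
print(sigmoid(0.0))  # 0.5
```

The largest logit (5.7) receives the largest probability, and the ordering of the logits is preserved in the output.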
The Numerical Stability Problem
The mathematical definition of softmax involves exponentiating the logits. If a model is highly confident and outputs a large logit (e.g., $x = 1000$), calculating $e^{1000}$ overflows 64-bit floating point, whose largest finite value is roughly $e^{709.78}$. The exponential becomes infinity, and the normalization step then evaluates inf / inf, resulting in NaN.
To solve this, production systems shift all input values by subtracting the maximum logit. The highest value becomes exactly 0 ($e^0 = 1$), guaranteeing that the exponentials cannot overflow. The shift does not change the resulting probability distribution, because the common factor cancels between numerator and denominator.
2. Binary Cross-Entropy and Log Loss
For binary classification tasks (e.g., click-through rate prediction, spam detection), the target is $y \in \{0, 1\}$ and the model predicts a probability $p$.
The theoretical Binary Cross-Entropy (BCE) formula is:
$$L_{\text{BCE}} = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$$
Intuition:
- If the true label is 1, the right half of the equation disappears. The loss becomes $-\log p$. If the model predicts $p \to 1$, the loss approaches 0. If it predicts $p \to 0$, the penalty grows without bound.
- It ruthlessly penalizes "confident but wrong" predictions.
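To make the intuition concrete, here is a small sketch that evaluates the textbook formula for a positive label at three predicted probabilities. The helper name `bce` is hypothetical, and it deliberately takes a probability rather than a logit, so it is only meant for illustration.

```python
import math

def bce(y: int, p: float) -> float:
    """Binary cross-entropy for one sample, given a probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

for p in (0.99, 0.5, 0.01):
    print(f"y=1, p={p}: loss={bce(1, p):.3f}")
# y=1, p=0.99: loss=0.010
# y=1, p=0.5: loss=0.693
# y=1, p=0.01: loss=4.605
```

Moving from a confident correct prediction to a confident wrong one increases the loss by a factor of several hundred.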
Engineering Caveat: Fusing Sigmoid and BCE
In production, computing $p = \sigma(x)$ and then passing it to a log function is an anti-pattern. If the logit x is highly negative, p rounds down to exactly 0.0, so the loss evaluates log(0), which yields -inf or raises an error depending on the library. Frameworks therefore provide BCE computed directly from raw logits using a mathematically simplified form:
$$L_{\text{BCE}} = \max(x, 0) - x\,y + \log\left(1 + e^{-|x|}\right)$$
(Note: "Log Loss" is simply the Binary Cross-Entropy averaged over an entire dataset or batch of samples.)
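A minimal sketch of BCE computed straight from the logit, assuming the widely used rearrangement max(x, 0) - x*y + log(1 + exp(-|x|)). The function name `bce_with_logits` is illustrative and not tied to any particular framework.

```python
import math

def bce_with_logits(x: float, y: float) -> float:
    """BCE from a raw logit x and a label y in {0, 1}.

    The exponent -|x| is never positive, so exp cannot overflow,
    and log1p(...) never receives a value that would produce log(0).
    """
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

print(bce_with_logits(-800.0, 1.0))  # 800.0
print(bce_with_logits(0.0, 1.0))     # log(2): the model is undecided
```

With x = -800 and y = 1, a sigmoid-then-log pipeline would fail, while the logit form returns a large but finite loss that still carries a usable gradient.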
3. Softmax + Cross-Entropy for Multi-Class Classification
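In multi-class classification, a model outputs a vector of logits (unnormalized scores), one for each of the K classes. To convert these scores into probabilities, we conceptually apply softmax, and then compute the negative log-likelihood of the true class:
$$L_{\text{CE}} = -\log \frac{e^{z_y}}{\sum_{j=1}^{K} e^{z_j}}$$
where:
- $z_j$ is the logit for class $j$
- $y$ is the index of the true class
- $K$ is the number of classes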
Engineering Caveat: The Log-Sum-Exp Trick
Similar to BCE, computing softmax probabilities explicitly before applying the logarithm is numerically unstable. A naive implementation of the log-sum-exp term, log(sum(exp(x))), can easily overflow when logits are large (e.g., exp(1000)).
To improve numerical stability, we apply the log-sum-exp trick by subtracting the maximum logit before exponentiation. This prevents overflow in the exponential computation.
However, even with the maximum subtracted, explicitly computing probabilities can still underflow when a logit lies far below the maximum. In such cases, the probability may become exactly zero, resulting in log(0) during cross-entropy computation and producing an infinite loss (-log(0) = +inf).
For this reason, modern implementations avoid explicit softmax and instead compute cross-entropy directly from the logits $z$ and the true class index $y$ using a fused log-sum-exp formulation:
$$L_{\text{CE}} = -z_y + \log \sum_{j=1}^{K} e^{z_j} = -(z_y - z_{\max}) + \log \sum_{j=1}^{K} e^{z_j - z_{\max}}, \quad z_{\max} = \max_j z_j$$
This is equivalent to:
- Compute softmax probabilities: $p = \text{softmax}(z)$
- Select probability of true class: $p_y$
- Compute loss: $-\log p_y$
But it avoids explicitly forming $p$, which improves numerical stability.
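The fused computation can be sketched in a few lines of plain Python. The helper names `logsumexp` and `cross_entropy_from_logits` are chosen for illustration; real frameworks implement the same idea in vectorized form.

```python
import math

def logsumexp(logits: list[float]) -> float:
    """log(sum(exp(z))) with the maximum factored out for stability."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy_from_logits(logits: list[float], target: int) -> float:
    """-log softmax(logits)[target], without ever forming the probabilities."""
    return logsumexp(logits) - logits[target]

# The true class has a probability that underflows to zero,
# yet the loss stays finite.
print(cross_entropy_from_logits([1000.0, 0.0, -1000.0], 2))  # 2000.0
```

An explicit softmax on the same input would assign the true class a probability of exactly 0.0 in floating point, and the logarithm would fail.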
Rule of Thumb
- Always train using logits, not probabilities. Only convert to probabilities for human readability, thresholding, or downstream business logic.
- Softmax is conceptually part of the model, but fused into the loss function.
- Frameworks (PyTorch, TensorFlow) implement this as a single optimized kernel: log-softmax + NLL loss (or equivalent fused CE).
4. Ranking Loss
Standard cross-entropy optimizes "which category does this item belong to?" In search and recommendation systems, the objective shifts. We do not care about absolute probabilities; we care about "which item should rank higher?"
Pairwise ranking loss takes two items—a positive (relevant) item and a negative (irrelevant) item—and optimizes the relative difference in their scores. Pairwise loss does not ask "is this item relevant?" It asks "is the relevant item scored higher than the irrelevant one?"
Pairwise Hinge Loss
The model must score the positive item higher than the negative item by at least a specified margin $m$:
$$L_{\text{hinge}} = \max\left(0,\; m - (s_{\text{pos}} - s_{\text{neg}})\right)$$
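A minimal sketch of the margin requirement, using the `s_pos` / `s_neg` naming that appears later in this section. The default margin of 1.0 is an arbitrary choice for the example.

```python
def pairwise_hinge(s_pos: float, s_neg: float, margin: float = 1.0) -> float:
    """Zero once the positive outscores the negative by at least `margin`."""
    return max(0.0, margin - (s_pos - s_neg))

print(pairwise_hinge(2.0, 0.5))  # 0.0  (gap of 1.5 satisfies the margin)
print(pairwise_hinge(1.0, 0.5))  # 0.5  (correct order, but gap too small)
print(pairwise_hinge(0.5, 1.0))  # 1.5  (wrong order)
```

Note that the second pair is already ranked correctly and still incurs a loss, because the gap is smaller than the margin.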
Bayesian Personalized Ranking (BPR)
Widely used in recommendation systems, BPR treats the score difference as a logistic classification problem. Its per-pair loss is:
$$L_{\text{BPR}} = -\log \sigma(s_{\text{pos}} - s_{\text{neg}})$$
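A minimal sketch of the per-pair BPR term. It relies on the identity -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a form that stays finite for extreme score differences, which is the stability concern raised in the caveat that follows. The function names are illustrative.

```python
import math

def softplus(x: float) -> float:
    """log(1 + exp(x)), written so exp never sees a large positive input."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bpr_loss(s_pos: float, s_neg: float) -> float:
    """-log(sigmoid(s_pos - s_neg)), computed as softplus(-(s_pos - s_neg))."""
    return softplus(-(s_pos - s_neg))

print(bpr_loss(3.0, 1.0))         # small loss: positive is already ahead
print(bpr_loss(-1000.0, 1000.0))  # 2000.0, finite despite the extreme gap
```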
Engineering Caveat: Softplus Implementation
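While conceptually simple, a naive -math.log(sigmoid(s_pos - s_neg)) still fails when the difference is heavily negative: either the exponential inside the sigmoid overflows, or the sigmoid underflows to 0.0 and the logarithm receives zero. Production frameworks replace this with the Softplus function: $-\log \sigma(x) = \text{softplus}(-x) = \log(1 + e^{-x})$, evaluated in a numerically stable form.
5. Contrastive Loss and Representation Learning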
In representation learning (e.g., training embedding models or Siamese networks), the goal is to map data into a continuous vector space where semantic similarity aligns with geometric proximity.
- Similar pairs should be pulled closer together.
- Dissimilar pairs should be pushed further apart.
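The pull and push behaviour can be sketched as follows. This assumes the common squared-distance form of the loss with y=1 for similar pairs, Euclidean distance, and a hypothetical default margin of 1.0; other libraries may scale the terms or flip the label convention.

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(a: list[float], b: list[float], y: int,
                     margin: float = 1.0) -> float:
    """y=1: penalize distance. y=0: penalize being closer than `margin`."""
    d = euclidean(a, b)
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1))  # 25.0 (similar but far apart)
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))  # 0.0  (dissimilar and far apart)
print(contrastive_loss([0.0, 0.0], [0.6, 0.0], y=0))  # > 0  (dissimilar, inside margin)
```

Dissimilar pairs stop contributing once they are further apart than the margin, so the margin controls how much separation the model is asked to produce.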
A common margin-based form of the loss is:
$$L = y\, d^2 + (1 - y)\, \max(0,\; m - d)^2$$
(Where d is the Euclidean distance between the two embeddings, m is the margin, and y=1 marks a similar pair.)
Engineering Caveat: While we define y=1 as a positive/similar pair here, some frameworks (and older metric learning literature) use the exact opposite convention (y=0 for similar, y=1 for dissimilar). Always verify the library's label convention before deploying.
6. InfoNCE: Turning Retrieval into Classification
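InfoNCE (named after Noise-Contrastive Estimation) is the foundational loss function behind many modern embedding models, contrastive learning methods (e.g., CLIP), and the dense retrievers used in RAG systems.
Imagine a batch containing one query embedding, one positive candidate embedding, and several negative candidate embeddings. InfoNCE reframes this retrieval task as an N-way softmax classification problem: can the model identify the one true positive among the N candidates (one positive plus N-1 negatives)?
$$L_{\text{InfoNCE}} = -\log \frac{\exp\left(\text{sim}(q, k^{+}) / \tau\right)}{\sum_{i=1}^{N} \exp\left(\text{sim}(q, k_i) / \tau\right)}$$
where $q$ is the query embedding, $k^{+}$ is the positive candidate, $k_1, \dots, k_N$ are all candidates including the positive, and $\tau$ is the temperature.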
The Role of Temperature ($\tau$)
Temperature scales the similarities before the softmax operation, dividing each similarity score by $\tau$:
- Lower temperature (e.g., 0.05): Makes the softmax distribution sharper. The loss heavily penalizes "hard negatives" (irrelevant items that the model thought were highly similar to the query).
- Higher temperature (e.g., 1.0): Flattens the distribution, providing a smoother, more forgiving gradient signal during early training.
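A minimal single-query sketch of InfoNCE with temperature. It assumes dot-product similarity on toy 2-D embeddings and places the positive at index 0; the function name, the embeddings, and the default temperature are all illustrative.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature: float = 0.05) -> float:
    """Cross-entropy over [positive] + negatives, with the positive as the target."""
    sims = [dot(query, positive)] + [dot(query, n) for n in negatives]
    logits = [s / temperature for s in sims]
    # Stable log-sum-exp, then subtract the positive's logit.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[0]

q = [1.0, 0.0]
pos = [0.9, 0.1]
negs = [[0.8, 0.2], [0.0, 1.0]]  # one hard negative, one easy negative

for t in (0.05, 1.0):
    print(f"temperature={t}: loss={info_nce(q, pos, negs, t):.3f}")
```

In batched training, the negatives for each query are often simply the positives of the other queries in the batch, which is why batch size matters for this loss.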
7. Loss Reduction
Loss is typically defined per sample, but neural networks update their parameters based on batches. How you aggregate (or reduce) these individual losses is a critical hyperparameter, because it sets the scale of the gradient and therefore interacts with the learning rate.
- Mean: The standard default. It keeps the gradient scale stable across varying batch sizes.
- Sum: Gradients scale linearly with batch size, requiring careful learning rate adjustments if the batch size changes during training.
- Weighted Mean: Crucial for imbalanced datasets or search relevance. Weighting samples (e.g., giving the minority class a higher multiplier, or weighting queries by their historical traffic volume) can alter the optimization landscape just as drastically as rewriting the loss formula itself.
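The three reductions can be compared on a toy batch. This sketch assumes the weighted mean is normalized by the sum of the weights, which is one common convention rather than the only one; the function name and the weights are illustrative.

```python
def reduce_loss(losses, reduction="mean", weights=None):
    """Aggregate per-sample losses into a single scalar."""
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    if reduction == "weighted_mean":
        if weights is None:
            raise ValueError("weighted_mean requires weights")
        total = sum(w * l for w, l in zip(weights, losses))
        return total / sum(weights)
    raise ValueError(f"unknown reduction: {reduction}")

losses = [0.5, 1.0, 4.5]  # the third sample belongs to a rare class
print(reduce_loss(losses, "sum"))                                     # 6.0
print(reduce_loss(losses, "mean"))                                    # 2.0
print(reduce_loss(losses, "weighted_mean", weights=[1.0, 1.0, 4.0]))  # 3.25
```

Doubling the batch with identical samples would double the sum but leave both means unchanged, which is why the sum reduction couples the learning rate to the batch size.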
8. Loss vs. Metric
A common engineering trap is watching the training loss smoothly decrease while the business metric flatlines. This occurs because the loss is only a differentiable proxy. A model might improve its Pairwise Hinge Loss by pushing apart pairs that are already correctly ordered but still inside the margin, which lowers the loss but changes absolutely nothing about the final sorting order evaluated by NDCG.
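A toy illustration of this gap, assuming a pairwise hinge loss with margin 1.0 and two hypothetical items. The scores are invented for the example.

```python
def pairwise_hinge(s_pos: float, s_neg: float, margin: float = 1.0) -> float:
    return max(0.0, margin - (s_pos - s_neg))

def ranking(scores: dict) -> list:
    """Item names sorted by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)

before = {"relevant": 1.2, "irrelevant": 1.0}
after = {"relevant": 1.8, "irrelevant": 1.0}  # after more training steps

loss_before = pairwise_hinge(before["relevant"], before["irrelevant"])
loss_after = pairwise_hinge(after["relevant"], after["irrelevant"])

print(round(loss_before, 3), ranking(before))  # 0.8 ['relevant', 'irrelevant']
print(round(loss_after, 3), ranking(after))    # 0.2 ['relevant', 'irrelevant']
```

The loss drops by a factor of four while the ranking, and therefore any ranking metric computed from it, is identical.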
Metrics define what we care about. Loss functions define what gradients can optimize. Aligning the two is the core challenge of system design.
| Loss | Optimizes For | Typical Use Case | Key Engineering Caveat |
| --- | --- | --- | --- |
| BCE / Log Loss | Binary probability estimation | CTR, fraud detection, medical risk | Sensitive to label noise; good calibration still requires validation. |
| Softmax CE | Correct class probability | Classification, language modeling | Requires fused log-sum-exp implementation to avoid overflow / underflow. |
| Pairwise Hinge / BPR | Relative ordering | Ranking, recommendation | Highly dependent on the quality of negative sampling strategy. |
| Contrastive Loss | Embedding distance | Metric learning, Siamese nets | The margin choice dictates the density of the embedding space. |
| InfoNCE | Positive identification among negatives | Dense retrieval, representation learning | Batch size of negatives and temperature scale heavily influence convergence. |
- Author: Fan Luo
- URL: https://fanluo.me/article/essential-loss-functions-for-machine-learning
- Copyright: All articles on this blog are licensed under BY-NC-SA. Please credit the source.
