Essential Loss Functions for Machine Learning

Evaluation metrics tell us whether a model is useful in the real world. Loss functions tell the optimizer how to change the model parameters. The two are inextricably linked, but they are not identical.

Business metrics such as Accuracy and F1, and threshold-sweeping metrics such as ROC AUC or PR AUC, define our ultimate goals, but they are discrete, step-wise, and non-differentiable: their gradient is zero almost everywhere. A neural network cannot propagate useful gradients through a sorting operation or a boolean match. A loss function is therefore a differentiable proxy: it converts the desired system behavior into a smooth, continuous objective through which gradients can flow.

Understanding the behavior, limits, and mathematical assumptions of these proxy interfaces is critical for building robust ML systems.

1. Logits, Probabilities, and Numerical Stability

Before discussing specific loss functions, we must clarify the data types that flow into them. Many practitioners casually conflate logits with probabilities, leading to silent numerical failures in production.

Logits

Logits are the raw, unnormalized scores produced by the final linear layer of a neural network. They are not probabilities; they can take any real value, $x \in (-\infty, +\infty)$.

For example: logits = [2.1, -0.3, 5.7]

Probabilities

To convert these unbounded scores into a valid probability distribution, we apply a transformation function:

Binary classification: logit → sigmoid → probability

Multi-class classification: logits vector → softmax → probability distribution

The Numerical Stability Problem

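The mathematical definition of softmax, $p_i = e^{x_i} / \sum_j e^{x_j}$, exponentiates the logits. If a model is highly confident and outputs a large logit (e.g., $x = 1000$), calculating $e^{1000}$ overflows a 64-bit float to infinity, and the subsequent division $\infty / \infty$ yields NaN (some runtimes raise an overflow error instead).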

To solve this, production systems shift all input values by subtracting the maximum logit. The highest value becomes exactly 0 ($e^0 = 1$), so no exponent can overflow, and the resulting probability distribution is unchanged because the common factor cancels between numerator and denominator.
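The max-subtraction shift described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a framework implementation; the function name `softmax` and the sample inputs are chosen here for the example.

```python
import math

def softmax(logits):
    """Softmax with the max-subtraction shift for numerical stability."""
    m = max(logits)
    # After the shift every exponent is <= 0, so math.exp cannot overflow.
    exps = [math.exp(x - m) for x in logits]
    # The sum is at least 1.0 because the largest logit contributes exp(0).
    total = sum(exps)
    return [e / total for e in exps]

# The example logits from the text.
probs = softmax([2.1, -0.3, 5.7])

# A naive math.exp(1000.0) raises OverflowError; the shifted version does not.
big = softmax([1000.0, 999.0, 0.0])
```

Note that the smallest entry of `big` underflows to exactly 0.0, which is harmless for reading off probabilities but matters as soon as a logarithm is applied to it.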

2. Binary Cross-Entropy and Log Loss

For binary classification tasks (e.g., click-through rate prediction, spam detection), the target is $y \in \{0, 1\}$ and the model predicts a probability $p$.

The theoretical Binary Cross-Entropy (BCE) formula is:

$$L = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$$

Intuition:

If the true label is 1, the right half of the equation disappears. The loss becomes $-\log(p)$. If the model predicts $p \to 1$, the loss approaches 0. If it predicts $p \to 0$, the penalty grows without bound.

It ruthlessly penalizes "confident but wrong" predictions.

Engineering Caveat: Fusing Sigmoid and BCE

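In production, computing $p = \sigma(x)$ and then passing it to a log-based loss function is an anti-pattern. If x is highly negative, p rounds down to exactly 0.0, triggering a log(0) error (an exception or an infinite loss, depending on the library). The same failure occurs for $\log(1 - p)$ when x is highly positive. Mainstream frameworks therefore provide losses that compute BCE directly from raw logits using a mathematically simplified form:

$$L = \max(x, 0) - x\,y + \log\left(1 + e^{-|x|}\right)$$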
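As a rough sketch of what "computing BCE directly from logits" means, the snippet below uses the widely documented identity $\max(x, 0) - x y + \log(1 + e^{-|x|})$. The function name `bce_with_logits` is hypothetical and only mirrors the naming style of framework losses; it is not any library's actual implementation.

```python
import math

def bce_with_logits(x, y):
    """Binary cross-entropy from a raw logit x and a label y in {0, 1}.

    The expression only ever exponentiates a non-positive number and only
    takes the log of a value >= 1, so it neither overflows nor hits log(0).
    """
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# Confidently wrong: large negative logit with a positive label.
wrong = bce_with_logits(-1000.0, 1)

# Confidently right: large positive logit with a positive label.
right = bce_with_logits(1000.0, 1)
```

For moderate logits the result agrees with the textbook formula; for extreme logits the loss grows linearly in |x| instead of failing.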

(Note: "Log Loss" is simply the Binary Cross-Entropy averaged over an entire dataset or batch of samples.)

3. Softmax + Cross-Entropy for Multi-Class Classification

In multi-class classification, a model outputs a vector of K logits (unnormalized scores), one for each of the K classes. To convert these scores into probabilities, we conceptually apply softmax, and then compute the negative log-likelihood of the true class $y$:

$$L = -\log(p_y)$$

where:

$$p_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$

Engineering Caveat: The Log-Sum-Exp Trick

Similar to BCE, computing softmax probabilities explicitly before applying the logarithm is numerically unstable. A naive implementation of the log-sum-exp term, log(sum(exp(x))), can easily overflow when logits are large (e.g., exp(1000)).

To improve numerical stability, we apply the log-sum-exp trick by subtracting the maximum logit $m = \max_j x_j$ before exponentiation and adding it back afterwards: $\log \sum_j e^{x_j} = m + \log \sum_j e^{x_j - m}$. This prevents overflow in the exponential computation.

However, even with the maximum subtracted, directly computing probabilities can still lead to numerical underflow when a logit is far below the maximum. In such cases, a probability may become exactly zero, resulting in log(0) during cross-entropy computation and producing an infinite loss.

For this reason, modern implementations avoid explicit softmax and instead compute cross-entropy directly from logits using a fused log-sum-exp formulation:

$$L = -x_y + \log \sum_{j=1}^{K} e^{x_j}$$

where $x_y$ is the logit of the true class. This is equivalent to:

Compute softmax probabilities: $p_i = e^{x_i} / \sum_{j} e^{x_j}$

Select probability of true class: $p_y$

Compute loss: $L = -\log(p_y)$

But it avoids explicitly forming $p$, which improves numerical stability.
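The fused formulation can be illustrated with a short standard-library sketch. The helper names `log_sum_exp` and `cross_entropy_from_logits` are made up for this example; real frameworks implement the same idea in optimized, batched form.

```python
import math

def log_sum_exp(logits):
    """log(sum(exp(x))) computed with the max-subtraction trick."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one sample: -x[target] + logsumexp(x).

    No probability vector is ever formed, so there is no log(0) to hit.
    """
    return log_sum_exp(logits) - logits[target]

# The true class has a logit far below the maximum. An explicit softmax would
# underflow its probability to 0.0; the fused form returns a finite loss.
loss_hard = cross_entropy_from_logits([1000.0, 0.0], target=1)
```

Here the loss is large but finite, so the sample still yields a usable training signal.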

Rule of Thumb

Always train using logits, not probabilities.

Only convert to probabilities for human readability, thresholding, or downstream business logic.

Softmax is conceptually part of the model, but in practice it is fused into the loss function.

Frameworks (PyTorch, TensorFlow) implement this as a single fused operation: log-softmax + NLL loss (or an equivalent fused cross-entropy).

4. Ranking Loss

Standard cross-entropy optimizes "which category does this item belong to?" In search and recommendation systems, the objective shifts. We do not care about absolute probabilities; we care about "which item should rank higher?"

Pairwise ranking loss takes two items—a positive (relevant) item and a negative (irrelevant) item—and optimizes the relative difference in their scores. Pairwise loss does not ask "is this item relevant?" It asks "is the relevant item scored higher than the irrelevant one?"

Pairwise Hinge Loss

The model must score the positive item higher than the negative item by at least a specified margin $m$:

$$L = \max\left(0,\; m - (s_{pos} - s_{neg})\right)$$

Bayesian Personalized Ranking (BPR)

Widely used in recommendation systems, BPR treats the score difference as a logistic classification problem:

$$L = -\log \sigma(s_{pos} - s_{neg})$$

Engineering Caveat: Softplus Implementation

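While conceptually simple, a naive -math.log(sigmoid(s_pos - s_neg)) will still fail if the difference is heavily negative: the sigmoid underflows to 0.0 (or its internal exponential overflows), and log(0) is undefined. Production frameworks replace this with the Softplus function: $L = \text{softplus}(-(s_{pos} - s_{neg})) = \log\left(1 + e^{-(s_{pos} - s_{neg})}\right)$, evaluated in a numerically stable form.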
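Both pairwise losses fit in a few lines. The sketch below assumes scalar scores `s_pos` and `s_neg` and a default margin of 1.0 (an arbitrary choice for illustration); the `softplus` helper is written out by hand in a stable form rather than taken from any library.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x)): never exponentiates a positive number."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pairwise_hinge(s_pos, s_neg, margin=1.0):
    """Zero once the positive item outscores the negative by the margin."""
    return max(0.0, margin - (s_pos - s_neg))

def bpr_loss(s_pos, s_neg):
    """-log(sigmoid(s_pos - s_neg)), rewritten as softplus(-(s_pos - s_neg))."""
    return softplus(-(s_pos - s_neg))

# A badly mis-ranked pair: the naive sigmoid-then-log version would fail here.
bad_pair = bpr_loss(-1000.0, 0.0)
```

One design difference worth noting: the hinge loss is exactly zero for pairs that already satisfy the margin, while BPR keeps a small positive loss (and gradient) for every pair.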

5. Contrastive Loss and Representation Learning

In representation learning (e.g., training embedding models or Siamese networks), the goal is to map data into a continuous vector space where semantic similarity aligns with geometric proximity.

Similar pairs should be pulled closer together.

Dissimilar pairs should be pushed further apart.

$$L = y\, d^2 + (1 - y)\, \max(0,\; m - d)^2$$

(where d is the Euclidean distance between the two embeddings and m is the margin).

Engineering Caveat: While we define y=1 as a positive/similar pair here, some frameworks (and older metric learning literature) use the exact opposite convention (y=0 for similar, y=1 for dissimilar). Always verify the library's label convention before deploying.
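A minimal sketch of the pull/push behavior, assuming one common form of the loss, $y\,d^2 + (1 - y)\max(0, m - d)^2$, with y = 1 for similar pairs as in this article. Some references add a factor of 1/2; the function names and the default margin of 1.0 are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def contrastive_loss(a, b, y, margin=1.0):
    """y = 1: similar pair (pulled together); y = 0: dissimilar (pushed apart)."""
    d = euclidean(a, b)
    pull = y * d ** 2                              # penalize distance for similar pairs
    push = (1 - y) * max(0.0, margin - d) ** 2     # penalize closeness inside the margin
    return pull + push

# A dissimilar pair that is already farther apart than the margin costs nothing.
far_apart = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0, margin=1.0)
```

If a library uses the opposite label convention, swapping `y` and `1 - y` in the two terms is the only change needed.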

6. InfoNCE: Turning Retrieval into Classification

InfoNCE (a loss derived from Noise-Contrastive Estimation) is the foundational loss function powering modern embedding models, contrastive learning (e.g., CLIP), and the dense retrieval systems behind RAG pipelines.

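Imagine a batch containing one query embedding $q$, one positive candidate embedding $k^+$, and several negative candidate embeddings. InfoNCE reframes this retrieval task as an N-way softmax classification problem over the N candidates $k_1, \dots, k_N$ (the positive plus the negatives): can the model identify the one true positive among the negatives?

$$L = -\log \frac{\exp(\text{sim}(q, k^+) / \tau)}{\sum_{i=1}^{N} \exp(\text{sim}(q, k_i) / \tau)}$$

where sim is a similarity score such as cosine similarity and $\tau$ is the temperature.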

The Role of Temperature ($\tau$)

Temperature scales the similarities before the softmax operation:

Lower temperature (e.g., 0.05): Makes the softmax distribution sharper. The loss heavily penalizes "hard negatives" (irrelevant items that the model thought were highly similar to the query).

Higher temperature (e.g., 1.0): Flattens the distribution, providing a smoother, more forgiving gradient signal during early training.
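The effect of temperature is easy to see in a small single-query sketch. It assumes cosine similarity as the similarity function and places the positive at index 0 of the candidate list; both are choices made for this example, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.1):
    """Cross-entropy over [positive] + negatives, with the positive as class 0."""
    q = normalize(query)
    candidates = [positive] + list(negatives)
    # Cosine similarities divided by the temperature act as logits.
    logits = [dot(q, normalize(c)) / temperature for c in candidates]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]

# A hard negative: slightly MORE similar to the query than the positive is.
query, positive, hard_negative = [1.0, 0.0], [0.8, 0.6], [0.9, 0.436]
sharp = info_nce(query, positive, [hard_negative], temperature=0.05)
smooth = info_nce(query, positive, [hard_negative], temperature=1.0)
```

With the hard negative in place, `sharp` comes out noticeably larger than `smooth`, which matches the description above: a low temperature amplifies the penalty for negatives that score close to or above the positive.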

7. Loss Reduction

Loss is typically computed per sample, but neural networks update their parameters based on batches. How you aggregate (or reduce) these individual losses effectively acts as a hyperparameter.

Mean: The standard default. It keeps the gradient scale stable across varying batch sizes.

Sum: Gradients scale linearly with batch size, requiring careful learning rate adjustments if the batch size changes during training.

Weighted Mean: Crucial for imbalanced datasets or search relevance. Weighting samples (e.g., giving the minority class a higher multiplier, or weighting queries by their historical traffic volume) can alter the optimization landscape just as drastically as rewriting the loss formula itself.
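The three reductions above differ only in how the per-sample values are combined. The following sketch uses a hypothetical `reduce_losses` helper and made-up loss values; the weighted mean is normalized by the sum of the weights, which is one common convention among several.

```python
def reduce_losses(losses, reduction="mean", weights=None):
    """Aggregate per-sample losses into one scalar for the optimizer."""
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    if reduction == "weighted_mean":
        if weights is None or len(weights) != len(losses):
            raise ValueError("weighted_mean needs one weight per loss")
        return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
    raise ValueError(f"unknown reduction: {reduction}")

per_sample = [0.2, 0.4, 1.8]   # illustrative values; the last sample is a minority-class example

mean_loss = reduce_losses(per_sample, "mean")
sum_loss = reduce_losses(per_sample, "sum")
# Upweighting the minority-class sample pulls the batch loss toward its value.
weighted_loss = reduce_losses(per_sample, "weighted_mean", weights=[1.0, 1.0, 4.0])
```

Doubling the batch doubles `sum_loss` but leaves `mean_loss` roughly unchanged, which is why the sum reduction interacts with the learning rate.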

8. Loss vs. Metric

A common engineering trap is witnessing the training loss smoothly decrease while the business metric flatlines. This occurs because the loss is only a differentiable proxy. A model might improve its Pairwise Hinge Loss by pushing already-correct items further apart—which mathematically lowers the loss but changes absolutely nothing about the final sorting order evaluated by NDCG.
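The hinge-loss scenario just described can be reproduced with two items. The scores below are invented for illustration, and the `ranking` helper stands in for whatever sort the real metric evaluates.

```python
def pairwise_hinge(s_pos, s_neg, margin=1.0):
    return max(0.0, margin - (s_pos - s_neg))

def ranking(scores):
    """Item names sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

# Scores before and after a training step that widens an already-correct gap.
before = {"relevant": 1.0, "irrelevant": 0.8}
after = {"relevant": 1.5, "irrelevant": 0.5}

loss_before = pairwise_hinge(before["relevant"], before["irrelevant"])
loss_after = pairwise_hinge(after["relevant"], after["irrelevant"])

# The loss dropped, yet the ranking that any order-based metric sees is identical.
same_order = ranking(before) == ranking(after)
```

Any metric that depends only on the order of the items, NDCG included, reports the same value for both snapshots.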

Metrics define what we care about. Loss functions define what gradients can optimize. Aligning the two is the core challenge of system design.

Loss	Optimizes For	Typical Use Case	Key Engineering Caveat
BCE / Log Loss	Binary probability estimation	CTR, fraud detection, medical risk	Sensitive to label noise; good calibration still requires validation.
Softmax CE	Correct class probability	Classification, language modeling	Requires fused log-sum-exp implementation to avoid overflow / underflow.
Pairwise Hinge / BPR	Relative ordering	Ranking, recommendation	Highly dependent on the quality of negative sampling strategy.
Contrastive Loss	Embedding distance	Metric learning, Siamese nets	The margin choice dictates the density of the embedding space.
InfoNCE	Positive identification among negatives	Dense retrieval, representation learning	Batch size of negatives and temperature scale heavily influence convergence.
