In this case, the activation function does not depend on the scores of any class in \(C\) other than \(C_1 = C_i\). So the gradient with respect to each score \(s_i\) in \(s\) will only depend on the loss given by its binary problem.

- Caffe: Sigmoid Cross-Entropy Loss Layer
- PyTorch: BCEWithLogitsLoss
- TensorFlow: sigmoid_cross_entropy.
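
The independence between binary problems described above can be checked numerically. The following is a minimal sketch, not the code of any of the layers listed: it assumes a sigmoid activation and uses the standard result that the binary cross-entropy gradient with respect to a raw score is \(f(s_i) - t_i\). The helper names `sigmoid` and `bce_grad` are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_grad(scores, targets):
    """Gradient of the summed binary cross-entropy w.r.t. each raw score.

    With a sigmoid output the derivative is f(s_i) - t_i, so entry i
    depends only on scores[i] and targets[i].
    """
    return [sigmoid(s) - t for s, t in zip(scores, targets)]

targets = [1.0, 0.0, 1.0]
grad_a = bce_grad([2.0, -1.0, 0.5], targets)
grad_b = bce_grad([2.0, 3.0, 0.5], targets)  # only the second score changed
```

Changing the second score alters only the second gradient entry; the entries for the other classes stay the same.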

## Focal Loss

Focal Loss was introduced by Lin et al., from Facebook, in this paper. They claim to improve one-stage object detectors using Focal Loss to train a detector they name RetinaNet. Focal Loss is a Cross-Entropy Loss that weighs the contribution of each sample to the loss based on the classification error. The idea is that, if a sample is already classified correctly by the CNN, its contribution to the loss decreases. With this strategy, they claim to solve the problem of class imbalance by making the loss implicitly focus on the problematic classes. Moreover, they also weight the contribution of each class to the loss for a more explicit class balancing. They use Sigmoid activations, so Focal Loss could also be considered a Binary Cross-Entropy Loss. We define it for each binary problem as:

\[FL = -\sum_{i=1}^{C'=2} (1 - s_i)^{\gamma} t_i \log(s_i)\]

Here \((1 - s_i)^{\gamma}\), with the focusing parameter \(\gamma \geq 0\), is a modulating factor that reduces the influence of correctly classified samples on the loss. With \(\gamma = 0\), Focal Loss is equivalent to Binary Cross-Entropy Loss.
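
The effect of the modulating factor can be seen with a small numeric sketch. It assumes `s1` is already a sigmoid probability strictly between 0 and 1; the function names are hypothetical and this is an illustration, not the paper's implementation.

```python
import math

def binary_focal_loss(s1, t1, gamma=2.0):
    """Focal loss for one binary problem.

    s1: sigmoid probability of class C_1, strictly between 0 and 1.
    t1: groundtruth label of C_1 (1 or 0).
    """
    s2, t2 = 1.0 - s1, 1.0 - t1  # the implicit second class C_2
    return -((1.0 - s1) ** gamma * t1 * math.log(s1)
             + (1.0 - s2) ** gamma * t2 * math.log(s2))

def binary_cross_entropy(s1, t1):
    """Plain binary cross-entropy, used here as the gamma = 0 reference."""
    return -(t1 * math.log(s1) + (1.0 - t1) * math.log(1.0 - s1))
```

For a positive sample scored at 0.9, the factor \((1 - 0.9)^2 = 0.01\) shrinks its loss to one hundredth of the cross-entropy value, while with `gamma=0.0` both functions agree.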

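The loss can also be written with separate expressions for when the class \(C_i = C_1\) is positive or negative (and therefore the class \(C_2\) is positive):

\[FL = \begin{cases} -(1 - s_1)^{\gamma} \log(s_1) & \text{if } t_1 = 1 \\ -(1 - (1 - s_1))^{\gamma} \log(1 - s_1) & \text{if } t_1 = 0 \end{cases}\]

As before, we have \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\).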

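The gradient gets a bit more complex due to the inclusion of the modulating factor \((1 - s_i)^{\gamma}\) in the loss formulation, but it can be deduced using the Binary Cross-Entropy gradient expression. For a positive class:

\[\frac{\partial}{\partial s_i} FL(f(s_i)) = (1 - f(s_i))^{\gamma} \left(\gamma f(s_i) \log(f(s_i)) + f(s_i) - 1\right) \quad \text{if } t_1 = 1\]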

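Here \(f()\) is the sigmoid function. To get the gradient expression for a negative \(C_i\) \((t_i = 0)\), we just need to replace \(f(s_i)\) with \((1 - f(s_i))\) in the expression for the positive class and flip its sign, since the derivative of \(1 - f(s_i)\) with respect to \(s_i\) is the negative of the derivative of \(f(s_i)\).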

Notice that, if the focusing parameter \(\gamma = 0\), the loss is equivalent to the CE Loss, and we end up with the same gradient expression.
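
A quick way to gain confidence in a gradient expression like this one is to compare it with a finite-difference estimate. The sketch below assumes a single raw score `s` passed through a sigmoid; the function names are hypothetical. For a negative label it evaluates the positive-class expression on \(1 - f(s)\) and negates it, because \(1 - f(s)\) decreases as \(s\) grows.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss_from_score(s, t, gamma):
    """Focal loss of one binary problem as a function of the raw score s."""
    p = sigmoid(s) if t == 1 else 1.0 - sigmoid(s)  # probability of the true class
    return -((1.0 - p) ** gamma) * math.log(p)

def focal_grad(s, t, gamma):
    """Analytic gradient of the focal loss w.r.t. the raw score s."""
    p = sigmoid(s) if t == 1 else 1.0 - sigmoid(s)
    g = (1.0 - p) ** gamma * (gamma * p * math.log(p) + p - 1.0)
    # For t = 0 the chain rule through (1 - f(s)) introduces a sign flip.
    return g if t == 1 else -g

def numeric_grad(s, t, gamma, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    return (focal_loss_from_score(s + eps, t, gamma)
            - focal_loss_from_score(s - eps, t, gamma)) / (2.0 * eps)
```

With `gamma = 0.0`, `focal_grad` reduces to \(f(s) - t\), the Binary Cross-Entropy gradient.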

## Forward pass: Loss computation

In the forward pass, logprobs[r] stores, for each element of the batch, the sum of the binary cross-entropy terms over all classes. The focusing_parameter is \(\gamma\), which by default is 2 and should be defined as a layer parameter in the net prototxt. The class_balances can be used to introduce different loss contributions per class, as they do in the Facebook paper.
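
The computation described above might look like the following pure-Python sketch. It is not the original layer code: the function name, the list-of-rows input layout, the use of a plain list for `class_balances`, and averaging the loss over the batch are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss_forward(scores, labels, focusing_parameter=2.0, class_balances=None):
    """scores, labels: lists of rows (one per batch element, one column per class)."""
    num_classes = len(scores[0])
    if class_balances is None:
        class_balances = [1.0] * num_classes  # assumption: no re-weighting by default
    logprobs = []
    for score_row, label_row in zip(scores, labels):
        total = 0.0
        for c in range(num_classes):
            p = sigmoid(score_row[c])
            if label_row[c] == 0:
                p = 1.0 - p  # probability of the implicit background class C_2
            total += class_balances[c] * (1.0 - p) ** focusing_parameter * math.log(p)
        logprobs.append(total)  # sum over classes for this batch element
    loss = -sum(logprobs) / len(scores)  # assumption: average over the batch
    return loss, logprobs
```

For a single sample with two zero scores and labels `[1, 0]`, every class has probability 0.5, so with `focusing_parameter=0.0` the loss is \(2 \log 2\), and with the default of 2 each term is scaled by \(0.5^2\).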

## Backward pass: Gradients computation
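
As a rough illustration of this step, the sketch below computes the gradient of the focal loss with respect to every raw score. It is not the original layer code: the function name, the list-of-rows layout, the list-based `class_balances`, and the division by the batch size (which assumes the loss is averaged over the batch) are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss_backward(scores, labels, focusing_parameter=2.0, class_balances=None):
    """Return delta, with delta[r][c] = d(loss) / d(scores[r][c])."""
    num_classes = len(scores[0])
    if class_balances is None:
        class_balances = [1.0] * num_classes  # assumption: no re-weighting by default
    gamma = focusing_parameter
    delta = []
    for score_row, label_row in zip(scores, labels):
        row = []
        for c in range(num_classes):
            p = sigmoid(score_row[c])
            if label_row[c] == 1:
                # Gradient for a class with a positive label.
                grad = (1.0 - p) ** gamma * (gamma * p * math.log(p) + p - 1.0)
            else:
                # Negative label: same expression on q = 1 - p, with the sign flipped.
                q = 1.0 - p
                grad = -(p ** gamma) * (gamma * q * math.log(q) + q - 1.0)
            row.append(class_balances[c] * grad / len(scores))
        delta.append(row)
    return delta
```

With `focusing_parameter=0.0` each entry reduces to \(f(s) - t\), the usual sigmoid cross-entropy gradient.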

In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class \(C_p\) keeps its term in the loss. There is only one element of the target vector \(t\) which is not zero, \(t_i = t_p\). So, discarding the elements of the summation which are zero due to the target labels, we can write:
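
The reduction described above can be checked numerically. The sketch assumes the scores go through a Softmax activation (an assumption, since this paragraph does not name the activation), in which case the reduced loss is minus the log of the Softmax probability of \(C_p\). The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, targets):
    """Full summation: -sum_i t_i * log(softmax(s)_i)."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

def cross_entropy_one_hot(scores, positive_index):
    """Reduced form for one-hot labels: only the positive class term survives."""
    return -math.log(softmax(scores)[positive_index])
```

Both functions return the same value whenever `targets` is one-hot with its single 1 at `positive_index`.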

This would be the pipeline for each one of the \(C\) classes. We set up \(C\) independent binary classification problems \((C' = 2)\). Then we sum up the loss over the different binary problems: we sum up the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. \(s_1\) and \(t_1\) are the score and the groundtruth label for the class \(C_1\), which is also the class \(C_i\) in \(C\). \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\) are the score and the groundtruth label of the class \(C_2\), which is not a "class" in our original problem with \(C\) classes, but a class we create to set up the binary problem with \(C_1 = C_i\). We can understand it as a background class.
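
The pipeline above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `probs` holds one sigmoid probability per class, each strictly between 0 and 1, and the function names are hypothetical.

```python
import math

def binary_problem_loss(s1, t1):
    """Cross-entropy of one binary problem with classes C_1 and C_2."""
    s2, t2 = 1.0 - s1, 1.0 - t1  # score and label of the background class C_2
    return -(t1 * math.log(s1) + t2 * math.log(s2))

def multi_label_loss(probs, targets):
    """Global loss: the sum of the C independent binary problems."""
    return sum(binary_problem_loss(s, t) for s, t in zip(probs, targets))
```

Each term uses only its own score and label, which is why the gradient of one class never involves the scores of the others.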