PyTorch BCE Loss and BCE with logits

A little finding in today’s work:

What is the difference between a sigmoid layer + `BCELoss` and `BCEWithLogitsLoss`?

The answer is numerical stability.
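Here is a quick sketch of what that means in practice (the logits and targets are made up for illustration): for a moderate logit the two paths agree, but once the sigmoid saturates in float32, `BCELoss` has to clamp a $\log(0)$, while `BCEWithLogitsLoss` still returns the correct value.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
bce_logits = nn.BCEWithLogitsLoss()

# Moderate logit: both paths give the same loss.
logit = torch.tensor([0.5])
target = torch.tensor([1.0])
print(bce(torch.sigmoid(logit), target))  # tensor(0.4741)
print(bce_logits(logit, target))          # tensor(0.4741)

# Large logit with the "wrong" target: sigmoid(20) rounds to exactly 1.0
# in float32, so BCELoss ends up with log(1 - 1) = log(0), which it can
# only clamp, while BCEWithLogitsLoss evaluates the loss analytically.
logit = torch.tensor([20.0])
target = torch.tensor([0.0])
print(bce(torch.sigmoid(logit), target))  # tensor(100.)    <- clamped, not the real loss
print(bce_logits(logit, target))          # tensor(20.0000) <- the correct value
```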

There are two numerical issues we need to consider in a softmax-style computation:

  1. Underflow: a number so close to zero is rounded to 0, and a later operation on it can then break, e.g. dividing by 0 or taking $\log(0)$.
  2. Overflow: a number is so large that it is rounded to $\infty$. (Both cases are reproduced in the sketch after this list.)
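Both failure modes are easy to reproduce in float32; the constants $\pm 200$ below are arbitrary:

```python
import torch

x = torch.tensor([-200.0, 200.0])   # float32 by default
print(torch.exp(x))                 # tensor([0., inf]): underflow and overflow
print(torch.log(torch.exp(x)))      # tensor([-inf, inf]) instead of [-200., 200.]
print(1.0 / torch.exp(x))           # tensor([inf, 0.]): dividing by the underflowed 0
```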

Let’s look at softmax as an example.

$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

If every $x_i$ equals the same constant $c$, the result should simply be $1/n$ for each $i$. But if $c$ is a very negative (small) constant, each $\exp(x_i)$ underflows to $0$ and we compute $0/0$; if $c$ is very large, each $\exp(x_i)$ overflows to $\infty$ and we compute $\infty/\infty$. Either way the output is NaN.
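Here is a small sketch of that failure (the constant $\pm 200$ and the vector length 3 are arbitrary); PyTorch's built-in `torch.softmax` is implemented stably and handles both cases:

```python
import torch

def naive_softmax(x):
    # Direct translation of the formula above: exp(x_i) / sum_j exp(x_j)
    e = torch.exp(x)
    return e / e.sum()

x_small = torch.full((3,), -200.0)    # every x_i the same very negative constant
x_large = torch.full((3,), 200.0)     # every x_i the same very large constant

print(naive_softmax(x_small))         # tensor([nan, nan, nan]):  0 / 0
print(naive_softmax(x_large))         # tensor([nan, nan, nan]):  inf / inf
print(torch.softmax(x_small, dim=0))  # tensor([0.3333, 0.3333, 0.3333])
print(torch.softmax(x_large, dim=0))  # tensor([0.3333, 0.3333, 0.3333])
```

The same issue shows up in the log-sum-exp (LSE):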

$\mathrm{LSE}(x_1, x_2, \ldots, x_n) = \log\bigl(\exp(x_1) + \exp(x_2) + \cdots + \exp(x_n)\bigr)$
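Computed naively, LSE overflows and underflows in exactly the same way. The standard fix is to pull the maximum out first, $\mathrm{LSE}(x) = \max(x) + \log\sum_j \exp(x_j - \max(x))$, and PyTorch exposes a stable implementation as `torch.logsumexp`. A small sketch with arbitrary values:

```python
import torch

def naive_lse(x):
    # Direct translation: log(exp(x_1) + ... + exp(x_n))
    return torch.log(torch.exp(x).sum())

def stable_lse(x):
    # LSE(x) = max(x) + log(sum_j exp(x_j - max(x))):
    # the largest exponent is exp(0) = 1, so nothing overflows.
    m = x.max()
    return m + torch.log(torch.exp(x - m).sum())

x = torch.tensor([1000.0, 999.0, 998.0])
print(naive_lse(x))               # tensor(inf): exp(1000) overflows
print(stable_lse(x))              # tensor(1000.4076)
print(torch.logsumexp(x, dim=0))  # tensor(1000.4076): PyTorch's built-in
```

This log-sum-exp trick is what the PyTorch docs say `BCEWithLogitsLoss` takes advantage of by fusing the sigmoid and the BCE into a single op, and that is why it is more numerically stable than a separate sigmoid layer + `BCELoss`.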