# What is the Appropriate Loss Function for Modeling Neural Responses?

This question has bothered me occasionally over the years and I’ve never really come to what I felt was a satisfactory answer. The choice of a loss function is really a choice about the assumed noise distribution. A mean squared error loss function, for example, corresponds to an assumption that the noise in the system to be modeled is Gaussian. Noise in the firing rate of neurons is often assumed to follow a Poisson distribution.

$P\left( x;\lambda \right) =\dfrac {\lambda ^{x}e^{-\lambda }} {x!}$

In the Bayesian framework, the noise distribution is also known as the likelihood. Converting a likelihood to a loss function is just a matter of taking the negative log of the the likelihood distribution: $loss=-log\left( P\left( x;\lambda \right) \right)$

Writing this in terms of the measured firing rate, $r$, and the predicted firing rate, $\hat {r}$, and dropping constant terms, we get the loss function typically used to fit models to neural data:

$r=x,\hat {r}=\lambda$
$loss\left( r,\hat {r}\right) =\hat {r}-r\log \left(\hat {r}\right)$

The use of the Poisson distribution as a noise model carries with it the assumption that within a given time window, the variance of the firing rate is equal to the mean of the firing rate. The parameter that determines the mean and variance in the Poisson distribution is $\lambda$. So the Poisson distribution says that the higher the average firing rate in response to a given stimulus, the higher the variance of the firing rate in response to that stimulus. Yet because signal-to-noise (SNR) is determined by the ratio of the mean to the standard deviation, SNR increases as the mean firing rate $\lambda$ increases:

$SNR=\dfrac {\mu } {\sigma }=\dfrac {\lambda } {\sqrt {\lambda }}$

This all makes sense, we would want and expect higher firing rates of neurons to have higher SNR, otherwise it would be hard to transmit information and would be a tremendous waste of energy. But the relevant questions are:
1) Is the mean and variance of the firing rates of real neurons coupled as predicted by the Poisson distribution?
2) And if not, what are the consequences of the difference and how can we fix it?

The answer to the first question is a definite NO. For neurons driven by natural stimuli, their responses are clearly sub-Poisson; the variance of the firing rate is lower than the mean of the firing rate. For example, here are plots of the mean vs the variance of firing rates in each of 2000 16.6ms bins for two neurons in V1. The mean and variance seem to be linearly related, but the slope is clearly less than 1. The slope in each plot is essentially the Fano Factor for each neuron. The parameter $\alpha$, which we will make use of shortly, is the inverse of the Fano Factor, estimated by taking the median of the $\dfrac{mean}{variance}$ across all bins.

So, higher firing rates have even higher SNR than predicted by the Poisson distribution. This is good for energy use in the brain, but seems to be bad for the standard Poisson loss function. The Poisson loss function may not give enough credence to the SNR of high firing rates. Models fit with the Poisson loss function could thus be more influenced by lower firing rate time bins and less influenced by higher firing rate time bins than one would want, given that the noise is actually sub-Poisson.

So how can we fix this? Can we modify the Poisson distribution with an extra parameter,  $\alpha$, so that the variance is coupled to the mean like:
$E\left( X\right) =\lambda$
$Var\left( X\right) =\dfrac {\lambda}{\alpha}$

After playing with the equation for the Poisson distribution for a while, I worked out this approximate distribution function:

$P\left( x;\lambda ,\alpha \right) \approx \dfrac {\alpha \left( \alpha \lambda \right) ^{\alpha x}e^{-\alpha \lambda }} {\Gamma \left( \alpha x+1\right) }$

This function isn’t an exact distribution function, it doesn’t sum to exactly 1 for every setting of $\alpha$ and $\lambda$, but it is a very good approximation and has the properties $E\left( X\right)\approx\lambda$ and$Var\left( X\right) \approx \dfrac {\lambda}{\alpha}$. Furthermore, it is easy to see that when $\alpha=1$ this function reduces to the Poisson distribution function, since $\Gamma \left( x+1\right) =x!$ for all positive integers $x$.

This is nice, but it seemed a bit contrived to me. While trying to find some connection to the literature, I came across this paper which generalizes the Poisson distribution using the Mittag-Leffler function:

$E_{\alpha ,\beta }\left( \lambda \right) =\sum _{k=0}^{\infty }\dfrac {\lambda ^{k}} {\Gamma \left( \alpha k+\beta \right) }$

The Mittag-Leffler function generalizes the exponential function; the exponential corresponds to setting $\alpha = \beta =1$. And so to generalize the Poisson distribution, the authors proposed:

$P\left( x;\lambda, \alpha, \beta \right) =\dfrac {\lambda ^{x}} {E_{\alpha, \beta}(\lambda)\Gamma(\alpha x + \beta)}$

The formulation, however, breaks the connection between $\lambda$ and $E\left( X\right)$ as well as any simple connection to $Var\left( X\right)$. I found, however, that by making the replacement $\lambda \rightarrow \left( \alpha \lambda \right) ^{\alpha }$, this restores the connections $E\left( X\right)\approx\lambda$ and $Var\left( X\right) \approx \dfrac {\lambda}{\alpha}$, at least when $\beta=1$. I therefore propose this distribution as a more user friendly generalization of the Poisson distribution, using the Mittag-Leffler function:

$P\left( x;\lambda, \alpha \right) =\dfrac {(\alpha \lambda)^{\alpha x}} {E_{\alpha, 1}((\alpha \lambda)^{\alpha})\Gamma(\alpha x + 1)}$

This exact distribution relates to the approximate distribution function presented earlier, because as it turns out, the Mittag-Leffler function can in this case be well approximated as:

$E_{\alpha ,1}\left( \left( \alpha \lambda \right) ^{\alpha }\right) \approx \dfrac {e^{\alpha \lambda }} {\alpha }$

Making this substitution returns the approximate distribution function above. Now that this approximation has a little more mathematical grounding, we can return to the question of appropriate loss functions for neurons. For reference, let’s first plot the squared error loss function. We see that this loss function is minimized when a model’s prediction $\lambda$ and the data $x$ are equal.

We can turn both the exact distribution function and the approximate distribution function described above into loss functions by taking their negative log. Let’s now plot the exact distribution function’s negative log likelihood when $\alpha=2$. The 1-to-1 line where a model’s prediction $\lambda$ and the data $x$ are equal is plotted in orange and the minimum of the function, given the data, is plotted in red. If the loss function is unbiased the red line should be hidden behind the orange line, but we see that is not the case for small values. The loss function is actually minimized with slightly higher values of $\lambda$, given small values for $x$. It should also be noted that this loss function contains an infinite sum so it is pretty inconvenient to work with.Now let’s plot the approximate distribution function’s negative log likelihood when $\alpha=2$.This loss function is nearly identical, but is unbiased. It’s also much easier to deal with since there’s no infinite summation. However, after dropping constants, scaling factors and terms that don’t depend on the model parameters, we get:

$loss\left( r,\hat {r}\right) =\hat {r}-r\log \left(\hat {r}\right)$

which is just the standard Poisson loss function!

This was surprising to me at first, so I investigated a related model that has a similar linear coupling between the mean and variance, the quasi-Poisson  model. (The quasi-Poisson model also has a corresponding approximate distribution function that I found is actually less accurate is most situations than the one proposed here, but that’s really not worth getting into as we’ll see). The mean and variance in the quasi-Poison model are defined as:

$E\left( Y\right) =\mu$
$Var\left( Y\right) =\theta \mu$

And a quasi-Poison GLM is defined as:

$\mu =g^{-1}\left( X\beta \right)$

with $X$ as the independent variables and $\beta$ as the weights and $g$ is the exponential function. The weights of the quasi-Poisson model are found using iteratively weighted least squares (IWLS) with the update rule:

$\hat {\beta }^{\left[ j+1\right] }=\left( X'W^{\left[ j\right] }X\right)^{-1}X'W^{\left[ j\right] }\overline {y}^{\left[ j\right] }$

and the weighting function:

$W=diag\left( \dfrac {\mu_{1}} {\theta }\ldots \dfrac {\mu_{n}} {\theta }\right)=\dfrac {1}{\theta} diag\left(\mu_{1}\ldots\mu_{n}\right)$

However, notice that when you substitute $W$ into the update rule, all the $\theta$ terms will cancel and disappear. Thus the weights found in a quasi-Poisson model are independent of $\theta$ and are identical to what would be found with a standard Poisson model! So again it seems that as long as the variance is linearly related to the mean, the standard Poisson loss function is appropriate. Given that the mean and variance of the firing rate seem linearly related in neurons, we can continue using the standard Poisson loss function when fitting models, but with a little more confidence that we’re using the appropriate loss function for the job.