This is a cost function associated with a mixture-density network.
For background, this is inspired by Bishop's work pioneering the mixture
density network. The essence of the idea is that the cost function
models the output as if it were a mixture of gaussian probability
densities. The network is trained to minimize the negative log
likelihood of the labels under that mixture, and its output provides,
for each component distribution, the "alpha" (mixing coefficient), the
'mu' (mean), and the 'sigma' (standard deviation).
For a full description of the technique, refer to Bishop's work.
Bishop CM. Mixture density networks,
Neural Computing Research Group Report:
NCRG/94/004, Aston University, Birmingham, 1994
There is no public constructor; please use the builder to create an
appropriate mixture loss function for the number of outputs and number
of mixtures you would like to fit.
Note that this means that the output
layer must provide (labelWidth+2)*mixtures output values in order to describe
the distributions of vectors of width labelWidth.
Please ensure that the size of the output layer matches this value,
(labelWidth+2)*mixtures, for the labels you intend to fit.
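The sizing rule above can be sketched with a small helper (this is an illustrative, hypothetical method, not part of the library API):

```java
public class MdnSizing {
    // Each of the `mixtures` components needs 1 alpha value, `labelWidth`
    // mean values, and 1 sigma value: (labelWidth + 2) values per component.
    public static int requiredOutputSize(int labelWidth, int mixtures) {
        return (labelWidth + 2) * mixtures;
    }

    public static void main(String[] args) {
        // e.g. 3-dimensional labels fit with 5 gaussians -> 25 output values
        System.out.println(requiredOutputSize(3, 5)); // prints 25
    }
}
```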
Computes the aggregate score as the sum of the individual scores of
each label against each output of the network. For
the mixture density network, this is the negative log likelihood that
the given labels fall within the probability distribution described by
the mixture of gaussians of the network output.
This method returns the score for each of the given outputs against the
given set of labels. For a mixture density network, this is done by
extracting the "alpha", "mu", and "sigma" components of each gaussian
and computing the negative log likelihood that the labels fall within
a linear combination of these gaussian distributions. The smaller
the negative log likelihood, the more probable it is that the given
labels fall within the distribution. Minimizing the negative log
likelihood therefore maximizes the probability that the gaussian
mixture explains the observed data.
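As a sketch of the computation described above, the negative log likelihood of one label vector under a mixture of spherical gaussians (one shared sigma per component, as in Bishop's formulation) can be written as (an illustrative standalone implementation, not the library's internal code):

```java
public class MixtureNll {
    // Negative log likelihood of label vector t under the mixture:
    //   -ln( sum_k alpha[k] * N(t | mu[k], sigma[k]^2 * I) )
    // where N is a spherical gaussian density in d = t.length dimensions.
    public static double negLogLikelihood(double[] t,
                                          double[] alpha,
                                          double[][] mu,
                                          double[] sigma) {
        int d = t.length;
        double likelihood = 0.0;
        for (int k = 0; k < alpha.length; k++) {
            // squared euclidean distance between the label and this mean
            double sq = 0.0;
            for (int i = 0; i < d; i++) {
                double diff = t[i] - mu[k][i];
                sq += diff * diff;
            }
            // normalization constant (2*pi*sigma^2)^(-d/2)
            double norm = Math.pow(2.0 * Math.PI * sigma[k] * sigma[k], -d / 2.0);
            likelihood += alpha[k] * norm * Math.exp(-sq / (2.0 * sigma[k] * sigma[k]));
        }
        return -Math.log(likelihood);
    }

    public static void main(String[] args) {
        // Single unit gaussian centered exactly on the label:
        // the NLL reduces to 0.5 * ln(2*pi)
        double nll = negLogLikelihood(new double[]{0.0},
                new double[]{1.0}, new double[][]{{0.0}}, new double[]{1.0});
        System.out.println(nll);
    }
}
```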
This method returns the gradient of the cost function with respect to the
output from the previous layer. For this cost function, the gradient
is derived from Bishop's paper "Mixture Density Networks" (1994) which
gives an elegant closed-form expression for the derivatives with respect
to each of the output components.
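For reference, the closed-form derivatives from Bishop's paper take the following shape, assuming (as in the paper) that the alphas are produced by a softmax and the sigmas by an exponential of the raw network outputs z. Here phi_k is the k-th gaussian density, t the label vector of dimension d, and pi_k the posterior responsibility of component k:

```latex
\pi_k = \frac{\alpha_k \, \phi_k(t)}{\sum_j \alpha_j \, \phi_j(t)}

\frac{\partial E}{\partial z^{\alpha}_k} = \alpha_k - \pi_k

\frac{\partial E}{\partial z^{\mu}_{k,i}} = \pi_k \, \frac{\mu_{k,i} - t_i}{\sigma_k^2}

\frac{\partial E}{\partial z^{\sigma}_k} = \pi_k \left( d - \frac{\lVert t - \mu_k \rVert^2}{\sigma_k^2} \right)
```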