Research banner

An Empirical Bayesian Kernel Density Estimator

A multiscale, rank-adapted kernel density estimator with hierarchical Dirichlet priors and Gibbs sampling

Authors

Soma S. Dhavala

Souporno Ghosh

An Empirical Bayesian Kernel Density Estimator

These notes develop a non-parametric kernel density estimator (KDE) from a Bayesian standpoint, inspired by wavelet-based KDEs in the sense of analyzing the density at different scales while weighing those scales probabilistically. The estimator generalizes the rank-adapted KDE by modeling the density at each order statistic as a finite mixture of Gaussians on a dyadic scale, with hierarchical Dirichlet priors on the mixing weights and empirically derived hyper-priors. We review the standard families of KDEs, introduce the model and its prior specifications, and derive the full conditionals and rejection-sampling steps needed for a Gibbs sampler.

Introduction

Often times there is a need to infer the true underlying probability based on the observations, such as in, including but not limited to, data-mining, optimizing the process control parameters etc., Histograms, very rudimentary empirical density estimators, divide the whole data range into either equal or unequal sub intervals (bins) and then obtain the frequency of occurrence of each bin. They could lead to completely different estimates if the bin-width, and their locations are chosen differently.

The kernel density estimators (KDEs) offer practical alternatives to histograms, providing smooth density estimates. KDEs belong to a class of non-parametric methods of estimation, which assume no fixed structure of the underlying density and completely estimate the true density based on the observations alone.

However, a fundamental challenge in KDEs is controlling the degree of smoothness that a method provides. As the degree of smoothness is very subjective, many methods were proposed which minimize certain cost functions or consider some ad hoc criteria to select the smoothening criteria.

These notes develop and investigate a type of non-parametric KDE from a Bayesian stand-point, inspired by wavelet-based KDE in the sense of analyzing the density at different scales and the scales are weighed probabilistically.

The development is organized as follows: a few types of KDEs are introduced first. The model is introduced in the next Section, along with an analysis of the choice of priors. Finally, the main results are summarized in the conclusions.

Review of KDEs

In KDE, a kernel function (essentially a smoothening function) is centered at each data point and the contribution of each data point over a local neighborhood is calculated. Thus the contribution of an observed data point \(y(i)\) to the point at which we want to evaluate the density, say, at \(x\) depends on how far apart \(y(i)\) and \(x\) are.

Formally speaking, let \(y(i)\) be i.i.d random samples from a continuous density \(f(x)\). Then, \(\hat{f}(x)\) the KDE of \(f(x)\), is given by:

\[ \hat{f}(x) = \sum_{i=1}^{N} \Phi\left(\frac{x - y(i)}{h}\right) \tag{1} \]

where \(\Phi\) is the smoothening kernel, \(y(i)\) the observed data and \(h\) the bandwidth.

The optimal bandwidth that minimizes Average Integrated Mean Square Error (AIMSE) is given by [1]:

\[ h_{\text{opt}} = 0.9\,\sigma\,N^{-0.2} \tag{2} \]

However, there may not be a single bandwidth that offers the best smoothening at all locations. Consider a case, where the data is coming from a mixed distribution having spread-out means. Then, we might need different smoothness parameters for these two different distributions. A natural modification would be to vary the bandwidth at each location, leading to an adaptive bandwidth KDE, given by:

\[ f(x) = \sum_{i=1}^{N} \Phi\left(\frac{x - y(i)}{h_i}\right) \tag{3} \]

where \(h(i)\) is the local bandwidth. A particular choice for the bandwidth is:

\[ h(i) = \frac{h_{\text{opt}}}{\hat{f}(y(i))} \tag{4} \]

The adaptive bandwidth KDE performs better than the constant bandwidth KDE in terms of AIMSE but it suffers from tail problems, i.e., it tends to produce bumpy tails. One way to account for tails, is by choosing a different kernel. Another approach is to use the asymptotic properties based on the order statistics, which would naturally consider the location and thus the tails in the KDE. The rank adaptive KDE is given by [2]:

\[ f(x) = \sum_{i=1}^{M} \Phi\left(\frac{x - x_{(i)}}{h_i}\right) \tag{5} \]

where \(x_{(i)}\) is the \(i\)th order statistic and \(N\) is the number of ordered statistics. The local bandwidths are chosen as

\[ h_i = h_{\text{opt}} \left(\frac{p_i(1 - p_i)}{f^2(x_i)}\right)^{0.5} \tag{6} \]

where \(p_i\) is the empirical c.d.f of the order statistics (several definitions exist):

\[ p_i = \frac{i}{N + 1} \tag{7} \]

and the true density is replaced by some pilot estimate, in this case a constant BW KDE with optimal bandwidth.

We can view the above equation as a mixture of some parametric densities with different scale and location parameters. We immediately observe that, at each location determined by the order statistics, we employ a particular scale. For e.g, if we choose the standard normal density as the kernel, we can view the bandwidth as the scale parameter and the KDE is in fact a finite mixture model. This might work reasonably well in many situations, but may not model concave-shaped densities (like exponential densities which have a faster decay than Gaussians), one natural choice is to consider a hierarchy of mixture densities, i.e, we assume that at each pivot, we model the component of the mixture model, as another finite mixture model. The hierarchy of mixtures thence formed could model arbitrary distributions with a greater degree of flexibility.

The interpretation of hierarchy of mixtures is not new. For example, in the wavelet-shrinkage based KDEs, one is essentially using shifted and scaled versions of the mother wavelet (a wavelet can not be a pdf because its average value is zero) to approximate the underlying function [3]. However they produce non-bona fide densities because the underlying smoothening functions are not densities.

The model is introduced in the next section.

Model

The proposed form of the KDE approximates the density at each order statistic as finite mixture of Gaussians with different scales, given by:

\[ f(x) = \sum_{i=1}^{M} \beta_i \sum_{j=1}^{J} \alpha_i^j \Phi\left(\frac{x - x_{(i)}}{h_i^j}\right) \tag{8} \]

where \(x_i\) is the \(i\)th pivot and \(j\) corresponds to a particular scale. The scales are varied on a dyadic scale with their the median being the same as the local bandwidth in rank-adapted KDE. We can view \(\beta_i\)s as the associated weights of the density at the \(i\)th pivot and \(\alpha_i^j\)s weigh the densities with different scales but at the same pivot. We constrain these weights so that the KDE is in fact a valid probability density function, given by:

\[ \begin{gathered} \beta_i,\, \alpha_i^j \in [0, 1] \\ \textstyle\sum_{i=1}^{M} \beta_i = 1 \text{ and } \sum_{j=1}^{J} \alpha_i^j = 1 \end{gathered} \tag{9} \]

We treat the weights as random quantities rather than deterministic but unknown. This philosophy allows us to incorporate subjective knowledge about the problem at hand, and more importantly, it makes drawing almost any inference about the parameter possible, like 95% posterior confidence intervals. Thanks to the advancements in MCMC methods and increased hardware efficiency. However, in the present problem, we have little knowledge about the distributional properties of the weights. Hence, we ascertain the ambiguity by placing hyper priors on them.

Following are the prior specifications for the parameters:

\[ \begin{aligned} \boldsymbol{\beta}_{|\boldsymbol{\zeta}} &\sim \text{Dirichlet}(\boldsymbol{\zeta}) \\ \boldsymbol{\zeta} &\sim \text{Multinomial}(\boldsymbol{p}) \end{aligned} \tag{10} \]

\[ \begin{aligned} \boldsymbol{\alpha}_{i|\boldsymbol{\eta}} &\sim \text{Dirichlet}(\boldsymbol{\eta}_i) \\ \boldsymbol{\eta}_i &\sim \text{Multinomial}(\boldsymbol{\lambda}_i) \end{aligned} \tag{11} \]

where \(\sum_{j=1}^{J} \lambda_i^j = 1\) and \(\sum_{i=1}^{N} w_i = 1\)

We chose Dirichlet prior since \(\beta\)s are proportions (mixing weights) and consequently the Multinomial hyper prior. Of the several choices possible, we chose the empirical pmfs as the hyper-prior constants for \(\boldsymbol{\beta}\), given by

\[ w_i = \hat{f}(x_i) \tag{12} \]

We could have been non-informative by specifying equal proportions in the hyper-prior. We derived the priors based on the data as we are dealing with potentially hundreds of parameters and we are afraid that the chains may not converge. Thus our approach is an empirical Bayesian analysis.

Likewise for \(\boldsymbol{\lambda}\), we choose:

\[ \lambda_i^j = \frac{\exp\left(-\left|j - \left(\frac{J+1}{2}\right)\right|\right)}{\sum_{k=1}^{N} \exp\left(-\left|j - \left(\frac{J+1}{2}\right)\right|\right)} \tag{13} \]

We give maximum weight to the local bandwidth suggested by the rank-adapted KDE and taper-off the weights as we move-away from the nominal value. Thus over-smoothening and bumpy tails can be avoided. However, the Bayesian set-up would allow us to pick-up those scales if the data suggests otherwise.

The choice of scales and the way the scales are chosen is also subjective. In the present analysis, we have chosen the number of scales to be five (usually an odd number, without loss of generality). We are inspired by the work in wavelet literature. Hence, we have decided in favor of varying the scales on a dyadic scale, given by:

\[ h_i^j = h_i\, 2^{\frac{J+1}{2} - j} \tag{14} \]

MCMC computations

Having set-up the model, it remains to set-up the MCMC chains. We have chosen Gibbs sampler, not due to its simplicity but rather the complexity involved in specifying a nice proposal density for Metropolis-Hastings step, which is more of an art than of a science, has led to choose the other way.

The MCMC chains can be generated using the Gibbs sampler. The algorithm is as follows:

  • step-1: for \(t = 1,2,\ldots\) (chain loop)
  • step-2: choose \(i \in [1,N]\) without replacement (pivot loop)
  • step-3: draw \(\beta_i(t) \sim p(\beta_i | \theta_c)\)
  • step-4: choose \(j \in [1,J]\) without replacement (scale loop)
  • step-5: draw \(\alpha_i^j,\, |\theta_c(t) \sim p(\alpha_i^j | \theta_c)\)
  • step-6: go to step-1 until the desired number of posterior samples are drawn

Note that in the steps two and three, we have chosen to visit the pivots and scales randomly, i.e, we jump from one pivot to another randomly and within a pivot, we visit different scales also randomly. We have decided to use this random jumps to avoid the chains getting-stuck, a phenomenon that we have observed when looped sequentially in a deterministic fashion, the usual Gibbs sampler way!

In order to implement the Gibbs sampler, we need to know the full conditionals. We first note that, the posterior is proportional to:

\[ p(\boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \ell(x)\, \pi(\boldsymbol{\beta}_{|\boldsymbol{\zeta}})\, \pi(\boldsymbol{\zeta})\, \pi(\boldsymbol{\alpha}_{|\boldsymbol{\eta}})\, \pi(\boldsymbol{\eta}) \tag{15} \]

where \(\pi(.)\) representing a prior density and the likelihood is given as

\[ \ell(x) = \prod_{m=1}^{M} \sum_{i=1}^{M} \sum_{j=1}^{J} \beta_i \alpha_i^j \Phi\left(\frac{x_m - x_{(i)}}{h_i^j}\right) \tag{16} \]

If we factor-out the terms that are only dependent on a particular \(\alpha\) or \(\beta\), we would observe

\[ \ell(\alpha_{i|\theta^c}^j) \propto \prod_{m=1}^{M} \left(\alpha_i^j \gamma_{i,0}^j(m) + \gamma_{i,1}^j(m)\right) \tag{17} \]

where

\[ \gamma_{i,0}^j(m) = \beta_i \Phi\left(\frac{x_m - x_{(i)}}{h_i^j}\right) \tag{18} \]

and

\[ \gamma_{i,1}^j(m) = \sum_{\substack{k,l \\ k,l \neq \{i,j\}}} \beta_i\, \alpha_i^j\, \Phi\left(\frac{x_m - x_{(i)}}{h_i^j}\right) \tag{19} \]

where \(\theta^c\) is the set of all parameters excluding the one whose distribution we wanted to consider which should be evident from the context, for e.g., \(p(\alpha_{i|\theta^c}^j)\) is the full conditional density of \(\alpha_i^j\)

Likewise, for \(\beta\), we get,

\[ \ell(\beta_{i|\theta}) \propto \prod_{m=1}^{M} \left(\beta_i \psi_{i,0}(m) + \psi_{i,1}(m)\right) \tag{20} \]

where

\[ \psi_{i,0}(m) = \beta_i \sum_{j=1}^{J} \alpha_i^j \Phi\left(\frac{x_m - x_{(i)}}{h_i^j}\right) \tag{21} \]

\[ \psi_{i,1}^j(m) = \sum_{\substack{k,l \\ k \neq i}} \beta_i\, \alpha_i^j\, \Phi\left(\frac{x_m - x_{(i)}}{h_i^j}\right) \tag{22} \]

We can use rejection sampling to draw samples from the conditional density with the conditional Dirichlet as the proposal density, i.e.,

\[ p(\alpha_{i|\theta^c}^j) \propto \ell(\alpha_{i|\theta^c}^j)\, \pi(\alpha_{i|\theta^c}^j) \tag{23} \]

We note that the first term on the right-hand side is a polynomial of degree \(M\) in \(\alpha_i^j\) and second term is a conditional Dirichlet density. In this case, the rejection sampling reduces to the following simple algorithm, since \(\alpha_i^j \in [0, 1]\), to:

Let \(Q(\alpha_i^j)\) be the \(M\)th degree polynomial in \(\alpha_i^j\), with \(q_k\) being the coefficient associated with the term \(\alpha_i^{j\,k}\) (and \(q_{-1} = 0\)) then

  • step-1: \(u \sim U[0,1]\)
  • step-2: choose \(k \in [0, M]\) such that \(\sum_{l=-1}^{k-1} q_l < u \leq \sum_{l=-1}^{k} q_l\)
  • step-3: generate \(\alpha_i^j\) from \(\pi(\alpha_{i|\theta^c}^j)\)
  • step-4: \(u \sim U[0,1]\)
  • step-5: if \(u \leq \alpha_i^{j,k}\), accept the proposal, else go to step-3

Now, we need to be able to generate draws from the conditional Dirichlet as follows (step-3 of the above algorithm):

At a particular iteration in the MCMC chain,

  • \(\boldsymbol{\zeta} \sim \text{Multinomial}(\boldsymbol{\pi})\)
  • \(\alpha_i^j \sim \Gamma(\zeta_i^j, 1)\) (notation followed from Gelman Text Book)
  • for each \(i\), \(\alpha_i = \dfrac{\alpha_i^j}{\sum_{l=1}^{N} \alpha_i^j}\)

We can generate \(\beta_i\)s also in the same fashion.

Conclusions

The method described here generalizes the rank-adaptive KDE. The Bayesian set-up allows one to control the degree of smoothness through hyper priors. In the MCMC simulations, the chain must be traversed stochastically to avoid it getting stuck temporarily before it moves by itself. Overall, the approach is a viable candidate for further research warranting extensive simulations.

References

[1] S.J Sheather, “The performance of six popular bandwidth selection methods on some real data sets,” Comput. Statistics, vol. 7, pp. 225–250, 1992.

[2] David B. Kim., “Rank adapted kernel density estimation,” 2001.

[3] Peter Muller and Brani Vidakovic, “Bayesian inference with wavelets: Density estimation,” Journal of Computational and Graphical Statistics, vol. 7, pp. 456–468, 1998.

[4] K. Roeder and L. Wasserman, “Practical Bayesian density estimation using mixture of normals,” JASA, vol. 92, pp. 894–902, 1997.