We employ the Kantorovich-Rubinstein (KR) metric and $L^p$ generalizations to compare probability distributions on a given phylogenetic tree. Such distributions arise in the context of metagenomics, where a sample of environmental sequences may be treated as a collection of weighted points on a reference phylogenetic tree of known sequences. In contrast to many applications of Kantorovich-Rubinstein ideas, the phylogenetic KR metric can be written in a closed form and calculated in linear time. Using Monte Carlo resampling of the data, we assign a statistical significance level to the observed distance between two distributions under a null hypothesis of no clustering. We also approximate the significance level using a functional of a suitable Gaussian process; in the $L^2$ generalized case this functional is distributed as a linear combination of $\chi_1^2$ random variables weighted by the eigenvalues of an associated matrix. We conclude with an example application using our software implementation of the KR metric and its generalizations.
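To illustrate why a linear-time calculation is plausible, the following is a minimal sketch, not the authors' software. It assumes a rooted tree with all mass placed at nodes, so the distance reduces to a sum over edges of edge length times the absolute difference in mass below that edge, raised to the power $p$ and with a final $1/p$ root for $p \ge 1$. The function names, the dictionary-based tree encoding, and the label-shuffling scheme used for the Monte Carlo significance level are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def kr_distance(parent, length, p_mass, q_mass, p=1.0):
    """Sketch of a tree KR-type distance with all mass located at nodes.

    parent: dict mapping each non-root node to its parent
    length: dict mapping each non-root node to the length of the edge above it
    p_mass, q_mass: dicts mapping nodes to probability mass
    p: exponent of the assumed L^p generalization (p >= 1)
    """
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    # The root is the only node that is a parent but never a child.
    (root,) = [n for n in children if n not in parent]

    # Preorder traversal; reversing it visits every child before its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # diff[node] accumulates (P - Q) mass in the subtree hanging below node.
    diff = defaultdict(float)
    for node, m in p_mass.items():
        diff[node] += m
    for node, m in q_mass.items():
        diff[node] -= m

    total = 0.0
    for node in reversed(order):
        if node == root:
            continue
        total += length[node] * abs(diff[node]) ** p
        diff[parent[node]] += diff[node]  # one pass over edges: linear time
    return total ** (1.0 / p)


def to_mass(sample):
    """Turn a list of node labels (one per read) into equal-weight masses."""
    counts = Counter(sample)
    return {node: c / len(sample) for node, c in counts.items()}


def shuffle_pvalue(parent, length, sample_a, sample_b, n_iter=1000, p=1.0, seed=0):
    """Assumed resampling scheme: shuffle sample labels over the pooled reads."""
    rng = random.Random(seed)
    observed = kr_distance(parent, length, to_mass(sample_a), to_mass(sample_b), p)
    pooled = list(sample_a) + list(sample_b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        a, b = pooled[:len(sample_a)], pooled[len(sample_a):]
        if kr_distance(parent, length, to_mass(a), to_mass(b), p) >= observed:
            hits += 1
    return hits / n_iter


# Hypothetical toy tree: root r with children a and b, and c below a.
parent = {"a": "r", "b": "r", "c": "a"}
length = {"a": 1.0, "b": 2.0, "c": 0.5}
d = kr_distance(parent, length, {"a": 1.0}, {"b": 1.0})
```

In this toy case all of the mass must travel from `a` to `b` along a path of length 3, which is what the edge sum returns for `p=1`. Placing mass only at nodes is a simplification; points in the interior of an edge could be handled by splitting the edge at those points.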



