Python Package quitefastmst Reference

quitefastmst.knn_euclid(X[, k, Y, ...])

Euclidean Nearest Neighbours

quitefastmst.mst_euclid(X[, M, algorithm, ...])

Euclidean and Mutual Reachability Minimum Spanning Trees

quitefastmst Python Package: Euclidean and Mutual Reachability Minimum Spanning Trees

For best speed, consider building the package from sources using, e.g., the -O3 -march=native compiler flags and with OpenMP support enabled.

quitefastmst.knn_euclid(X, k=1, Y=None, algorithm='auto', max_leaf_size=0, squared=False, verbose=False)

Euclidean Nearest Neighbours

If Y is None, then the function determines the first k nearest neighbours of each point in X with respect to the Euclidean distance. It is assumed that each query point is not its own neighbour.

Otherwise, for each point in Y, this function determines the k nearest points thereto from X.

Parameters:
X: matrix, shape (n,d)

the n input points in \(\mathbb{R}^d\) (the “database”)

k: int < n

requested number of nearest neighbours (should be rather small, say, ≤ 20)

Y: None or an ndarray, shape (m,d)

the “query points”; note that setting Y=X, contrary to Y=None, will include the query points themselves amongst their own neighbours

algorithm{"auto", "kd_tree", "brute"}, default "auto"

K-d trees can only be used for d between 2 and 20. "auto" selects "kd_tree" in low-dimensional spaces

max_leaf_size: int

maximal number of points in the K-d tree leaves; smaller leaves use more memory, yet are not necessarily faster; use 0 to select the default value, currently set to 32.

squared: bool, default False

whether the output dist should use the squared Euclidean distance

verbose: bool

whether to print diagnostic messages

Returns:
pair: tuple

A pair (dist, ind) representing the k-nearest neighbour graph, where:

dist: a c_contiguous ndarray, shape (n,k) or (m,k)

dist[i,:] is sorted nondecreasingly for all i; dist[i,j] gives the weight of the edge {i, ind[i,j]}, i.e., the distance between the i-th point and its j-th nearest neighbour

ind: a c_contiguous ndarray of the same shape

ind[i,j] is the index (between 0 and n-1) of the j-th nearest neighbour of i
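
For illustration, a minimal usage sketch consistent with the shapes and orderings described above (random data; the assertions merely restate the documented properties):

    import numpy as np
    import quitefastmst

    rng = np.random.default_rng(123)
    X = rng.standard_normal((1000, 3))    # 1000 points in R^3

    # five nearest neighbours of each point in X (points are not their own neighbours)
    dist, ind = quitefastmst.knn_euclid(X, k=5)
    assert dist.shape == (1000, 5) and ind.shape == (1000, 5)
    assert np.all(np.diff(dist, axis=1) >= 0)   # each dist[i,:] sorted nondecreasingly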

Notes

The implemented algorithms, see the algorithm parameter, assume that k is rather small; say, k ≤ 20.

Our implementation of K-d trees [1] has been quite optimised; amongst others, it has good locality of reference (at the cost of making a copy of the input dataset), features the sliding midpoint (midrange) rule suggested in [2], node pruning strategies inspired by some ideas from [3], and a couple of further tuneups proposed by the current author. Still, it is well-known that K-d trees perform well only in spaces of low intrinsic dimensionality. Thus, due to the so-called curse of dimensionality, for high d, the brute-force algorithm is preferred.

The number of threads used is controlled via the OMP_NUM_THREADS environment variable or via quitefastmst.omp_set_num_threads at runtime. For best speed, consider building the package from sources using, e.g., -O3 -march=native compiler flags.
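
For instance (assuming an OpenMP-enabled build):

    import quitefastmst

    # use four threads for subsequent computations; alternatively, set the
    # OMP_NUM_THREADS environment variable before the Python session starts
    quitefastmst.omp_set_num_threads(4)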

References

[1]

J.L. Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM 18(9), 509–517, 1975, https://doi.org/10.1145/361002.361007

[2]

S. Maneewongvatana, D.M. Mount, It’s okay to be skinny, if your friends are fat, 4th CGC Workshop on Computational Geometry, 1999

[3]

N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search strategies in K-d Trees, 5th WSES/IEEE Conf. on Circuits, Systems, Communications & Computers (CSCC’01), 2001

quitefastmst.mst_euclid(X, M=1, algorithm='auto', max_leaf_size=0, first_pass_max_brute_size=0, mutreach_adj=-1.00000011920928955078125, verbose=False)

Euclidean and Mutual Reachability Minimum Spanning Trees

The function determines the/a(*) minimum spanning tree (MST) of a set of n points, i.e., an acyclic undirected connected graph whose vertices represent the points, whose edges are weighted by the distances between point pairs, and whose total edge weight is minimal.

MSTs have many uses in, amongst others, topological data analysis (clustering, density estimation, dimensionality reduction, outlier detection, etc.).

For M ≤ 2, we get a spanning tree that minimises the sum of Euclidean distances between the points, i.e., the classic Euclidean minimum spanning tree (EMST). If M = 2, the function additionally returns the distance to each point’s nearest neighbour.

If M > 2, the spanning tree is the smallest wrt the degree-M mutual reachability distance [9] given by \(d_M(i, j)=\max\{ c_M(i), c_M(j), d(i, j)\}\), where \(d(i,j)\) is the Euclidean distance between the i-th and the j-th point, and \(c_M(i)\) is the i-th M-core distance defined as the distance between the i-th point and its (M-1)-th nearest neighbour (not including the query point itself).
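
To make the definition concrete, below is a naive NumPy sketch that materialises the full degree-M mutual reachability distance matrix; it is for exposition only, as the algorithms underlying this function never build the whole O(n²) matrix:

    import numpy as np

    def mutual_reachability(X, M):
        """Naive degree-M mutual reachability distance matrix (exposition only)."""
        # pairwise Euclidean distances d(i, j)
        D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
        # c_M(i): distance to the (M-1)-th nearest neighbour of i; the sorted
        # i-th row of D starts with d(i,i) == 0, hence index M-1
        core = np.sort(D, axis=1)[:, M - 1]
        # d_M(i, j) = max{c_M(i), c_M(j), d(i, j)}
        return np.maximum(D, np.maximum(core[:, None], core[None, :]))

Note that for M = 1, all core distances are zero and the plain Euclidean distance matrix is recovered.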

Parameters:
X: matrix, shape (n,d)

the n input points in \(\mathbb{R}^d\)

M: int < n

the degree of the mutual reachability distance (should be rather small, say, ≤ 20). M ≤ 2 denotes the ordinary Euclidean distance

algorithm{"auto", "single_kd_tree", "sesqui_kd_tree", "dual_kd_tree", "brute"}, default "auto"

K-d trees can only be used for d between 2 and 20. "auto" selects "sesqui_kd_tree" for d ≤ 20 and "brute" otherwise

max_leaf_size: int

maximal number of points in the K-d tree leaves; smaller leaves use more memory, yet are not necessarily faster; use 0 to select the default value, currently set to 32 for the single-tree and sesqui-tree and 8 for the dual-tree Borůvka algorithm

first_pass_max_brute_size: int

minimal number of points in a node to treat it as a leaf (unless it actually is a leaf) in the first iteration of the algorithm; use 0 to select the default value, currently set to 32

mutreach_adj: float

adjustment for mutual reachability distance ambiguity (for M > 2), whose fractional part should be close to 0: values in (-1,0) prefer connecting to farther NNs, values in (0,1) favour closer NNs (which is what many other implementations provide), values in (-2,-1) prefer connecting to points with smaller core distances, and values in (1,2) favour larger core distances; see the Notes section below for more details

verbose: bool

whether to print diagnostic messages

Returns:
tuple

If M = 1, a pair (mst_dist, mst_index) defining the n-1 edges of the computed spanning tree is returned:

mst_dist: an array of length n-1

mst_dist[i] gives the weight of the i-th edge

mst_index: a matrix with n-1 rows and 2 columns

{mst_index[i,0], mst_index[i,1]} defines the i-th edge of the tree

The tree edges are ordered nondecreasingly w.r.t. their weights and then by the indexes (lexicographic ordering of the (weight, index1, index2) triples). For each i, it holds that mst_index[i,0] < mst_index[i,1].

For M > 1, we additionally get:

nn_dist: an n-by-(M-1) matrix

it gives the distances between each point and its M-1 nearest neighbours

nn_index: a matrix of the same shape

it provides the corresponding indexes of the neighbours
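
A minimal usage sketch consistent with the return values described above (random data; the assertions merely restate the documented properties):

    import numpy as np
    import quitefastmst

    rng = np.random.default_rng(123)
    X = rng.standard_normal((1000, 2))

    # Euclidean MST (M=1): n-1 edges ordered by weight, then lexicographically
    mst_dist, mst_index = quitefastmst.mst_euclid(X)
    assert mst_dist.shape == (999,) and mst_index.shape == (999, 2)
    assert np.all(np.diff(mst_dist) >= 0)             # weights nondecreasing
    assert np.all(mst_index[:, 0] < mst_index[:, 1])  # index1 < index2 per edge

    # mutual reachability MST (M=4); also gives each point's 3 nearest neighbours
    mst_dist, mst_index, nn_dist, nn_index = quitefastmst.mst_euclid(X, M=4)
    assert nn_dist.shape == (1000, 3) and nn_index.shape == (1000, 3)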

Notes

(*) We note that if there are many pairs of equidistant points, there can be many minimum spanning trees. In particular, it is likely that there are point pairs with the same mutual reachability distances.

To make the definition less ambiguous (albeit with no guarantees), internally, the brute-force algorithm relies on the adjusted distance: \(d_M(i, j)=\max\{c_M(i), c_M(j), d(i, j)\}+\varepsilon d(i, j)\) or \(d_M(i, j)=\max\{c_M(i), c_M(j), d(i, j)\}-\varepsilon \min\{c_M(i), c_M(j)\}\), where ε is close to 0. |mutreach_adj| < 1 selects the former formula (ε = mutreach_adj), whilst 1 < |mutreach_adj| < 2 chooses the latter (ε = mutreach_adj ∓ 1, taking the sign that brings ε back into (-1, 1)).
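
Equivalently, as a sketch of these two formulas (a hypothetical helper, not the package's internal code):

    import numpy as np

    def adjusted_mutreach(d_ij, core_i, core_j, mutreach_adj):
        """Ambiguity-breaking adjustment of the mutual reachability distance."""
        base = max(core_i, core_j, d_ij)            # plain d_M(i, j)
        if abs(mutreach_adj) < 1:                   # former formula: eps = mutreach_adj
            return base + mutreach_adj * d_ij
        eps = mutreach_adj - np.sign(mutreach_adj)  # latter: 1 < |mutreach_adj| < 2
        return base - eps * min(core_i, core_j)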

For the K-d tree-based methods, on the other hand, mutreach_adj indicates the preference towards connecting to farther/closer points wrt the original metric or having smaller/larger core distances if a point i has multiple nearest-neighbour candidates j’, j’’ with \(c_M(i) \geq \max\{d(i, j'), c_M(j')\}\) and \(c_M(i) \geq \max\{d(i, j''), c_M(j'')\}\). Generally, the smaller the mutreach_adj, the more leaves there should be in the tree (note that there are only four types of adjustments, though).

The implemented algorithms, see the algorithm parameter, assume that M is rather small; say, M ≤ 20.

Our implementation of K-d trees [6] has been quite optimised; amongst others, it has good locality of reference (at the cost of making a copy of the input dataset), features the sliding midpoint (midrange) rule suggested in [7], node pruning strategies inspired by some ideas from [8], and a couple of further tuneups proposed by the current author.

The “single-tree” version of the Borůvka algorithm [4] is naively parallelisable: in every iteration, it seeks each point’s nearest “alien”, i.e., the nearest point thereto from another cluster. The “dual-tree” Borůvka version of the algorithm is, in principle, based on [5]. As far as our implementation is concerned, the dual-tree approach is often faster only in 2- and 3-dimensional spaces, for M ≤ 2, and in a single-threaded setting. For another (approximate) adaptation of the dual-tree algorithm to the mutual reachability distance, see [11].

The “sesqui-tree” variant (by the current author) is a mixture of the two approaches: it compares leaves against the full tree. It is usually faster than the single- and dual-tree methods in very low dimensional spaces and usually not much slower than the single-tree variant otherwise.

Nevertheless, it is well-known that K-d trees perform well only in spaces of low intrinsic dimensionality (a.k.a. the “curse of dimensionality”). For high d, the “brute-force” algorithm is recommended. Here, we provide a parallelised [2] version of the Jarník [1] (a.k.a. Prim [3] or Dijkstra) algorithm, where the distances are computed on the fly (only once for M ≤ 2).
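
For reference, a single-threaded textbook rendition of this scheme might look as follows; it is a sketch only (O(n²) time, O(n) extra memory, M = 1), not the package's parallelised implementation:

    import numpy as np

    def emst_prim(X):
        """Textbook Jarnik-Prim EMST; returns (weight, index1, index2) triples."""
        n = X.shape[0]
        in_tree = np.zeros(n, dtype=bool)
        dist_to_tree = np.full(n, np.inf)       # distance to the growing tree
        closest = np.zeros(n, dtype=np.intp)    # nearest tree vertex per point
        edges = []
        i = 0                                   # start from an arbitrary vertex
        for _ in range(n - 1):
            in_tree[i] = True
            dist_to_tree[i] = np.inf
            # relax: distances from the newly added vertex, computed on the fly
            d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
            improved = (d < dist_to_tree) & ~in_tree
            dist_to_tree[improved] = d[improved]
            closest[improved] = i
            j = int(np.argmin(dist_to_tree))    # next vertex to attach
            edges.append((float(dist_to_tree[j]), min(closest[j], j), max(closest[j], j)))
            i = j
        return sorted(edges)                    # same edge ordering as mst_euclid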

The number of threads is controlled via the OMP_NUM_THREADS environment variable or via quitefastmst.omp_set_num_threads at runtime. For best speed, consider building the package from sources using, e.g., -O3 -march=native compiler flags.

References

[1]

V. Jarník, O jistém problému minimálním, Práce Moravské Přírodovědecké Společnosti 6, 1930, 57–63

[2]

C.F. Olson, Parallel algorithms for hierarchical clustering, Parallel Computing 21(8), 1995, 1313–1325, https://doi.org/10.1016/0167-8191(95)00017-I

[3]

R. Prim, Shortest connection networks and some generalizations, The Bell System Technical Journal 36(6), 1957, 1389–1401, https://doi.org/10.1002/j.1538-7305.1957.tb01515.x

[4]

O. Borůvka, O jistém problému minimálním, Práce Moravské Přírodovědecké Společnosti 3, 1926, 37–58

[5]

W.B. March, P. Ram, A.G. Gray, Fast Euclidean minimum spanning tree: Algorithm, analysis, and applications, Proc. 16th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining (KDD ‘10), 2010, 603–612

[6]

J.L. Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM 18(9), 509–517, 1975, https://doi.org/10.1145/361002.361007

[7]

S. Maneewongvatana, D.M. Mount, It’s okay to be skinny, if your friends are fat, The 4th CGC Workshop on Computational Geometry, 1999

[8]

N. Sample, M. Haines, M. Arnold, T. Purcell, Optimizing search strategies in K-d Trees, 5th WSES/IEEE Conf. on Circuits, Systems, Communications & Computers (CSCC’01), 2001

[9]

R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160–172, https://doi.org/10.1007/978-3-642-37456-2_14

[10]

R.J.G.B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data (TKDD) 10(1), 2015, 1–51, https://doi.org/10.1145/2733381

[11]

L. McInnes, J. Healy, Accelerated hierarchical density-based clustering, IEEE Intl. Conf. Data Mining Workshops (ICDMW), 2017, 33–42, https://doi.org/10.1109/ICDMW.2017.12