问题描述:

I have a set of points that are placed in 2 dimensional Euclidean space. If the distance between two points is closer than a predetermined cut-off radius, then they are considered neighbours. I would like to calculate the number of elements in contiguous regions (for example, let's say 1 and 2 are neighbours. If 2 and 3 are also neighbours, as well as 3 and 4, then we have 4 elements that are next to each other).

I have been Googling like hell all day and I have run into a lot of keywords like kd-tree, k means clustering, graph traversal, flood fill and connected-component analysis, but I can't learn them all at this point. I just need some direction to focus my efforts in the correct direction.

Your problem seems like a connected component analysis!

The first thing is to represent your data as a graph and there are tons of libraries for that. Let's take a example in python: you can use Networkx or Graph-tool. It is straightforward as a point can be represented as a node. Concerning the edges you basically have several solutions.

You can do a brute-force comparisons that compares all points to each others. It will run in O(n * (n-1) /2), ok if the dataset is small. If the dataset is big you can use different approximate algorithms (KD-tree, ball-tree), implemented in Flann or scikit-learn. Note that scikit-learn also give you the brute-force comparison.

The goal here is to create an edge (a link) between two nodes (points) if the distance is lower than your threshold.

Once you add the nodes and the edges in the graph, you can run a connected component algorithm which gives all connected components of the graph. A connected component is basically a continuous set of nodes linked by edges.

Note that once your problem is represented as a graph, almost 70 years of graph theory comes to the rescue. You can check if some regions have the same shape as others using subgraph isomorphism. You can check the importance of each point in each component with centrality measures. You can also partition a continuous into subregions without the need to threshold the edges using graph partitioning.

You can have live example of a continuous graph cut into multiple partitions on my blog.

For the sake of the argument: here is an example of several connected components belonging in the same graph (with self-edges).

It is clearly a graph theory problem. You have a graph whose nodes consist of the points and in which two nodes are connected by an edge if their distance is within a given threshold. You are looking for the sizes of the various connected components. A breadth-first search is the method of choice (https://en.wikipedia.org/wiki/Breadth-first_search) for enumerating such components though if the number of nodes is small enough you could use Warshall's algorithm on the corresponding adjacency matrix.

What you are describing sounds like the DBSCAN algorithm. I would check out the link for a description. It will require you defining the distance to neighbors and the minimum number of neighbors needed.

Also, the DBSCAN algorithm is equivalent finding the connected components of a graph where the graph is formed by connecting nodes within a certain distance if that node has a minimum number of neighbors. You can solve the graph connected component problem using BFS or DFS algorithms.

If the number of points is low, just loop over all pairs of points, test their distances, and if the distance is less than the threshold, associate those two points in a Union-Find data structure.

To speed this up, you can do an initial pass over the points that partitions the points into buckets (each bucket having height and width equal to the threshold); then for each point you don't need to compare it against every other point- just every other point in the surrounding 3x3 neighborhood of buckets.

It may be possible to get a further speedup from the fact that, if the points were partitioned into buckets of size (threshold*sqrt(2)/2), all points in the same bucket must necessarily be associated (no explicit distance check required) though with those smaller buckets you have to check a larger (5x5) surrounding neighborhood for possible close-enough points.