Just keep in mind while there is no API key required for OSM's Nominatum, there are no guarantees and they explicitly discourage bulk geocoding, with no clarification on what is bulk. [1] The lack of clarification seems intentional, which is understandable.
geocode-sqlite does support non-OSM hosted instances of Nominatum so find an alternative if you have large or regular geocoding tasks.
It's called Nominatim, not Nominatum. If you run into issues then reporting it on github repository might be the best option. It's not clear if the HN user is the author of the library.
The name is misleading to me. The name implies you can geocode from sqlite querying but this is basically a python package that uses sqlite as the data store for locations to be geocoded and where the results are stored.
I recently ran into the problem of how to efficiently filter a list of geo-locations based on a specific distance to a given geo-location. I could not find any discussion anywhere about possible algorithms. Does anyone have a recommendation (article, book, Web-site)?
If approximate distances (with known error bounds that depend on lat/long) is appropriate then string prefix matching on GeoHash[1] is easy and quite beautiful in many ways.
Basically locations within the same X character string code will be within a known lat/long square of each other. The longer the string the smaller the lat/long square.
The SpatiaLite extension for SQLite has a KNN implementation that is designed to handle this problem. Here are some notes I made on how to use it: https://til.simonwillison.net/spatialite/knn
Thanks for the many references to geodata libraries and implementations. But my question is more general: about the algorithms behind those implementations and their efficiency. To put the original question more abstractly: Given an arbitrary set of points on a sphere's surface, what is the most efficient algorithm to calculate a subset of the points which are within a certain distance over the sphere's surface to a certain (other) reference point on the sphere's surface? In addition a related problem: Given an arbitrary set of points on a sphere's surface, what is the most efficient algorithm to order this set according to the distance over the sphere's surface of each point to a certain (other) reference point on the sphere's surface?
The quite simple algorithm would be to calculate the distance of each point to the reference point and use these values for filtering or sorting respectively. But there should exist more sophisticated algorithms, like prefiltering or presorting the original set before applying more complex calculations or using other types of coordinate systems as latitude/longitude.
A trivial example would be if the filtering distance is more than half of the sphere circumference; in this case we do not need to calculate anything at all, because every point must be inside the distance. Or if we are using latitude/longitude notation and our reference point is on a pole, then we can just use the latitude for sorting without the need of distance calculation. One can easily think of some other special cases where specific alorithms may be used to speed up the operation.
Furthermore, I can think of a prefiltering algorithm for the latitude/longitude notation, where I calculate the coordinates of the four points directly north, south, west and east of the reference point in the given surface distance. Let us call them the "Noth point, "South point", "West point" and "East point". Then I exclude all points with coordinates west of the West point, east from the East point, north from the North point and south from the South point. Likewise I calculate the coordinates directly north-west, north-east, south-west, and south-east of the reference point in the given surface distance; all points within the minimum latitude/longitude range are within the given surface distance. Therefore I need only to perform the full calculation for the remaining points. (Special care must be taken if we cross the equator or the zero-meridian, of course.)
This is just a simple optimization that I came up with myself. I suppose, however, that a number of experts have already thought much more deeply about this group of problems. What interests me, is to learn about their solutions. An Internet search led me, for instance, to forum posts that recomment k-d trees to determine k nearest neighbors on a sphere. But I could not find a deeper analysis of this strategy regarding its computational efficiency anywhere.
There has been a lot of work in this area. There is no best algorithm, it depends upon the problem that you are trying to solve. However, I think that r-trees have advantages over k-d trees for this 2d case. Alternatively, there are Quadtrees which have different trade-offs to r-trees. A very different approach is to use space-filling curves to reduce the 2d problem to a 1d problem; this approach is normally called geohashing. If you search on these terms then you can rapidly find resources and scientific papers to help you understand the algorithms.
For most real world problems the distance between points can either be calculated manually or points within a large (eg. global) dataset will be already divided in to mutually irrelevant sets (eg. customers generally don't walk across two borders within one trip, grid sections are precomputed, etc.), therefore this problem (near-optimal sort by distance over a large dataset) is pretty uncommon. For the cases where it is required, caching may be viable. For the cases where it is not possible to either do the math or cache, it may be possible to parallelize the problem or push it to the client. In the final case where none of these generic approaches are available, then you should sleep on it and come back the next day because you almost certainly missed something. If you really didn't miss anything, congratulations: you can justify more hardware. If you can't afford renting or buying more hardware then your dream has come true: you are probably a mathematics academic, and not a programmer, thus you can spend your days twiddling algorithms safe in the knowledge that someone, somewhere, might make money from it and it almost certainly won't be you.
Less tongue in cheek, if you enjoy algorithms make sure to read The Art of Computer Programming.
geocode-sqlite does support non-OSM hosted instances of Nominatum so find an alternative if you have large or regular geocoding tasks.
[1] https://operations.osmfoundation.org/policies/nominatim/