Geocode-sqlite: Geocode rows in an SQLite database table

teruakohatu · on May 17, 2021

Just keep in mind while there is no API key required for OSM's Nominatum, there are no guarantees and they explicitly discourage bulk geocoding, with no clarification on what is bulk. [1] The lack of clarification seems intentional, which is understandable.

geocode-sqlite does support non-OSM hosted instances of Nominatum so find an alternative if you have large or regular geocoding tasks.

[1] https://operations.osmfoundation.org/policies/nominatim/

_690g · on May 17, 2021

The example of Nominatum use at the geocode-sqlite page shows using a local, thus non-OSM hosted, instance of Nominatum.

nominatum = Nominatum(user_agent="this-is-me", domain="nominatum.local.dev", scheme="http")

I have not tried it myself, thus if it fails, could you please report on what goes wrong there?

mtmail · on May 17, 2021

> from geopy.geocoders import Nominatum

It's called Nominatim, not Nominatum. If you run into issues then reporting it on github repository might be the best option. It's not clear if the HN user is the author of the library.

wodenokoto · on May 17, 2021

Not to be confused with spatial lite which trended on the front page earlier this year: https://www.gaia-gis.it/fossil/libspatialite/index

rubyfan · on May 17, 2021

The name is misleading to me. The name implies you can geocode from sqlite querying but this is basically a python package that uses sqlite as the data store for locations to be geocoded and where the results are stored.

Archelaos · on May 17, 2021

I recently ran into the problem of how to efficiently filter a list of geo-locations based on a specific distance to a given geo-location. I could not find any discussion anywhere about possible algorithms. Does anyone have a recommendation (article, book, Web-site)?

nl · on May 17, 2021

The generally correct answer is: PostGis.

If approximate distances (with known error bounds that depend on lat/long) is appropriate then string prefix matching on GeoHash[1] is easy and quite beautiful in many ways.

Basically locations within the same X character string code will be within a known lat/long square of each other. The longer the string the smaller the lat/long square.

[1] https://www.movable-type.co.uk/scripts/geohash.html, https://en.wikipedia.org/wiki/Geohash

simonw · on May 17, 2021

The SpatiaLite extension for SQLite has a KNN implementation that is designed to handle this problem. Here are some notes I made on how to use it: https://til.simonwillison.net/spatialite/knn

pezezin · on May 17, 2021

If you can use PostgreSQL, the obvious solution is PostGIS and the "geography" data type:

https://postgis.net/ https://postgis.net/docs/postgis_usage.html#PostGIS_Geograph...

mtmail · on May 17, 2021

https://gis.stackexchange.com/ and to a lesser extend https://help.openstreetmap.org/

remram · on May 17, 2021

If we're still talking SQLite, the RTree extension might be sufficient for this? https://sqlite.org/rtree.html

DonnyV · on May 17, 2021

You can use the NetTopologySuite for C# or Java. Use buffer function. https://github.com/NetTopologySuite/NetTopologySuite

floathub · on May 17, 2021

Pythagoras?

Archelaos · on May 17, 2021

Thanks for the many references to geodata libraries and implementations. But my question is more general: about the algorithms behind those implementations and their efficiency. To put the original question more abstractly: Given an arbitrary set of points on a sphere's surface, what is the most efficient algorithm to calculate a subset of the points which are within a certain distance over the sphere's surface to a certain (other) reference point on the sphere's surface? In addition a related problem: Given an arbitrary set of points on a sphere's surface, what is the most efficient algorithm to order this set according to the distance over the sphere's surface of each point to a certain (other) reference point on the sphere's surface?

The quite simple algorithm would be to calculate the distance of each point to the reference point and use these values for filtering or sorting respectively. But there should exist more sophisticated algorithms, like prefiltering or presorting the original set before applying more complex calculations or using other types of coordinate systems as latitude/longitude.

A trivial example would be if the filtering distance is more than half of the sphere circumference; in this case we do not need to calculate anything at all, because every point must be inside the distance. Or if we are using latitude/longitude notation and our reference point is on a pole, then we can just use the latitude for sorting without the need of distance calculation. One can easily think of some other special cases where specific alorithms may be used to speed up the operation.

Furthermore, I can think of a prefiltering algorithm for the latitude/longitude notation, where I calculate the coordinates of the four points directly north, south, west and east of the reference point in the given surface distance. Let us call them the "Noth point, "South point", "West point" and "East point". Then I exclude all points with coordinates west of the West point, east from the East point, north from the North point and south from the South point. Likewise I calculate the coordinates directly north-west, north-east, south-west, and south-east of the reference point in the given surface distance; all points within the minimum latitude/longitude range are within the given surface distance. Therefore I need only to perform the full calculation for the remaining points. (Special care must be taken if we cross the equator or the zero-meridian, of course.)

This is just a simple optimization that I came up with myself. I suppose, however, that a number of experts have already thought much more deeply about this group of problems. What interests me, is to learn about their solutions. An Internet search led me, for instance, to forum posts that recomment k-d trees to determine k nearest neighbors on a sphere. But I could not find a deeper analysis of this strategy regarding its computational efficiency anywhere.

robochat · on May 17, 2021

There has been a lot of work in this area. There is no best algorithm, it depends upon the problem that you are trying to solve. However, I think that r-trees have advantages over k-d trees for this 2d case. Alternatively, there are Quadtrees which have different trade-offs to r-trees. A very different approach is to use space-filling curves to reduce the 2d problem to a 1d problem; this approach is normally called geohashing. If you search on these terms then you can rapidly find resources and scientific papers to help you understand the algorithms.

contingencies · on May 17, 2021

For most real world problems the distance between points can either be calculated manually or points within a large (eg. global) dataset will be already divided in to mutually irrelevant sets (eg. customers generally don't walk across two borders within one trip, grid sections are precomputed, etc.), therefore this problem (near-optimal sort by distance over a large dataset) is pretty uncommon. For the cases where it is required, caching may be viable. For the cases where it is not possible to either do the math or cache, it may be possible to parallelize the problem or push it to the client. In the final case where none of these generic approaches are available, then you should sleep on it and come back the next day because you almost certainly missed something. If you really didn't miss anything, congratulations: you can justify more hardware. If you can't afford renting or buying more hardware then your dream has come true: you are probably a mathematics academic, and not a programmer, thus you can spend your days twiddling algorithms safe in the knowledge that someone, somewhere, might make money from it and it almost certainly won't be you.

Less tongue in cheek, if you enjoy algorithms make sure to read The Art of Computer Programming.

pkw792 · on May 17, 2021

If you're interested in math and algorithms more than simple (and perhaps correct) solutions, this might be of interest:

https://en.wikipedia.org/wiki/Vincenty%27s_formulae