I am currently involved in a project where the program I am writing is supposed to return a list of the 10 doctors nearest to a visitor's zip code. Note that this is not the same as doctors in a given radius (say 10 miles), but it does use the same basic formula. This article will be the first of a two part series and lays the groundwork for calculating what is nearest to a zip code. In the second part I intend to show how I integrate this logic with a list of doctors with addresses.
Below you will find the results of my experimental queries that compare the classic radius methodology with my own "10 nearest" methodology. My production program uses a UDB (DB2) database, and I thought of the performance contrasts between UDB and MySQL was interesting, so I am going to show all the comparisons below. Finally, I wanted to show what differences there were between the more accurate "Haversine Formula" as compared to the less intensive "Spherical Law of Cosines" when calculating the distance between two points on a sphere. These examples should give you plenty of real life code you can use in your own programs.
The company I work for subscribes to the zip code database (referred to here as ZipUSA) from http://www.zipdatafiles.com/data/ The data comes from the United States Postal Service and other sources, but I have occasionally found strange geocoding results. I will probably write a different article about that using a different application that I have written.Within 10 miles
Haversine Formula
The following finds the zip codes within 10 miles of 27712 using the haversine (great circle) equation. The field names are based on the ZipUSA tables.
SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance
FROM
tblZipUSA origin,
tblZipUSA destination
WHERE
origin.zip_code = '27712'
AND
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) < 10
ORDER BY
distance, zip_code
The ZipUSA database was set up with the same indexes in MySQL and UDB (DB2). Queries in MySQL were run in the phpAdmin tool. The UDB/DB2 queries were run in Quest.
MySQL
zip_code | distance |
---|---|
27712 | 0 |
27704 | 4.75326962990057 |
27705 | 5.34287681627437 |
27708 | 5.57383981806715 |
27503 | 5.58564401941337 |
27701 | 6.41245709075393 |
27706 | 6.58029004252293 |
27709 | 6.59449266010451 |
27702 | 6.61586916445661 |
27710 | 6.61586916445661 |
27711 | 6.61586916445661 |
27715 | 6.61586916445661 |
27717 | 6.61586916445661 |
27722 | 6.61586916445661 |
27707 | 8.57121741472298 |
27703 | 9.5828790270889 |
27278 | 9.58552637370736 |
27564 | 9.72570727493623 |
27509 | 9.80777888765073 |
Showing rows 0 - 18 (19 total, Query took 2.1139 sec)
DB2/UDB
zip_code | distance |
---|---|
27712 | 0 |
27704 | 4.75326962990023 |
27705 | 5.34287681627435 |
27708 | 5.57383981806714 |
27503 | 5.58564401941335 |
27701 | 6.41245709075397 |
27706 | 6.58029004252301 |
27709 | 6.5944926601046 |
27702 | 6.61586916445641 |
27710 | 6.61586916445641 |
27711 | 6.61586916445641 |
27715 | 6.61586916445641 |
27717 | 6.61586916445641 |
27722 | 6.61586916445641 |
27707 | 8.571217414723011 |
27703 | 9.582879027088699 |
27278 | 9.58552637370768 |
27564 | 9.725707274936291 |
27509 | 9.80777888765007 |
19 rows selected in 35.77 secs.
I was very shocked to have such a slow run time in UDB! I was expecting better performance from a corporate-level product. I also went back and checked all the indexes to verify they were on the same fields as were in my MySQL database. I do not have enough knowledge about UDB to explain why this disparity exists or how one might optimize the query to run better on this platform. I would not think one should have to do that.
Spherical Law of Cosines
SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance
FROM
tblZipUSA origin,
tblZipUSA destination
WHERE
origin.zip_code = '27712'
AND
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) < 10
ORDER BY
distance, zip_code
MySQL
zip_code | distance |
---|---|
27712 | 0 |
27704 | 4.75326963012033 |
27705 | 5.34287681651191 |
27708 | 5.57383981817397 |
27503 | 5.58564401945863 |
27701 | 6.41245709087189 |
27706 | 6.58029004273494 |
27709 | 6.59449266019943 |
27702 | 6.61586916464669 |
27710 | 6.61586916464669 |
27711 | 6.61586916464669 |
27715 | 6.61586916464669 |
27717 | 6.61586916464669 |
27722 | 6.61586916464669 |
27707 | 8.57121741478871 |
27703 | 9.58287902708428 |
27278 | 9.58552637385399 |
27564 | 9.72570727482013 |
27509 | 9.80777888768795 |
Showing rows 0 - 18 (19 total, Query took 1.9229 sec)
DB2/UDB
zip_code | distance |
---|---|
27712 | 0 |
27704 | 4.75326963012037 |
27705 | 5.3428768165119 |
27708 | 5.57383981817396 |
27503 | 5.58564401945862 |
27701 | 6.41245709087191 |
27706 | 6.58029004247087 |
27709 | 6.59449266019946 |
27702 | 6.61586916438409 |
27710 | 6.61586916438409 |
27711 | 6.61586916438409 |
27715 | 6.61586916438409 |
27717 | 6.61586916438409 |
27722 | 6.61586916438409 |
27707 | 8.571217414991439 |
27703 | 9.58287902726561 |
27278 | 9.58552637385398 |
27564 | 9.72570727482014 |
27509 | 9.807778887510789 |
19 rows selected in 11.05 secs.
It is obvious that using the Spherical Law of Cosines equation is much faster than the Haversine Formula. MySQL shows a 5% improvement in speed and UDB shows a 53% performance improvement. It is also obvious that the distances calculated are very close. For these mile calculations, the first seven decimal places are the same. In most applications I have seen, the miles are rounded off to two decimal places, meaning that there is no reason to use the Haversine formula for distances as far apart as most "centers of zip codes" are. Haversine calculations are more suited to much closer calculations. Also, since zip code to zip code distances are only used for approximate distances, it would not make sense to nit-pick over the only-so-slightly more accurate results the Haversine formula gives. And even beyond that, accuracy depends a lot on the latitude and longitude data supplied. I have found that different sources have different geocoding information concerning the geographic center of zip codes.
10th closest
The next part of this experiment was to find the 10 nearest zip codes to the one given. The first step is finding out which zip ranks number 10. In this case it is simply a matter of choosing the 10th zip code from the list.
Haversine Formula
SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance
FROM
tblZipUSA origin,
tblZipUSA destination
WHERE
origin.zip_code = '27712'
ORDER BY
distance, zip_code
LIMIT 10, 1
Zip Code | Distance |
---|---|
27711 | 6.61586916445661 |
Showing rows 0 - 0 (1 total, Query took 3.4441 sec)
Spherical Law of Cosines
SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance
FROM
tblZipUSA origin,
tblZipUSA destination
WHERE
origin.zip_code = '27712'
ORDER BY
distance, zip_code
LIMIT 10, 1
zip_code | distance |
---|---|
27711 | 6.61586916464669 |
Showing rows 0 - 0 (1 total, Query took 3.1833 sec)
10th Closest Without PO Box Zip Codes
If you have a stiuation where you will be working only with street addresses and can eliminate the zip codes, there is some performance improvement. In the ZipUSA tables, this is accomplished by placing fac_cd and zip_class in the WHERE clause. However, I soon found that 1) some of our doctors are using P.O. Boxes, and 2) some small towns only have post offices with PO Boxes. I ended up not using that further past this experiment.
SELECT DISTINCT
destination.zip_code,
(3956 * (2 * ASIN(SQRT(
POWER(SIN(((origin.lat-destination.lat)*0.017453293)/2),2) +
COS(origin.lat*0.017453293) *
COS(destination.lat*0.017453293) *
POWER(SIN(((origin.lng-destination.lng)*0.017453293)/2),2)
)))) distance
FROM
tblZipUSA origin,
tblZipUSA destination
WHERE
origin.zip_code='27712'
AND destination.fac_cd = 'P'
AND destination.zip_class != 'P'
ORDER BY distance
LIMIT 10,1
Zip Code | Distance |
---|---|
27703 | 9.58287927196898 |
Showing rows 0 - 0 (1 total, Query took 2.4226 sec)
Nearest 10
Now that we can find the 10th closest zip code, we can join the two types of queries together to find all the zip codes up to the closest 10th. As a pleasant side effect, if there is a tie for 10th place, all of those zip codes are included as well. This becomes particularly obvious in the case where several P.O. Box zip codes are served from the same post office, and thus have the same geocoding. Of course, it is possible that there will be two or more physically distinct zip codes that will be equidistant from a given reference zip code, but I can only imagine that would be a very rare case.
Haversine Formula
SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance
FROM
tblZipUSA origin,
tblZipUSA destination,
(
SELECT DISTINCT
destination1.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin1.lat - destination1.lat) * 0.0174532925 / 2), 2) +
COS(origin1.lat * 0.0174532925) *
COS(destination1.lat * 0.0174532925) *
POWER(SIN((origin1.lng - destination1.lng) * 0.0174532925 / 2), 2)
)) distance
FROM
tblZipUSA origin1,
tblZipUSA destination1
WHERE
origin1.zip_code = '27712'
ORDER BY
distance, zip_code
LIMIT 10, 1
) zipDistance
WHERE
origin.zip_code = '27712'
AND
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) <= zipDistance.distance
ORDER BY
distance, zip_code
zip_code | distance |
---|---|
27712 | 0 |
27704 | 4.75326962990057 |
27705 | 5.34287681627437 |
27708 | 5.57383981806715 |
27503 | 5.58564401941337 |
27701 | 6.41245709075393 |
27706 | 6.58029004252293 |
27709 | 6.59449266010451 |
27702 | 6.61586916445661 |
27710 | 6.61586916445661 |
27711 | 6.61586916445661 |
27715 | 6.61586916445661 |
27717 | 6.61586916445661 |
27722 | 6.61586916445661 |
Showing rows 0 - 13 (14 total, Query took 5.6745 sec)
Spherical Law of Cosines
SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance
FROM
tblZipUSA origin,
tblZipUSA destination,
(
SELECT DISTINCT
destination1.zip_code,
3956 * ACOS(
SIN(origin1.lat * 0.0174532925) * SIN(destination1.lat * 0.0174532925) +
COS(origin1.lat * 0.0174532925) * COS(destination1.lat * 0.0174532925) *
COS((destination1.lng - origin1.lng) * 0.0174532925)
) AS distance
FROM
tblZipUSA origin1,
tblZipUSA destination1
WHERE
origin1.zip_code = '27712'
ORDER BY
distance, zip_code
LIMIT 10, 1
) zipDistance
WHERE
origin.zip_code = '27712'
AND
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) <= zipDistance.distance
ORDER BY
distance, zip_code
zip_code | distance |
---|---|
27712 | 0 |
27704 | 4.75326963012033 |
27705 | 5.34287681651191 |
27708 | 5.57383981817397 |
27503 | 5.58564401945863 |
27701 | 6.41245709087189 |
27706 | 6.58029004273494 |
27709 | 6.59449266019943 |
27702 | 6.61586916464669 |
27710 | 6.61586916464669 |
27711 | 6.61586916464669 |
27715 | 6.61586916464669 |
27717 | 6.61586916464669 |
27722 | 6.61586916464669 |
Showing rows 0 - 13 (14 total, Query took 5.1353 sec)
So there you have it. In the second part of this series I will show how I use this same kind of logic to find the 10 closest locations from a database of addresses. Until then, enjoy!