Wednesday, April 04, 2007

Zip Code Distances

I am currently involved in a project where the program I am writing is supposed to return a list of the 10 doctors nearest to a visitor's zip code. Note that this is not the same as doctors in a given radius (say 10 miles), but it does use the same basic formula. This article will be the first of a two part series and lays the groundwork for calculating what is nearest to a zip code. In the second part I intend to show how I integrate this logic with a list of doctors with addresses.

Below you will find the results of my experimental queries that compare the classic radius methodology with my own "10 nearest" methodology. My production program uses a UDB (DB2) database, and I thought of the performance contrasts between UDB and MySQL was interesting, so I am going to show all the comparisons below. Finally, I wanted to show what differences there were between the more accurate "Haversine Formula" as compared to the less intensive "Spherical Law of Cosines" when calculating the distance between two points on a sphere. These examples should give you plenty of real life code you can use in your own programs.

The company I work for subscribes to the zip code database (referred to here as ZipUSA) from http://www.zipdatafiles.com/data/ The data comes from the United States Postal Service and other sources, but I have occasionally found strange geocoding results. I will probably write a different article about that using a different application that I have written.

Within 10 miles

Haversine Formula

The following finds the zip codes within 10 miles of 27712 using the haversine (great circle) equation. The field names are based on the ZipUSA tables.

SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'
AND
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) < 10

ORDER BY
distance, zip_code

The ZipUSA database was set up with the same indexes in MySQL and UDB (DB2). Queries in MySQL were run in the phpAdmin tool. The UDB/DB2 queries were run in Quest.


MySQL















































































zip_code distance
27712 0
27704 4.75326962990057
27705 5.34287681627437
27708 5.57383981806715
27503 5.58564401941337
27701 6.41245709075393
27706 6.58029004252293
27709 6.59449266010451
27702 6.61586916445661
27710 6.61586916445661
27711 6.61586916445661
27715 6.61586916445661
27717 6.61586916445661
27722 6.61586916445661
27707 8.57121741472298
27703 9.5828790270889
27278 9.58552637370736
27564 9.72570727493623
27509 9.80777888765073

Showing rows 0 - 18 (19 total, Query took 2.1139 sec)



DB2/UDB



















































































zip_code distance
27712 0
27704 4.75326962990023
27705 5.34287681627435
27708 5.57383981806714
27503 5.58564401941335
27701 6.41245709075397
27706 6.58029004252301
27709 6.5944926601046
27702 6.61586916445641
27710 6.61586916445641
27711 6.61586916445641
27715 6.61586916445641
27717 6.61586916445641
27722 6.61586916445641
27707 8.571217414723011
27703 9.582879027088699
27278 9.58552637370768
27564 9.725707274936291
27509 9.80777888765007

19 rows selected in 35.77 secs.


I was very shocked to have such a slow run time in UDB! I was expecting better performance from a corporate-level product. I also went back and checked all the indexes to verify they were on the same fields as were in my MySQL database. I do not have enough knowledge about UDB to explain why this disparity exists or how one might optimize the query to run better on this platform. I would not think one should have to do that.



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'
AND
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) < 10

ORDER BY
distance, zip_code


MySQL



















































































zip_code distance
27712 0
27704 4.75326963012033
27705 5.34287681651191
27708 5.57383981817397
27503 5.58564401945863
27701 6.41245709087189
27706 6.58029004273494
27709 6.59449266019943
27702 6.61586916464669
27710 6.61586916464669
27711 6.61586916464669
27715 6.61586916464669
27717 6.61586916464669
27722 6.61586916464669
27707 8.57121741478871
27703 9.58287902708428
27278 9.58552637385399
27564 9.72570727482013
27509 9.80777888768795

Showing rows 0 - 18 (19 total, Query took 1.9229 sec)



DB2/UDB



















































































zip_code distance
27712 0
27704 4.75326963012037
27705 5.3428768165119
27708 5.57383981817396
27503 5.58564401945862
27701 6.41245709087191
27706 6.58029004247087
27709 6.59449266019946
27702 6.61586916438409
27710 6.61586916438409
27711 6.61586916438409
27715 6.61586916438409
27717 6.61586916438409
27722 6.61586916438409
27707 8.571217414991439
27703 9.58287902726561
27278 9.58552637385398
27564 9.72570727482014
27509 9.807778887510789

19 rows selected in 11.05 secs.



It is obvious that using the Spherical Law of Cosines equation is much faster than the Haversine Formula. MySQL shows a 5% improvement in speed and UDB shows a 53% performance improvement. It is also obvious that the distances calculated are very close. For these mile calculations, the first seven decimal places are the same. In most applications I have seen, the miles are rounded off to two decimal places, meaning that there is no reason to use the Haversine formula for distances as far apart as most "centers of zip codes" are. Haversine calculations are more suited to much closer calculations. Also, since zip code to zip code distances are only used for approximate distances, it would not make sense to nit-pick over the only-so-slightly more accurate results the Haversine formula gives. And even beyond that, accuracy depends a lot on the latitude and longitude data supplied. I have found that different sources have different geocoding information concerning the geographic center of zip codes.


10th closest




The next part of this experiment was to find the 10 nearest zip codes to the one given. The first step is finding out which zip ranks number 10. In this case it is simply a matter of choosing the 10th zip code from the list.


Haversine Formula



SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1










Zip Code Distance
27711 6.61586916445661

Showing rows 0 - 0 (1 total, Query took 3.4441 sec)



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1










zip_code distance
27711 6.61586916464669

Showing rows 0 - 0 (1 total, Query took 3.1833 sec)



10th Closest Without PO Box Zip Codes


If you have a stiuation where you will be working only with street addresses and can eliminate the zip codes, there is some performance improvement. In the ZipUSA tables, this is accomplished by placing fac_cd and zip_class in the WHERE clause. However, I soon found that 1) some of our doctors are using P.O. Boxes, and 2) some small towns only have post offices with PO Boxes. I ended up not using that further past this experiment.



SELECT DISTINCT
destination.zip_code,
(3956 * (2 * ASIN(SQRT(
POWER(SIN(((origin.lat-destination.lat)*0.017453293)/2),2) +
COS(origin.lat*0.017453293) *
COS(destination.lat*0.017453293) *
POWER(SIN(((origin.lng-destination.lng)*0.017453293)/2),2)
)))) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code='27712'
AND destination.fac_cd = 'P'
AND destination.zip_class != 'P'

ORDER BY distance
LIMIT 10,1










Zip Code Distance
27703 9.58287927196898

Showing rows 0 - 0 (1 total, Query took 2.4226 sec)



Nearest 10



Now that we can find the 10th closest zip code, we can join the two types of queries together to find all the zip codes up to the closest 10th. As a pleasant side effect, if there is a tie for 10th place, all of those zip codes are included as well. This becomes particularly obvious in the case where several P.O. Box zip codes are served from the same post office, and thus have the same geocoding. Of course, it is possible that there will be two or more physically distinct zip codes that will be equidistant from a given reference zip code, but I can only imagine that would be a very rare case.



Haversine Formula



SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination,
(
SELECT DISTINCT
destination1.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin1.lat - destination1.lat) * 0.0174532925 / 2), 2) +
COS(origin1.lat * 0.0174532925) *
COS(destination1.lat * 0.0174532925) *
POWER(SIN((origin1.lng - destination1.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin1,
tblZipUSA destination1

WHERE
origin1.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1
) zipDistance

WHERE
origin.zip_code = '27712'
AND
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) <= zipDistance.distance

ORDER BY
distance, zip_code






























































zip_code distance
27712 0
27704 4.75326962990057
27705 5.34287681627437
27708 5.57383981806715
27503 5.58564401941337
27701 6.41245709075393
27706 6.58029004252293
27709 6.59449266010451
27702 6.61586916445661
27710 6.61586916445661
27711 6.61586916445661
27715 6.61586916445661
27717 6.61586916445661
27722 6.61586916445661

Showing rows 0 - 13 (14 total, Query took 5.6745 sec)



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination,
(
SELECT DISTINCT
destination1.zip_code,
3956 * ACOS(
SIN(origin1.lat * 0.0174532925) * SIN(destination1.lat * 0.0174532925) +
COS(origin1.lat * 0.0174532925) * COS(destination1.lat * 0.0174532925) *
COS((destination1.lng - origin1.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin1,
tblZipUSA destination1

WHERE
origin1.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1
) zipDistance
WHERE
origin.zip_code = '27712'
AND
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) <= zipDistance.distance

ORDER BY
distance, zip_code






























































zip_code distance
27712 0
27704 4.75326963012033
27705 5.34287681651191
27708 5.57383981817397
27503 5.58564401945863
27701 6.41245709087189
27706 6.58029004273494
27709 6.59449266019943
27702 6.61586916464669
27710 6.61586916464669
27711 6.61586916464669
27715 6.61586916464669
27717 6.61586916464669
27722 6.61586916464669

Showing rows 0 - 13 (14 total, Query took 5.1353 sec)



So there you have it. In the second part of this series I will show how I use this same kind of logic to find the 10 closest locations from a database of addresses. Until then, enjoy!