I am currently involved in a project where the program I am writing is supposed to return a list of the 10 doctors nearest to a visitor's zip code. Note that this is not the same as doctors in a given radius (say 10 miles), but it does use the same basic formula. This article will be the first of a two part series and lays the groundwork for calculating what is nearest to a zip code. In the second part I intend to show how I integrate this logic with a list of doctors with addresses.

Below you will find the results of my experimental queries that compare the classic radius methodology with my own "10 nearest" methodology. My production program uses a UDB (DB2) database, and I thought of the performance contrasts between UDB and MySQL was interesting, so I am going to show all the comparisons below. Finally, I wanted to show what differences there were between the more accurate "Haversine Formula" as compared to the less intensive "Spherical Law of Cosines" when calculating the distance between two points on a sphere. These examples should give you plenty of real life code you can use in your own programs.

The company I work for subscribes to the zip code database (referred to here as ZipUSA) from http://www.zipdatafiles.com/data/ The data comes from the United States Postal Service and other sources, but I have occasionally found strange geocoding results. I will probably write a different article about that using a different application that I have written.### Within 10 miles

#### Haversine Formula

The following finds the zip codes within 10 miles of 27712 using the haversine (great circle) equation. The field names are based on the ZipUSA tables.

SELECT DISTINCT

destination.zip_code,

3956 * 2 * ASIN(SQRT(

POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +

COS(origin.lat * 0.0174532925) *

COS(destination.lat * 0.0174532925) *

POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)

)) distance

FROM

tblZipUSA origin,

tblZipUSA destination

WHERE

origin.zip_code = '27712'

AND

3956 * 2 * ASIN(SQRT(

POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +

COS(origin.lat * 0.0174532925) *

COS(destination.lat * 0.0174532925) *

POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)

)) < 10

ORDER BY

distance, zip_code

The ZipUSA database was set up with the same indexes in MySQL and UDB (DB2). Queries in MySQL were run in the phpAdmin tool. The UDB/DB2 queries were run in Quest.

##### MySQL

zip_code | distance |
---|---|

27712 | 0 |

27704 | 4.75326962990057 |

27705 | 5.34287681627437 |

27708 | 5.57383981806715 |

27503 | 5.58564401941337 |

27701 | 6.41245709075393 |

27706 | 6.58029004252293 |

27709 | 6.59449266010451 |

27702 | 6.61586916445661 |

27710 | 6.61586916445661 |

27711 | 6.61586916445661 |

27715 | 6.61586916445661 |

27717 | 6.61586916445661 |

27722 | 6.61586916445661 |

27707 | 8.57121741472298 |

27703 | 9.5828790270889 |

27278 | 9.58552637370736 |

27564 | 9.72570727493623 |

27509 | 9.80777888765073 |

Showing rows 0 - 18 (19 total, Query took 2.1139 sec)

##### DB2/UDB

zip_code | distance |
---|---|

27712 | 0 |

27704 | 4.75326962990023 |

27705 | 5.34287681627435 |

27708 | 5.57383981806714 |

27503 | 5.58564401941335 |

27701 | 6.41245709075397 |

27706 | 6.58029004252301 |

27709 | 6.5944926601046 |

27702 | 6.61586916445641 |

27710 | 6.61586916445641 |

27711 | 6.61586916445641 |

27715 | 6.61586916445641 |

27717 | 6.61586916445641 |

27722 | 6.61586916445641 |

27707 | 8.571217414723011 |

27703 | 9.582879027088699 |

27278 | 9.58552637370768 |

27564 | 9.725707274936291 |

27509 | 9.80777888765007 |

19 rows selected in 35.77 secs.

I was very shocked to have such a slow run time in UDB! I was expecting better performance from a corporate-level product. I also went back and checked all the indexes to verify they were on the same fields as were in my MySQL database. I do not have enough knowledge about UDB to explain why this disparity exists or how one might optimize the query to run better on this platform. I would not think one should have to do that.

#### Spherical Law of Cosines

SELECT DISTINCT

destination.zip_code,

3956 * ACOS(

SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +

COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *

COS((destination.lng - origin.lng) * 0.0174532925)

) AS distance

FROM

tblZipUSA origin,

tblZipUSA destination

WHERE

origin.zip_code = '27712'

AND

3956 * ACOS(

SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +

COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *

COS((destination.lng - origin.lng) * 0.0174532925)

) < 10

ORDER BY

distance, zip_code

##### MySQL

zip_code | distance |
---|---|

27712 | 0 |

27704 | 4.75326963012033 |

27705 | 5.34287681651191 |

27708 | 5.57383981817397 |

27503 | 5.58564401945863 |

27701 | 6.41245709087189 |

27706 | 6.58029004273494 |

27709 | 6.59449266019943 |

27702 | 6.61586916464669 |

27710 | 6.61586916464669 |

27711 | 6.61586916464669 |

27715 | 6.61586916464669 |

27717 | 6.61586916464669 |

27722 | 6.61586916464669 |

27707 | 8.57121741478871 |

27703 | 9.58287902708428 |

27278 | 9.58552637385399 |

27564 | 9.72570727482013 |

27509 | 9.80777888768795 |

Showing rows 0 - 18 (19 total, Query took 1.9229 sec)

##### DB2/UDB

zip_code | distance |
---|---|

27712 | 0 |

27704 | 4.75326963012037 |

27705 | 5.3428768165119 |

27708 | 5.57383981817396 |

27503 | 5.58564401945862 |

27701 | 6.41245709087191 |

27706 | 6.58029004247087 |

27709 | 6.59449266019946 |

27702 | 6.61586916438409 |

27710 | 6.61586916438409 |

27711 | 6.61586916438409 |

27715 | 6.61586916438409 |

27717 | 6.61586916438409 |

27722 | 6.61586916438409 |

27707 | 8.571217414991439 |

27703 | 9.58287902726561 |

27278 | 9.58552637385398 |

27564 | 9.72570727482014 |

27509 | 9.807778887510789 |

19 rows selected in 11.05 secs.

It is obvious that using the Spherical Law of Cosines equation is much faster than the Haversine Formula. MySQL shows a 5% improvement in speed and UDB shows a 53% performance improvement. It is also obvious that the distances calculated are very close. For these mile calculations, the first seven decimal places are the same. In most applications I have seen, the miles are rounded off to two decimal places, meaning that there is no reason to use the Haversine formula for distances as far apart as most "centers of zip codes" are. Haversine calculations are more suited to much closer calculations. Also, since zip code to zip code distances are only used for approximate distances, it would not make sense to nit-pick over the only-so-slightly more accurate results the Haversine formula gives. And even beyond that, accuracy depends a lot on the latitude and longitude data supplied. I have found that different sources have different geocoding information concerning the geographic center of zip codes.

### 10th closest

The next part of this experiment was to find the 10 nearest zip codes to the one given. The first step is finding out which zip ranks number 10. In this case it is simply a matter of choosing the 10th zip code from the list.

#### Haversine Formula

SELECT DISTINCT

destination.zip_code,

3956 * 2 * ASIN(SQRT(

POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +

COS(origin.lat * 0.0174532925) *

COS(destination.lat * 0.0174532925) *

POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)

)) distance

FROM

tblZipUSA origin,

tblZipUSA destination

WHERE

origin.zip_code = '27712'

ORDER BY

distance, zip_code

LIMIT 10, 1

Zip Code | Distance |
---|---|

27711 | 6.61586916445661 |

Showing rows 0 - 0 (1 total, Query took 3.4441 sec)

#### Spherical Law of Cosines

SELECT DISTINCT

destination.zip_code,

3956 * ACOS(

SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +

COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *

COS((destination.lng - origin.lng) * 0.0174532925)

) AS distance

FROM

tblZipUSA origin,

tblZipUSA destination

WHERE

origin.zip_code = '27712'

ORDER BY

distance, zip_code

LIMIT 10, 1

zip_code | distance |
---|---|

27711 | 6.61586916464669 |

Showing rows 0 - 0 (1 total, Query took 3.1833 sec)

### 10th Closest Without PO Box Zip Codes

If you have a stiuation where you will be working only with street addresses and can eliminate the zip codes, there is some performance improvement. In the ZipUSA tables, this is accomplished by placing fac_cd and zip_class in the WHERE clause. However, I soon found that 1) some of our doctors are using P.O. Boxes, and 2) some small towns only have post offices with PO Boxes. I ended up not using that further past this experiment.

SELECT DISTINCT

destination.zip_code,

(3956 * (2 * ASIN(SQRT(

POWER(SIN(((origin.lat-destination.lat)*0.017453293)/2),2) +

COS(origin.lat*0.017453293) *

COS(destination.lat*0.017453293) *

POWER(SIN(((origin.lng-destination.lng)*0.017453293)/2),2)

)))) distance

FROM

tblZipUSA origin,

tblZipUSA destination

WHERE

origin.zip_code='27712'

AND destination.fac_cd = 'P'

AND destination.zip_class != 'P'

ORDER BY distance

LIMIT 10,1

Zip Code | Distance |
---|---|

27703 | 9.58287927196898 |

Showing rows 0 - 0 (1 total, Query took 2.4226 sec)

### Nearest 10

Now that we can find the 10th closest zip code, we can join the two types of queries together to find all the zip codes up to the closest 10th. As a pleasant side effect, if there is a tie for 10th place, all of those zip codes are included as well. This becomes particularly obvious in the case where several P.O. Box zip codes are served from the same post office, and thus have the same geocoding. Of course, it is possible that there will be two or more physically distinct zip codes that will be equidistant from a given reference zip code, but I can only imagine that would be a very rare case.

#### Haversine Formula

SELECT DISTINCT

destination.zip_code,

3956 * 2 * ASIN(SQRT(

POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +

COS(origin.lat * 0.0174532925) *

COS(destination.lat * 0.0174532925) *

POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)

)) distance

FROM

tblZipUSA origin,

tblZipUSA destination,

(

SELECT DISTINCT

destination1.zip_code,

3956 * 2 * ASIN(SQRT(

POWER(SIN((origin1.lat - destination1.lat) * 0.0174532925 / 2), 2) +

COS(origin1.lat * 0.0174532925) *

COS(destination1.lat * 0.0174532925) *

POWER(SIN((origin1.lng - destination1.lng) * 0.0174532925 / 2), 2)

)) distance

FROM

tblZipUSA origin1,

tblZipUSA destination1

WHERE

origin1.zip_code = '27712'

ORDER BY

distance, zip_code

LIMIT 10, 1

) zipDistance

WHERE

origin.zip_code = '27712'

AND

3956 * 2 * ASIN(SQRT(

POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +

COS(origin.lat * 0.0174532925) *

COS(destination.lat * 0.0174532925) *

POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)

)) <= zipDistance.distance

ORDER BY

distance, zip_code

zip_code | distance |
---|---|

27712 | 0 |

27704 | 4.75326962990057 |

27705 | 5.34287681627437 |

27708 | 5.57383981806715 |

27503 | 5.58564401941337 |

27701 | 6.41245709075393 |

27706 | 6.58029004252293 |

27709 | 6.59449266010451 |

27702 | 6.61586916445661 |

27710 | 6.61586916445661 |

27711 | 6.61586916445661 |

27715 | 6.61586916445661 |

27717 | 6.61586916445661 |

27722 | 6.61586916445661 |

Showing rows 0 - 13 (14 total, Query took 5.6745 sec)

#### Spherical Law of Cosines

SELECT DISTINCT

destination.zip_code,

3956 * ACOS(

SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +

COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *

COS((destination.lng - origin.lng) * 0.0174532925)

) AS distance

FROM

tblZipUSA origin,

tblZipUSA destination,

(

SELECT DISTINCT

destination1.zip_code,

3956 * ACOS(

SIN(origin1.lat * 0.0174532925) * SIN(destination1.lat * 0.0174532925) +

COS(origin1.lat * 0.0174532925) * COS(destination1.lat * 0.0174532925) *

COS((destination1.lng - origin1.lng) * 0.0174532925)

) AS distance

FROM

tblZipUSA origin1,

tblZipUSA destination1

WHERE

origin1.zip_code = '27712'

ORDER BY

distance, zip_code

LIMIT 10, 1

) zipDistance

WHERE

origin.zip_code = '27712'

AND

3956 * ACOS(

SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +

COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *

COS((destination.lng - origin.lng) * 0.0174532925)

) <= zipDistance.distance

ORDER BY

distance, zip_code

zip_code | distance |
---|---|

27712 | 0 |

27704 | 4.75326963012033 |

27705 | 5.34287681651191 |

27708 | 5.57383981817397 |

27503 | 5.58564401945863 |

27701 | 6.41245709087189 |

27706 | 6.58029004273494 |

27709 | 6.59449266019943 |

27702 | 6.61586916464669 |

27710 | 6.61586916464669 |

27711 | 6.61586916464669 |

27715 | 6.61586916464669 |

27717 | 6.61586916464669 |

27722 | 6.61586916464669 |

Showing rows 0 - 13 (14 total, Query took 5.1353 sec)

So there you have it. In the second part of this series I will show how I use this same kind of logic to find the 10 closest locations from a database of addresses. Until then, enjoy!