Wednesday, April 04, 2007

Zip Code Distances

I am currently involved in a project where the program I am writing is supposed to return a list of the 10 doctors nearest to a visitor's zip code. Note that this is not the same as doctors in a given radius (say 10 miles), but it does use the same basic formula. This article will be the first of a two part series and lays the groundwork for calculating what is nearest to a zip code. In the second part I intend to show how I integrate this logic with a list of doctors with addresses.

Below you will find the results of my experimental queries that compare the classic radius methodology with my own "10 nearest" methodology. My production program uses a UDB (DB2) database, and I thought of the performance contrasts between UDB and MySQL was interesting, so I am going to show all the comparisons below. Finally, I wanted to show what differences there were between the more accurate "Haversine Formula" as compared to the less intensive "Spherical Law of Cosines" when calculating the distance between two points on a sphere. These examples should give you plenty of real life code you can use in your own programs.

The company I work for subscribes to the zip code database (referred to here as ZipUSA) from http://www.zipdatafiles.com/data/ The data comes from the United States Postal Service and other sources, but I have occasionally found strange geocoding results. I will probably write a different article about that using a different application that I have written.

Within 10 miles

Haversine Formula

The following finds the zip codes within 10 miles of 27712 using the haversine (great circle) equation. The field names are based on the ZipUSA tables.

SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'
AND
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) < 10

ORDER BY
distance, zip_code

The ZipUSA database was set up with the same indexes in MySQL and UDB (DB2). Queries in MySQL were run in the phpAdmin tool. The UDB/DB2 queries were run in Quest.


MySQL















































































zip_code distance
27712 0
27704 4.75326962990057
27705 5.34287681627437
27708 5.57383981806715
27503 5.58564401941337
27701 6.41245709075393
27706 6.58029004252293
27709 6.59449266010451
27702 6.61586916445661
27710 6.61586916445661
27711 6.61586916445661
27715 6.61586916445661
27717 6.61586916445661
27722 6.61586916445661
27707 8.57121741472298
27703 9.5828790270889
27278 9.58552637370736
27564 9.72570727493623
27509 9.80777888765073

Showing rows 0 - 18 (19 total, Query took 2.1139 sec)



DB2/UDB



















































































zip_code distance
27712 0
27704 4.75326962990023
27705 5.34287681627435
27708 5.57383981806714
27503 5.58564401941335
27701 6.41245709075397
27706 6.58029004252301
27709 6.5944926601046
27702 6.61586916445641
27710 6.61586916445641
27711 6.61586916445641
27715 6.61586916445641
27717 6.61586916445641
27722 6.61586916445641
27707 8.571217414723011
27703 9.582879027088699
27278 9.58552637370768
27564 9.725707274936291
27509 9.80777888765007

19 rows selected in 35.77 secs.


I was very shocked to have such a slow run time in UDB! I was expecting better performance from a corporate-level product. I also went back and checked all the indexes to verify they were on the same fields as were in my MySQL database. I do not have enough knowledge about UDB to explain why this disparity exists or how one might optimize the query to run better on this platform. I would not think one should have to do that.



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'
AND
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) < 10

ORDER BY
distance, zip_code


MySQL



















































































zip_code distance
27712 0
27704 4.75326963012033
27705 5.34287681651191
27708 5.57383981817397
27503 5.58564401945863
27701 6.41245709087189
27706 6.58029004273494
27709 6.59449266019943
27702 6.61586916464669
27710 6.61586916464669
27711 6.61586916464669
27715 6.61586916464669
27717 6.61586916464669
27722 6.61586916464669
27707 8.57121741478871
27703 9.58287902708428
27278 9.58552637385399
27564 9.72570727482013
27509 9.80777888768795

Showing rows 0 - 18 (19 total, Query took 1.9229 sec)



DB2/UDB



















































































zip_code distance
27712 0
27704 4.75326963012037
27705 5.3428768165119
27708 5.57383981817396
27503 5.58564401945862
27701 6.41245709087191
27706 6.58029004247087
27709 6.59449266019946
27702 6.61586916438409
27710 6.61586916438409
27711 6.61586916438409
27715 6.61586916438409
27717 6.61586916438409
27722 6.61586916438409
27707 8.571217414991439
27703 9.58287902726561
27278 9.58552637385398
27564 9.72570727482014
27509 9.807778887510789

19 rows selected in 11.05 secs.



It is obvious that using the Spherical Law of Cosines equation is much faster than the Haversine Formula. MySQL shows a 5% improvement in speed and UDB shows a 53% performance improvement. It is also obvious that the distances calculated are very close. For these mile calculations, the first seven decimal places are the same. In most applications I have seen, the miles are rounded off to two decimal places, meaning that there is no reason to use the Haversine formula for distances as far apart as most "centers of zip codes" are. Haversine calculations are more suited to much closer calculations. Also, since zip code to zip code distances are only used for approximate distances, it would not make sense to nit-pick over the only-so-slightly more accurate results the Haversine formula gives. And even beyond that, accuracy depends a lot on the latitude and longitude data supplied. I have found that different sources have different geocoding information concerning the geographic center of zip codes.


10th closest




The next part of this experiment was to find the 10 nearest zip codes to the one given. The first step is finding out which zip ranks number 10. In this case it is simply a matter of choosing the 10th zip code from the list.


Haversine Formula



SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1










Zip Code Distance
27711 6.61586916445661

Showing rows 0 - 0 (1 total, Query took 3.4441 sec)



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1










zip_code distance
27711 6.61586916464669

Showing rows 0 - 0 (1 total, Query took 3.1833 sec)



10th Closest Without PO Box Zip Codes


If you have a stiuation where you will be working only with street addresses and can eliminate the zip codes, there is some performance improvement. In the ZipUSA tables, this is accomplished by placing fac_cd and zip_class in the WHERE clause. However, I soon found that 1) some of our doctors are using P.O. Boxes, and 2) some small towns only have post offices with PO Boxes. I ended up not using that further past this experiment.



SELECT DISTINCT
destination.zip_code,
(3956 * (2 * ASIN(SQRT(
POWER(SIN(((origin.lat-destination.lat)*0.017453293)/2),2) +
COS(origin.lat*0.017453293) *
COS(destination.lat*0.017453293) *
POWER(SIN(((origin.lng-destination.lng)*0.017453293)/2),2)
)))) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code='27712'
AND destination.fac_cd = 'P'
AND destination.zip_class != 'P'

ORDER BY distance
LIMIT 10,1










Zip Code Distance
27703 9.58287927196898

Showing rows 0 - 0 (1 total, Query took 2.4226 sec)



Nearest 10



Now that we can find the 10th closest zip code, we can join the two types of queries together to find all the zip codes up to the closest 10th. As a pleasant side effect, if there is a tie for 10th place, all of those zip codes are included as well. This becomes particularly obvious in the case where several P.O. Box zip codes are served from the same post office, and thus have the same geocoding. Of course, it is possible that there will be two or more physically distinct zip codes that will be equidistant from a given reference zip code, but I can only imagine that would be a very rare case.



Haversine Formula



SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination,
(
SELECT DISTINCT
destination1.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin1.lat - destination1.lat) * 0.0174532925 / 2), 2) +
COS(origin1.lat * 0.0174532925) *
COS(destination1.lat * 0.0174532925) *
POWER(SIN((origin1.lng - destination1.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin1,
tblZipUSA destination1

WHERE
origin1.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1
) zipDistance

WHERE
origin.zip_code = '27712'
AND
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) <= zipDistance.distance

ORDER BY
distance, zip_code






























































zip_code distance
27712 0
27704 4.75326962990057
27705 5.34287681627437
27708 5.57383981806715
27503 5.58564401941337
27701 6.41245709075393
27706 6.58029004252293
27709 6.59449266010451
27702 6.61586916445661
27710 6.61586916445661
27711 6.61586916445661
27715 6.61586916445661
27717 6.61586916445661
27722 6.61586916445661

Showing rows 0 - 13 (14 total, Query took 5.6745 sec)



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination,
(
SELECT DISTINCT
destination1.zip_code,
3956 * ACOS(
SIN(origin1.lat * 0.0174532925) * SIN(destination1.lat * 0.0174532925) +
COS(origin1.lat * 0.0174532925) * COS(destination1.lat * 0.0174532925) *
COS((destination1.lng - origin1.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin1,
tblZipUSA destination1

WHERE
origin1.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1
) zipDistance
WHERE
origin.zip_code = '27712'
AND
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) <= zipDistance.distance

ORDER BY
distance, zip_code






























































zip_code distance
27712 0
27704 4.75326963012033
27705 5.34287681651191
27708 5.57383981817397
27503 5.58564401945863
27701 6.41245709087189
27706 6.58029004273494
27709 6.59449266019943
27702 6.61586916464669
27710 6.61586916464669
27711 6.61586916464669
27715 6.61586916464669
27717 6.61586916464669
27722 6.61586916464669

Showing rows 0 - 13 (14 total, Query took 5.1353 sec)



So there you have it. In the second part of this series I will show how I use this same kind of logic to find the 10 closest locations from a database of addresses. Until then, enjoy!

Sunday, February 11, 2007

On the Use of Captchas

In an earlier post, PHP Form Signature I developed a bit of code to render form attacking bots harmless. On my own web sites I did expand the code to create a random hidden form field with the signature information, and on all the web sites I have implemented the signature methodology the obviously scripted form attacks were completely thwarted. At the same time, there is no usability impact since the mechanism works behind the scenes, completely hidden from the site visitor.

However, as I predicted, one of my forms was spammed by a bot that simply harvested the form signature and reposted it along with its spam payload. Looking at the contents of the spam I immediately recognized that it was coded to work in a blog. This was not a targeted attack by someone who ran across a form on the site and developed a script to try to exploit it. Instead, this was a bot that was programmed to roam the web looking for any form that has a text area in it, and post there hoping that it will show up as a comment in someone's blog. This kind of attacker doesn't care whether any particular form submission works or fails, knowing only that there are enough unprotected blogs and similar public forums that will instantly display anonymous posts.

This is exactly the kind of thing captchas were developed to prevent. However, as I have stated before, a captcha places a stumbling block in the way of innocent site visitors who truly do want to communicate. I have been on several sites that use a captcha on every form submission, which gets very annoying. The dilemma, then, is how one tells the difference between a robot and a human. After all, a robot can post HTTP header information to make it look like the POST is coming from an ordinary web browser.

The main difference, that I can tell, is that nearly none of my human correspondents place links in their form submissions, but every one of the spam posts I have recorded contain one or more links. So, I have added to my form validation sequence a check for a link reference in posts. If it detects one, it re-presents the form with a captcha and prompts the visitor to fill in the letters they see in the image. A robot will never see that page, of course, but neither will their form POST be completely processed. The human visitor, who has likely seen captchas before, may still be annoyed with this interruption, but are much more likely to fill in the field as requested in order to complete their correspondence. In this way, I, as the web developer, protect my server and e-mail box, while providing a normal user experience to most of the visitors who use my forms.

For the purposes of this example, I am using Ed Eliot's Visual and Audio PHP CAPTCHA Generation Class. The particular example I am giving also involves a form that posts to its same address ($_SERVER['SCRIPT_NAME']). This means that the model, or processing code, first checks to see if the request is a GET (automatically resulting in the form being presented) or a POST (which initiates the validation cycle).

At the beginning of the model section of code, then, before any processing takes place, I have the following lines of code:


$useCaptcha = 1; // flag to indicate captcha can be displayed if needed
$requireCaptcha = 0; // flag to indicate that a captcha condition has been met
require_once('php-captcha.inc.php'); // Ed Elliot's captcha class


Before I run the validation code, I fill in the $captchaTriggers array with any string that should cause a captcha to be displayed. The following will capture all links and images pasted in the form fields. You can add other rules, of course.


$captchaTriggers[] = 'http';
$captchaTriggers[] = '<img';


As part of my validation, I loop through the expected field names (I will likely cover details of my validation procedure in a later post, but this should give you the general idea). I check all the text fields for any of the captcha triggers. If one is found, then it's position is added to the $requireCaptcha variable.


if($_POST[$thisFieldName] > ''){
    for ($j=0; $j< sizeof($captchaTriggers); $j++) {
        $requireCaptcha += strpos(strtolower(' ' . $_POST[$thisFieldName]), strtolower($captchaTriggers[$j]));
    }
...


At the end of the validation phase, I have the following code. If any of the captcha trigger strings was found in the above code, the value of $requireCaptcha will be greater than zero, which is one of the two conditions that need to be met ($useCaptcha being the second). If the conditions are met, the code then checks to see if the $_POST['captchaCode'] has been set, and if it validates against Elliot's captcha class. The first pass through, the captchaCode variable would not be present, of course, which then leads to the errorMessages array being set. On subsequent passes through, invalid captcha codes would continue to fail while a valid one will complete the processing (unless there are other validation errors, of course).


if($requireCaptcha > 0 && isset($useCaptcha)){
    if(!isset($_POST['captchaCode'])) {
        $_POST['captchaCode'] = '';
    }
    if (!PhpCaptcha::Validate($_POST['captchaCode'])) {
        $errorMessages[] = "Please enter the letters you see in the graphic.";
        $errorFields[] = 'captchaCode';
    }
}


All of my validation checks use the $errorMessages array, so my check on whether or not there were any validation errors is the determining factor on whether or not to display the form for corrections. In the form, I do another check for $requireCaptcha to determine whether or not to display the captcha image and field.


<? if($requireCaptcha > 0) { ?>
        <div class="formRow">
            <div class="formLabel"><label for="captchaCode"><img src="/assets/images/visual-captcha.php" width="100" height="40" alt="Visual CAPTCHA" /></label></div>
            <div class="formField"><input type="text" id="captchaCode" name="captchaCode" value="" style="width:10em;" maxlength="10" accesskey="c" tabindex="5" /></div>
        </div>
<? } ?>


Since I have implemented this code, I have not received any robot generated spam. One possible improvement is to set a cookie in the client browser when a captcha challenge is successfully answered. The cookie would essentially say "this user has already proven that they are human, so you do not need to challenge them in the future."

Hope this helps you!

Wednesday, January 31, 2007

Dynamic Copyright Dates

Copyright notices remind web site visitors that the content of a web page belongs to the web site owners and cannot be legally reproduced without permission (with some exceptions - search for other resources for more information). To fully protect the content of a site, most web site owners place the copyright notice in the footer of each page. It is normal to update the copyright year with the current year.

When most web sites are launched, they have only a few pages, and if they use includes, normally only have one footer file. It is quite easy to simply hard code the current year in the footer. However, as web sites grow larger, the maintenance can become more difficult if the copyright has been hard coded in every page or if more than one footer becomes necessary (for example if a web site uses multiple designs for different sections of the site or if different sections are written in different languages).

Rather than hard coding the year, it may be more efficient to dynamically generate the year in each footer. Below I have example snippets showing how this can be done in a few different languages. Feel free to add your own suggestions.

PHP



&copy; <? print date("Y"); ?>

ColdFusion



&copy; <cfoutput>#dateFormat(Now(), "yyyy")#</cfoutput>

JSP



<%@ page language="java" contentType="text/html" session="true" %>
<%@ taglib prefix="fmt" uri="http://java.sun.com/jsp/jstl/fmt" %>
<jsp:useBean id="now" class="java.util.Date" />
...
&copy; <fmt:formatDate value="${now}" pattern="yyyy" />

JavaScript


The following code can be used in static HTML pages or in place of any of the above. It requires no server side work, but it does mean that the browsing client must have JavaScript enabled.

&copy; <script type="text/javascript">
d=new Date();document.write(d.getFullYear());
</script>