Friday, March 06, 2009

ELIZA, the Computer Therapist Facebook application

I have converted the classic (read: old) ELIZA application to run on Facebook. ELIZA is a Rogerian therapist parody originally written by Joseph Weizenbaum. This application is for entertainment purposes only. ELIZA is not human and is not a real therapist. If you have a session with ELIZA that you think was amusing or thought-provoking, you can publish it on your profile.

Please give it a try at:
http://apps.facebook.com/eliza-therapist/

I would appreciate your feedback.

This is my first Facebook application. I have ideas for two more at the moment, but they are going to take more work than this one... I wanted to start with something simple :-).

If you are not on Facebook, I have also made an online version at ThreeLeaf.com:
http://www.threeleaf.com/freelance-work/demos/eliza/

Enjoy!
John


Thursday, October 02, 2008

Gateway Extended Warranty... Definitely a Waste of Money… Maybe the Computer Too

This falls into the category of "you should have known better." My first experience with a Gateway computer was as a graduate student at the University of Wyoming. The first thing I heard about these computers was that the motherboard had a design flaw, which resulted in the CPU occasionally popping out of its socket. Someone would then have to open the case and reseat it whenever this happened. I worked with a lot of Gateway computers as a PC technician at Research Triangle Institute. During my three-year tenure in that position, RTI signed a purchase agreement with Gateway, and I received, installed, and repaired approximately 200 Gateway computers. I do not have an exact count, but I am pretty sure nearly every one of those units needed repair at some point during that short period of time. The worst case was a batch of 26 computers whose hard drives all gave out within 4 months of being received.

Many years later, in 2005, I was looking at purchasing a new computer. I included Gateway in my list of options despite my past experiences with them, hoping that they had finally learned to build reliable computers despite their long history of poor quality. When it came to decision time, Gateway was one of my final two contenders. I ended up going with Gateway because they offered more features for a lower price (not inexpensive... this was one of their top-of-the-line PCs). I purposely purchased the four-year extended warranty, hedging my bet so that if I did have problems, the PC would be covered. The customer service while I was at RTI was typically good: they would often ship replacement parts to me within 2-3 business days, and after I had replaced them, I could send the old parts back. I am not naïve enough to think that "regular" customers would get the same level of support as corporate customers, but I was really unprepared for what my situation would be.

In June of this year (2008), the PC suddenly stopped working. It would not boot. It would not even POST. It was dead as a doornail with no apparent cause. After running through the online support tests, I eventually determined that it had to be a motherboard problem. Not a problem, I thought, I will put in a service request and get this fixed up. My first issue became what the four-year warranty actually covered. I was certain that I had purchased on-site service. My first several correspondents ranged from "your warranty expired" to "it only covers parts" to "even if you had on-site service, they would only replace the part, service representatives do not visit to diagnose the problem." My online receipt has a cryptic entry for the four year warranty, so it is quite possible that it does only cover replacement. The service people I corresponded with said I should check my receipt or order a replacement receipt. Unfortunately, even though I have all my manuals, I could not find my receipt for some reason. I did order a replacement, which involves both an online order form and an e-mail confirmation, but I did not receive a receipt, though it does not appear that they charged my card either. Good grief! It is hard for me to understand why a simple process that has a manual customer service follow-up can still fail.

Ok, so by this time I have been without my PC for about 3 weeks bickering with Gateway over the warranty and receipt. I decide it is not worth the hassle anymore. I purchased their $67 pre-paid shipping box (miraculously, that order went through) and sent my computer off. On August 8, I got a confirmation e-mail that Gateway received my PC, and I should expect a 3-5 day turnaround.

So I wait.

And wait.

Finally, after two weeks, I send in an e-mail asking for a status report. I get a message back indicating that they determined it was indeed a motherboard problem, and they did not have the part in stock, but it had been ordered. The computer should be ready in 3-5 more days.

So I wait.

After a week with no communication, I write again. Again, I am told that the repair is taking longer than expected, but it should be shipped in a couple of days.

So I wait.

After another week, I am pretty unhappy that I have no computer and have had no communication. I send a terser e-mail asking for an update and some kind of immediate solution, like an equivalent replacement. My request is answered with a profuse apology about the delay, and an assurance that they are doing everything they can. However, that obviously does not include a replacement or a loaner.

So I wait.

Again, after another week I send another exasperated e-mail in, requesting action, and a phone number where I can call someone. I received a reply simply stating that they were working on it. No phone number, of course.

So I wait.

Finally! A communication arrived that I did not have to initiate! They said they had finished the repair on my laptop, but needed my permission to wipe the hard drive to reinstall the operating system. I replied that I had sent in a PC, not a laptop, but if they were talking about the PC then yes, of course they could reinstall the operating system (I had actually sent it in with a blank drive, because I cannot lose what is on the hard drive). About three days later I get an e-mail stating that they were finished and were shipping the PC back to me. I should receive it in 3-5 days, unless I had purchased the gold plan, in which case I would receive it in one day (I do not think I had the gold plan, but I could wait 5 days if needed, and in my opinion it was pointless to include that note since it would have been too late to purchase it anyway). Hurray! … Or so I thought.

So I wait.

After another week goes by I send in a request for the tracking number, since I had not yet received the PC. They responded saying that the package had been received and included the tracking number. I go to the FedEx site and find that the destination was Temple, TX (I live in NC). So, I wrote back and said that they had shipped it to the wrong address, and that they needed to either recover the PC or send me an equivalent. I get a response back that the PC needed more repairs and was shipped to Temple, TX, where they contract out repairs. Actually, they said it was still "in transit," indicating to me that they really had no idea where the PC was.

So I wait.

It has now been another week since my last e-mail, and I am about to e-mail them again for a status update. I have no idea if Gateway really knows where my PC is, what state of repair it is in, nor do I know if I will ever get my PC or a replacement.

I have reminded them in several of my e-mails that I paid a lot for the PC and more for the extended warranty, but except for the one apologetic message, I have received no sympathy and no action to help resolve the problems. This level of service is completely abominable, and unacceptable. It certainly seems to me that if you purchase a four-year warranty that there is a built-in expectation that the seller would have some plan to deal with any and all repairs during that timeframe, even if it is just to send an equivalent computer back (I suspect an equivalent replacement today costs less than the price I paid for the warranty).

I am posting this blog entry so that my friends, family, and readers will reconsider any thoughts they might have of purchasing Gateway computers. As I stated at the beginning, I should have known better, but I will never make that mistake again, and I hope my readers will not either.


Friday, January 25, 2008

The onContextMenu Event Handler

onContextMenu event does not fire on disabled fields


The onContextMenu event (right-click in Windows) is a proprietary handler in HTML, but is very useful for displaying custom context menus on web pages. For this particular article, I was investigating an application where the business owner wanted a context menu available on disabled fields. The problem specifically occurs because all click events are apparently ignored on disabled fields. Note that Opera apparently ignores the onContextMenu event completely, because it is not a W3C compliant attribute.


Regardless of the W3C non-compliance, the business still wanted this behavior (as far as we know everyone is using IE or FF). Try the following in your browser to see how it behaves.

<input id="input1" type="text" value="input1" oncontextmenu="alert('input1');return false;" />

Attempt 1: Nest it in a div


The first thing we tried was to wrap the input fields in a div element that had an onContextMenu attribute defined. We found that while this solved the problem in IE7, FF did not "bubble up" the event through the disabled input field. However, any text inside the div element did respond to the onContextMenu event. Try the following in your browser to see how it behaves.

<div oncontextmenu="alert('input3');return false;">Inside div, outside input3 field
<input id="input3" type="text" value="input3" />
</div>

Solution 1: Convert from disabled to read-only


After a little more research, I found another attribute named "readonly" that I had not used before. I tried it, and found that the onContextMenu event worked just fine in both IE7 and FF.

<input id="input5" type="text" value="input5" readonly="readonly" oncontextmenu="alert('input5');return false;" />

Solution 2: Make an external, clickable object


I definitely prefer W3C compliant code, however, so a compliant solution would have to involve an external, clickable element to trigger the event. The » symbol below has an onClick event, which works in IE, FF, and Opera.

<input id="input7" type="text" value="input7" disabled="disabled" /> <span onclick="alert('input7');return false;" style="cursor:pointer;">&raquo;</span>

There are some accessibility issues with this approach, of course. One might want to use a link instead of a span to underline the clickable element or choose a more obvious graphical symbol. There are other inherent accessibility issues with the entire context menu approach, but I will not go into those here.

Wednesday, April 04, 2007

Zip Code Distances

I am currently involved in a project where the program I am writing is supposed to return a list of the 10 doctors nearest to a visitor's zip code. Note that this is not the same as doctors in a given radius (say 10 miles), but it does use the same basic formula. This article will be the first of a two part series and lays the groundwork for calculating what is nearest to a zip code. In the second part I intend to show how I integrate this logic with a list of doctors with addresses.

Below you will find the results of my experimental queries that compare the classic radius methodology with my own "10 nearest" methodology. My production program uses a UDB (DB2) database, and I thought the performance contrast between UDB and MySQL was interesting, so I am going to show all the comparisons below. Finally, I wanted to show what differences there were between the more accurate "Haversine Formula" and the less intensive "Spherical Law of Cosines" when calculating the distance between two points on a sphere. These examples should give you plenty of real-life code you can use in your own programs.

The company I work for subscribes to the zip code database (referred to here as ZipUSA) from http://www.zipdatafiles.com/data/. The data comes from the United States Postal Service and other sources, but I have occasionally found strange geocoding results. I will probably write a different article about that using a different application that I have written.

Within 10 miles

Haversine Formula

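The following finds the zip codes within 10 miles of 27712 using the haversine (great circle) equation. The field names are based on the ZipUSA tables. In the query, 3956 is the approximate radius of the Earth in miles, and 0.0174532925 (π/180) converts degrees to radians.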

SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'
AND
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) < 10

ORDER BY
distance, zip_code
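For readers who want to check the arithmetic outside the database, here is a minimal Python sketch of the same haversine expression. The function name `haversine_miles` and its parameters are placeholders chosen for illustration, not part of the ZipUSA schema. The 3956-mile radius matches the constant in the query, and `math.radians` stands in for multiplying by 0.0174532925.

```python
import math

EARTH_RADIUS_MILES = 3956  # same radius constant the SQL query uses


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two points given in degrees."""
    lat1_r = math.radians(lat1)
    lat2_r = math.radians(lat2)
    d_lat = math.radians(lat1 - lat2)
    d_lng = math.radians(lng1 - lng2)
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(lat1_r) * math.cos(lat2_r) * math.sin(d_lng / 2) ** 2)
    # Guard against rounding pushing the square root slightly above 1.
    return EARTH_RADIUS_MILES * 2 * math.asin(min(1.0, math.sqrt(a)))
```

One degree of latitude along a meridian should come out to the radius times π/180, roughly 69 miles, which is a quick sanity check on the constants.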

The ZipUSA database was set up with the same indexes in MySQL and UDB (DB2). Queries in MySQL were run in the phpMyAdmin tool. The UDB/DB2 queries were run in Quest.


MySQL

zip_code distance
27712 0
27704 4.75326962990057
27705 5.34287681627437
27708 5.57383981806715
27503 5.58564401941337
27701 6.41245709075393
27706 6.58029004252293
27709 6.59449266010451
27702 6.61586916445661
27710 6.61586916445661
27711 6.61586916445661
27715 6.61586916445661
27717 6.61586916445661
27722 6.61586916445661
27707 8.57121741472298
27703 9.5828790270889
27278 9.58552637370736
27564 9.72570727493623
27509 9.80777888765073

Showing rows 0 - 18 (19 total, Query took 2.1139 sec)



DB2/UDB

zip_code distance
27712 0
27704 4.75326962990023
27705 5.34287681627435
27708 5.57383981806714
27503 5.58564401941335
27701 6.41245709075397
27706 6.58029004252301
27709 6.5944926601046
27702 6.61586916445641
27710 6.61586916445641
27711 6.61586916445641
27715 6.61586916445641
27717 6.61586916445641
27722 6.61586916445641
27707 8.571217414723011
27703 9.582879027088699
27278 9.58552637370768
27564 9.725707274936291
27509 9.80777888765007

19 rows selected in 35.77 secs.


I was very shocked to have such a slow run time in UDB! I was expecting better performance from a corporate-level product. I also went back and checked all the indexes to verify they were on the same fields as were in my MySQL database. I do not have enough knowledge about UDB to explain why this disparity exists or how one might optimize the query to run better on this platform. I would not think one should have to do that.



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'
AND
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) < 10

ORDER BY
distance, zip_code


MySQL

zip_code distance
27712 0
27704 4.75326963012033
27705 5.34287681651191
27708 5.57383981817397
27503 5.58564401945863
27701 6.41245709087189
27706 6.58029004273494
27709 6.59449266019943
27702 6.61586916464669
27710 6.61586916464669
27711 6.61586916464669
27715 6.61586916464669
27717 6.61586916464669
27722 6.61586916464669
27707 8.57121741478871
27703 9.58287902708428
27278 9.58552637385399
27564 9.72570727482013
27509 9.80777888768795

Showing rows 0 - 18 (19 total, Query took 1.9229 sec)



DB2/UDB

zip_code distance
27712 0
27704 4.75326963012037
27705 5.3428768165119
27708 5.57383981817396
27503 5.58564401945862
27701 6.41245709087191
27706 6.58029004247087
27709 6.59449266019946
27702 6.61586916438409
27710 6.61586916438409
27711 6.61586916438409
27715 6.61586916438409
27717 6.61586916438409
27722 6.61586916438409
27707 8.571217414991439
27703 9.58287902726561
27278 9.58552637385398
27564 9.72570727482014
27509 9.807778887510789

19 rows selected in 11.05 secs.



For the runs shown above, the Spherical Law of Cosines equation is faster than the Haversine Formula on both platforms: the MySQL query time drops from 2.1139 to 1.9229 seconds (about 9%), and the UDB query time drops from 35.77 to 11.05 seconds (about 69%). The distances calculated are also very close. For these mile calculations, the first seven decimal places are the same. In most applications I have seen, the miles are rounded off to two decimal places, meaning that there is no reason to use the Haversine formula for points as far apart as most "centers of zip codes" are. The Haversine formula is better suited to points that are much closer together. Also, since zip code to zip code distances are only used for approximate distances, it would not make sense to nit-pick over the ever-so-slightly more accurate results the Haversine formula gives. And even beyond that, accuracy depends a lot on the latitude and longitude data supplied. I have found that different sources have different geocoding information concerning the geographic center of zip codes.
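As a way to check the arithmetic outside the database, the Spherical Law of Cosines expression from the queries above can be sketched in Python. The function name `law_of_cosines_miles` is a placeholder chosen for illustration, and the clamping step is an addition that the SQL does not have: it keeps the value passed to `acos` inside its valid range when rounding nudges it past 1.

```python
import math

EARTH_RADIUS_MILES = 3956  # same radius constant the SQL queries use


def law_of_cosines_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two points given in degrees."""
    lat1_r = math.radians(lat1)
    lat2_r = math.radians(lat2)
    d_lng = math.radians(lng2 - lng1)
    cos_angle = (math.sin(lat1_r) * math.sin(lat2_r)
                 + math.cos(lat1_r) * math.cos(lat2_r) * math.cos(d_lng))
    # Clamp to [-1, 1] so rounding error cannot take acos out of its domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_MILES * math.acos(cos_angle)
```

For two identical points the rounded cosine can land a hair below 1, so the result may be a tiny positive number rather than exactly 0. That sensitivity at very small separations is the usual reason given for preferring the haversine form when points are close together.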


10th closest




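The next part of this experiment was to find the 10 nearest zip codes to the one given. The first step is finding out which zip code ranks number 10, which is simply a matter of picking one row from the sorted list. Note that MySQL's LIMIT 10, 1 skips the first ten rows and returns the eleventh; because the origin zip code always comes first at distance 0, that row is its 10th nearest neighbor.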


Haversine Formula



SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1

Zip Code Distance
27711 6.61586916445661

Showing rows 0 - 0 (1 total, Query took 3.4441 sec)



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1

zip_code distance
27711 6.61586916464669

Showing rows 0 - 0 (1 total, Query took 3.1833 sec)



10th Closest Without PO Box Zip Codes


If you have a situation where you will be working only with street addresses and can eliminate the P.O. Box zip codes, there is some performance improvement. In the ZipUSA tables, this is accomplished by placing fac_cd and zip_class in the WHERE clause. However, I soon found that 1) some of our doctors are using P.O. Boxes, and 2) some small towns only have post offices with P.O. Boxes. I ended up not pursuing this approach beyond this experiment.



SELECT DISTINCT
destination.zip_code,
(3956 * (2 * ASIN(SQRT(
POWER(SIN(((origin.lat-destination.lat)*0.017453293)/2),2) +
COS(origin.lat*0.017453293) *
COS(destination.lat*0.017453293) *
POWER(SIN(((origin.lng-destination.lng)*0.017453293)/2),2)
)))) distance

FROM
tblZipUSA origin,
tblZipUSA destination

WHERE
origin.zip_code='27712'
AND destination.fac_cd = 'P'
AND destination.zip_class != 'P'

ORDER BY distance
LIMIT 10,1

Zip Code Distance
27703 9.58287927196898

Showing rows 0 - 0 (1 total, Query took 2.4226 sec)



Nearest 10



Now that we can find the 10th closest zip code, we can join the two types of queries together to find all the zip codes up to and including the 10th closest. As a pleasant side effect, if there is a tie for 10th place, all of those zip codes are included as well. This becomes particularly obvious in the case where several P.O. Box zip codes are served from the same post office, and thus have the same geocoding. Of course, it is possible that there will be two or more physically distinct zip codes that will be equidistant from a given reference zip code, but I can only imagine that would be a very rare case.
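The "cutoff distance, then everything at or under it" idea can be sketched in Python on an in-memory list. The function name `nearest_with_ties` is a placeholder chosen for illustration. The sample rows are the within-10-miles results shown earlier, rounded to two decimals, and the sketch assumes (as in the queries) that the origin zip code is itself in the list at distance 0.

```python
# Rows from the "within 10 miles" results above, rounded to two decimals.
SAMPLE_ROWS = [
    ("27712", 0.00), ("27704", 4.75), ("27705", 5.34), ("27708", 5.57),
    ("27503", 5.59), ("27701", 6.41), ("27706", 6.58), ("27709", 6.59),
    ("27702", 6.62), ("27710", 6.62), ("27711", 6.62), ("27715", 6.62),
    ("27717", 6.62), ("27722", 6.62), ("27707", 8.57), ("27703", 9.58),
    ("27278", 9.59), ("27564", 9.73), ("27509", 9.81),
]


def nearest_with_ties(rows, k=10):
    """Return every (zip_code, miles) row up to the k-th neighbor, ties included."""
    ranked = sorted(rows, key=lambda row: (row[1], row[0]))
    if len(ranked) <= k:
        return ranked
    # Index k is the k-th neighbor because index 0 is the origin itself,
    # which mirrors skipping ten rows with LIMIT 10, 1.
    cutoff = ranked[k][1]
    return [row for row in ranked if row[1] <= cutoff]


nearest = nearest_with_ties(SAMPLE_ROWS)
print(len(nearest))  # 14: the origin, nine closer neighbors, and a tie at 6.62 miles
```

The six zip codes tied at 6.62 miles all make the cut, which matches the 14 rows the combined queries return.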



Haversine Formula



SELECT DISTINCT
destination.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin,
tblZipUSA destination,
(
SELECT DISTINCT
destination1.zip_code,
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin1.lat - destination1.lat) * 0.0174532925 / 2), 2) +
COS(origin1.lat * 0.0174532925) *
COS(destination1.lat * 0.0174532925) *
POWER(SIN((origin1.lng - destination1.lng) * 0.0174532925 / 2), 2)
)) distance

FROM
tblZipUSA origin1,
tblZipUSA destination1

WHERE
origin1.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1
) zipDistance

WHERE
origin.zip_code = '27712'
AND
3956 * 2 * ASIN(SQRT(
POWER(SIN((origin.lat - destination.lat) * 0.0174532925 / 2), 2) +
COS(origin.lat * 0.0174532925) *
COS(destination.lat * 0.0174532925) *
POWER(SIN((origin.lng - destination.lng) * 0.0174532925 / 2), 2)
)) <= zipDistance.distance

ORDER BY
distance, zip_code

zip_code distance
27712 0
27704 4.75326962990057
27705 5.34287681627437
27708 5.57383981806715
27503 5.58564401941337
27701 6.41245709075393
27706 6.58029004252293
27709 6.59449266010451
27702 6.61586916445661
27710 6.61586916445661
27711 6.61586916445661
27715 6.61586916445661
27717 6.61586916445661
27722 6.61586916445661

Showing rows 0 - 13 (14 total, Query took 5.6745 sec)



Spherical Law of Cosines



SELECT DISTINCT
destination.zip_code,
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin,
tblZipUSA destination,
(
SELECT DISTINCT
destination1.zip_code,
3956 * ACOS(
SIN(origin1.lat * 0.0174532925) * SIN(destination1.lat * 0.0174532925) +
COS(origin1.lat * 0.0174532925) * COS(destination1.lat * 0.0174532925) *
COS((destination1.lng - origin1.lng) * 0.0174532925)
) AS distance

FROM
tblZipUSA origin1,
tblZipUSA destination1

WHERE
origin1.zip_code = '27712'

ORDER BY
distance, zip_code

LIMIT 10, 1
) zipDistance
WHERE
origin.zip_code = '27712'
AND
3956 * ACOS(
SIN(origin.lat * 0.0174532925) * SIN(destination.lat * 0.0174532925) +
COS(origin.lat * 0.0174532925) * COS(destination.lat * 0.0174532925) *
COS((destination.lng - origin.lng) * 0.0174532925)
) <= zipDistance.distance

ORDER BY
distance, zip_code

zip_code distance
27712 0
27704 4.75326963012033
27705 5.34287681651191
27708 5.57383981817397
27503 5.58564401945863
27701 6.41245709087189
27706 6.58029004273494
27709 6.59449266019943
27702 6.61586916464669
27710 6.61586916464669
27711 6.61586916464669
27715 6.61586916464669
27717 6.61586916464669
27722 6.61586916464669

Showing rows 0 - 13 (14 total, Query took 5.1353 sec)



So there you have it. In the second part of this series I will show how I use this same kind of logic to find the 10 closest locations from a database of addresses. Until then, enjoy!

Sunday, February 11, 2007

On the Use of Captchas

In an earlier post, PHP Form Signature, I developed a bit of code to render form-attacking bots harmless. On my own web sites I did expand the code to create a random hidden form field with the signature information, and on all the web sites where I have implemented the signature methodology, the obviously scripted form attacks were completely thwarted. At the same time, there is no usability impact since the mechanism works behind the scenes, completely hidden from the site visitor.

However, as I predicted, one of my forms was spammed by a bot that simply harvested the form signature and reposted it along with its spam payload. Looking at the contents of the spam I immediately recognized that it was coded to work in a blog. This was not a targeted attack by someone who ran across a form on the site and developed a script to try to exploit it. Instead, this was a bot that was programmed to roam the web looking for any form that has a text area in it, and post there hoping that it will show up as a comment in someone's blog. This kind of attacker doesn't care whether any particular form submission works or fails, knowing only that there are enough unprotected blogs and similar public forums that will instantly display anonymous posts.

This is exactly the kind of thing captchas were developed to prevent. However, as I have stated before, a captcha places a stumbling block in the way of innocent site visitors who truly do want to communicate. I have been on several sites that use a captcha on every form submission, which gets very annoying. The dilemma, then, is how one tells the difference between a robot and a human. After all, a robot can post HTTP header information to make it look like the POST is coming from an ordinary web browser.

The main difference, as far as I can tell, is that almost none of my human correspondents place links in their form submissions, but every one of the spam posts I have recorded contains one or more links. So, I have added to my form validation sequence a check for a link reference in posts. If it detects one, it re-presents the form with a captcha and prompts the visitor to fill in the letters they see in the image. A robot will never see that page, of course, but neither will its form POST be completely processed. The human visitor, who has likely seen captchas before, may still be annoyed with this interruption, but is much more likely to fill in the field as requested in order to complete their correspondence. In this way, I, as the web developer, protect my server and e-mail box, while providing a normal user experience to most of the visitors who use my forms.
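The link check described above boils down to a case-insensitive substring search across the submitted fields. Here is a minimal Python sketch of that decision, separate from the PHP used on the site. The function name `requires_captcha` and the field names in the usage note are hypothetical; the two trigger strings are the ones used later in this post.

```python
CAPTCHA_TRIGGERS = ("http", "<img")


def requires_captcha(fields, triggers=CAPTCHA_TRIGGERS):
    """Return True if any submitted text field contains a trigger string.

    `fields` maps field names to the submitted text values.
    """
    for value in fields.values():
        lowered = value.lower()
        if any(trigger.lower() in lowered for trigger in triggers):
            return True
    return False
```

A submission such as `{"message": "Cheap pills at HTTP://spam.example"}` would be sent back with a captcha, while an ordinary question with no link would go straight through.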

For the purposes of this example, I am using Ed Eliot's Visual and Audio PHP CAPTCHA Generation Class. The particular example I am giving also involves a form that posts to its same address ($_SERVER['SCRIPT_NAME']). This means that the model, or processing code, first checks to see if the request is a GET (automatically resulting in the form being presented) or a POST (which initiates the validation cycle).

At the beginning of the model section of code, then, before any processing takes place, I have the following lines of code:


$useCaptcha = 1; // flag to indicate captcha can be displayed if needed
$requireCaptcha = 0; // flag to indicate that a captcha condition has been met
require_once('php-captcha.inc.php'); // Ed Eliot's captcha class


Before I run the validation code, I fill in the $captchaTriggers array with any string that should cause a captcha to be displayed. The following will capture all links and images pasted in the form fields. You can add other rules, of course.


$captchaTriggers[] = 'http';
$captchaTriggers[] = '<img';


As part of my validation, I loop through the expected field names (I will likely cover details of my validation procedure in a later post, but this should give you the general idea). I check all the text fields for any of the captcha triggers. If one is found, then its position is added to the $requireCaptcha variable.


if($_POST[$thisFieldName] > ''){
    for ($j=0; $j< sizeof($captchaTriggers); $j++) {
        $requireCaptcha += strpos(strtolower(' ' . $_POST[$thisFieldName]), strtolower($captchaTriggers[$j]));
    }
...


At the end of the validation phase, I have the following code. If any of the captcha trigger strings was found in the above code, the value of $requireCaptcha will be greater than zero, which is one of the two conditions that need to be met ($useCaptcha being the second). If the conditions are met, the code then checks to see if the $_POST['captchaCode'] has been set, and if it validates against Eliot's captcha class. The first pass through, the captchaCode variable would not be present, of course, which then leads to the errorMessages array being set. On subsequent passes through, invalid captcha codes would continue to fail while a valid one will complete the processing (unless there are other validation errors, of course).


if($requireCaptcha > 0 && !empty($useCaptcha)){ // !empty() so that setting $useCaptcha = 0 really disables the captcha
    if(!isset($_POST['captchaCode'])) {
        $_POST['captchaCode'] = '';
    }
    if (!PhpCaptcha::Validate($_POST['captchaCode'])) {
        $errorMessages[] = "Please enter the letters you see in the graphic.";
        $errorFields[] = 'captchaCode';
    }
}


All of my validation checks use the $errorMessages array, so my check on whether or not there were any validation errors is the determining factor on whether or not to display the form for corrections. In the form, I do another check for $requireCaptcha to determine whether or not to display the captcha image and field.


<?php if($requireCaptcha > 0) { ?>
        <div class="formRow">
            <div class="formLabel"><label for="captchaCode"><img src="/assets/images/visual-captcha.php" width="100" height="40" alt="Visual CAPTCHA" /></label></div>
            <div class="formField"><input type="text" id="captchaCode" name="captchaCode" value="" style="width:10em;" maxlength="10" accesskey="c" tabindex="5" /></div>
        </div>
<?php } ?>


Since I implemented this code, I have not received any robot-generated spam. One possible improvement is to set a cookie in the client browser when a captcha challenge is successfully answered. The cookie would essentially say "this user has already proven that they are human, so you do not need to challenge them in the future."
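One way such a cookie could be made hard to forge is to sign its issue time with a server-side secret. The Python sketch below illustrates that idea only; it is not code from the site. The names `SECRET_KEY`, `make_human_cookie`, and `is_human_cookie_valid`, the `timestamp.signature` cookie format, and the 30-day lifetime are all assumptions made for the example.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-private-random-value"  # hypothetical server-side secret
MAX_AGE_SECONDS = 30 * 24 * 60 * 60  # assumed lifetime: 30 days


def _sign(issued):
    """HMAC-SHA256 signature of the issue time, as a hex string."""
    return hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()


def make_human_cookie(now=None):
    """Build the cookie value to set after a captcha is answered correctly."""
    issued = str(int(time.time() if now is None else now))
    return f"{issued}.{_sign(issued)}"


def is_human_cookie_valid(cookie, now=None):
    """Check the signature and age of a cookie value sent back by the browser."""
    now = time.time() if now is None else now
    issued, sep, mac = cookie.partition(".")
    if not sep:
        return False
    try:
        issued_at = int(issued)
    except ValueError:
        return False
    # Constant-time comparison of the received and expected signatures.
    if not hmac.compare_digest(mac.encode(), _sign(issued).encode()):
        return False
    return 0 <= now - issued_at <= MAX_AGE_SECONDS
```

With something like this in place, the captcha check could be skipped whenever the cookie validates. A bot could still replay a cookie it obtained once, so this only reduces how often humans are challenged.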

Hope this helps you!

Wednesday, January 31, 2007

Dynamic Copyright Dates

Copyright notices remind web site visitors that the content of a web page belongs to the web site owners and cannot be legally reproduced without permission (with some exceptions - search for other resources for more information). To fully protect the content of a site, most web site owners place the copyright notice in the footer of each page. It is normal to update the copyright year with the current year.

When most web sites are launched, they have only a few pages, and if they use includes, normally only have one footer file. It is quite easy to simply hard code the current year in the footer. However, as web sites grow larger, the maintenance can become more difficult if the copyright has been hard coded in every page or if more than one footer becomes necessary (for example if a web site uses multiple designs for different sections of the site or if different sections are written in different languages).

Rather than hard coding the year, it may be more efficient to dynamically generate the year in each footer. Below I have example snippets showing how this can be done in a few different languages. Feel free to add your own suggestions.

PHP



&copy; <?php echo date('Y'); ?>

ColdFusion



&copy; <cfoutput>#dateFormat(Now(), "yyyy")#</cfoutput>

JSP



<%@ page language="java" contentType="text/html" session="true" %>
<%@ taglib prefix="fmt" uri="http://java.sun.com/jsp/jstl/fmt" %>
<jsp:useBean id="now" class="java.util.Date" />
...
&copy; <fmt:formatDate value="${now}" pattern="yyyy" />

JavaScript


The following code can be used in static HTML pages or in place of any of the above. It requires no server-side work, but it does mean that the browsing client must have JavaScript enabled.

&copy; <script type="text/javascript">
// Write the current four-digit year without creating a global variable.
document.write(new Date().getFullYear());
</script>

Friday, December 29, 2006

PHP Form Signature

In my previous post, I began developing the concept of creating a checksum for a form to help prevent spammers from abusing forms on web sites. A malicious spammer will first attempt to inject code into a form in an effort to send e-mails through the web site's host. Those attempts can usually be foiled by properly validating the form fields. However, a spammer may still inject marketing materials into a form that appears to send an e-mail to someone, not really caring where that e-mail might wind up. If these e-mails are directed at a server administrator or the customer support staff, they can get very annoying, even if they are otherwise rendered harmless by the validation rules. The goal, then, is to try to ensure that a given form post is from a human being and not from a robot. Visual and verbal "captchas" can be used, but they tend to annoy the human visitor, which we don't want to do.

The idea of using a checksum and a form timeout is one way a web developer might attempt to prevent a spammer from writing bots against forms. I am expanding on the idea of the checksum and will now call it a signature. This example is in PHP, and will work as-is, but I certainly encourage you to make your own variations.

In my previous post, I developed an example with two hidden fields which were used to validate the form post. In this example I do two things: 1) combine the timestamp and the hash to create one "signature" field, and 2) encrypt the timestamp to make it less obvious. These measures should make it more difficult for a robot to spoof posts to your forms. Note that with any encryption and hash scheme, it is possible to break the code, but the objective here is to make it so arduous that the spammer would rather look elsewhere for targets than your forms. My opinion is that it would probably be easier for someone to hack the server than this form validation technique, so one must not overlook hardening the rest of the server, of course.

Since you might use this code in several places, consider making it an include:


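<?php
// Provides the encrypt() and decrypt() wrappers around http://rc4crypt.devhome.org/
include_once 'cryptography.php';

// Build the signature: a 32-character MD5 hash followed by the encrypted timestamp.
function formSignatureCreate($formPassword){
    $formTime = array_sum(explode(' ', microtime()));
    $encryptedTime = encrypt($formTime, $formPassword);
    $hash = formSignatureHash($formTime, $formPassword . $encryptedTime);
    return $hash . $encryptedTime;
}

// Return true only if the signature is intact, recent, and sent by a named user agent.
function formSignatureValidate($formSignature, $formPassword){
    // Everything after the 32-character hash is the encrypted timestamp.
    $encryptedTime = substr($formSignature, 32);
    $formTime = decrypt($encryptedTime, $formPassword);
    if(!is_numeric($formTime)) {
        return false;
    }
    // Reject forms older than 30 minutes (1,800 seconds).
    $currentTime = array_sum(explode(' ', microtime()));
    if($currentTime - $formTime > 1800) {
        return false;
    }
    // Reject robots that send no user agent at all.
    if(!isset($_SERVER['HTTP_USER_AGENT']) || $_SERVER['HTTP_USER_AGENT'] === ''){
        return false;
    }
    // Strict comparison avoids PHP's loose matching of numeric-looking strings.
    $hash = formSignatureHash($formTime, $formPassword . $encryptedTime);
    return $hash === substr($formSignature, 0, 32);
}

// Tie the hash to the form time, client address, browser, host, and form password.
function formSignatureHash($formTime, $formPassword){
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    return md5($formTime . $_SERVER['REMOTE_ADDR'] . $userAgent . $_SERVER['SERVER_NAME'] . $formPassword);
}
?>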

The encrypt and decrypt functions are wrappers I have placed around the rc4crypt code. You may use any encryption technique you want, of course, as I am only showing this as an example. These functions also convert the encrypted string to Base64, so I do not have to deal with making the string web-friendly at this level.
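
The wrappers themselves are not shown in this post. As a rough stand-in, assuming only the call shape used above (encrypt($data, $password) and decrypt($data, $password)), they could look something like the sketch below. It uses a plain PHP implementation of RC4 rather than the rc4crypt library, so rc4Stream() and both function bodies are assumptions, not that library's API. RC4 is considered weak by current standards, so a modern cipher would be a better choice for new code.

```php
<?php
// Stand-in for the wrappers described above; NOT the rc4crypt library's own API.

// Plain RC4: the same call both encrypts and decrypts. $key must be non-empty.
function rc4Stream($key, $data) {
    // Key-scheduling: shuffle the state array using the key bytes.
    $s = range(0, 255);
    $j = 0;
    $keyLen = strlen($key);
    for ($i = 0; $i < 256; $i++) {
        $j = ($j + $s[$i] + ord($key[$i % $keyLen])) % 256;
        $t = $s[$i]; $s[$i] = $s[$j]; $s[$j] = $t;
    }
    // Keystream generation: XOR each data byte with the next keystream byte.
    $i = 0;
    $j = 0;
    $out = '';
    $len = strlen($data);
    for ($n = 0; $n < $len; $n++) {
        $i = ($i + 1) % 256;
        $j = ($j + $s[$i]) % 256;
        $t = $s[$i]; $s[$i] = $s[$j]; $s[$j] = $t;
        $out .= chr(ord($data[$n]) ^ $s[($s[$i] + $s[$j]) % 256]);
    }
    return $out;
}

// Encrypt, then Base64-encode so the result is safe to place in a form field.
function encrypt($data, $password) {
    return base64_encode(rc4Stream($password, (string) $data));
}

// Base64-decode, then decrypt; input that is not valid Base64 yields an empty string.
function decrypt($data, $password) {
    $raw = base64_decode((string) $data, true);
    return $raw === false ? '' : rc4Stream($password, $raw);
}
```

Because a tampered value decrypts to an empty string or to noise, the is_numeric() check in formSignatureValidate() rejects it before any further work is done.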

For the time stamp ($formTime), I have used the microtime() PHP function for simplicity, but you could use a formatted time as well, along with appropriate changes in the formSignatureValidate() function. This time stamp is encrypted with a password designated by the programmer. It is probably a best practice to use a unique password for each form, and since it lives on the programming side, it can be a random sequence of characters.

The hash is generated in the formSignatureHash() function, which combines the following elements:

  • $formTime - The time the form was created, which allows us to validate that the form POST was submitted within a certain time frame.

  • REMOTE_ADDR - To tie the form post with the client's computer.

  • HTTP_USER_AGENT - To tie the form post with the client's browser.

  • SERVER_NAME - To tie the form to a given host server.

  • $formPassword - To tie the form POST to a given form.


Just to further confound any reverse-engineering, I combine the encrypted time with the given form password when the hash function is called.

Within the form, add a hidden field for the signature:

<?php $signature = formSignatureCreate('thisFormPassword'); ?>
<input type="hidden" name="signature" value="<?php print $signature; ?>" />

The way I have set this example up, a new signature would be created if other validation checks failed. This eliminates the need to validate the basic form (i.e., alpha-numeric) of the signature itself before printing it back into the form. We would not want someone attempting to inject code into an unvalidated signature variable!

In your validation section, add the signature check:

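// Read the posted signature explicitly rather than relying on register_globals.
$signature = isset($_POST['signature']) ? $_POST['signature'] : '';
if(formSignatureValidate($signature, 'thisFormPassword')) {
    // other form validation
} else {
    // spawn error message
}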

During validation, I make this my primary check. If the post does not pass the signature check, there is no reason to do any other validation, since it will always be rejected. In my error messaging system, I do not even display the input form again, forcing the visitor to click on a link to start the original form over. You may wish to restart a little more graciously with a blank form.

The formSignatureValidate() function first decrypts the time portion of the signature. In this example, the hash portion is always 32 characters long, so it is easy to pull the encrypted time off the end. Since I have used the microtime() function, it is easy to check that the timestamp is 1) numeric and 2) within the 30-minute (1,800-second) time frame allowed.
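
The two timestamp checks can be pulled out into a small helper, which makes them easy to exercise with a fixed clock. This is only a sketch: formTimeIsFresh() is a hypothetical name, and the rejection of timestamps that lie in the future is an extra assumption that the code above does not make.

```php
<?php
// Hypothetical helper: isolates the timestamp checks so they can be tested with a fixed clock.
function formTimeIsFresh($formTime, $currentTime, $maxAge = 1800) {
    // A decrypted value that is not numeric means the signature was tampered with.
    if (!is_numeric($formTime)) {
        return false;
    }
    $age = $currentTime - (float) $formTime;
    // A negative age means the timestamp lies in the future (an added check, not in the original).
    return $age >= 0 && $age <= $maxAge;
}
```

For example, a form created at 1000.5 is accepted at 1600.0 (600 seconds old) and rejected at 2801.0 (just over 1,800 seconds old).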

I do an additional check on HTTP_USER_AGENT only because one of my sites was once attacked by a robot without a name. I don't bother checking during the signature creation phase, since it is really only important in the validation phase. If you know of specific rogue agents or IP addresses, those can be globally addressed in an .htaccess file or in the server configuration.

The final check is that the regenerated hash matches the hash at the beginning of the signature.

If all these checks pass, then it is considered a genuine post.

I have thought of a way to defeat this methodology. Someone could certainly develop a script to "screen scrape" the signature dynamically, but hopefully, a spammer will consider that too much effort. If the need arises, it is likely one could come up with a methodology to dynamically create the signature name as well. Time will tell. In the meantime, I hope this helps someone. I hope to develop a Java/JSP version of this in the near future.
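
One way the dynamic field name might work is sketched below. It is untested against real robots and every name in it is hypothetical. The field name is derived from the form password, the host name, and the current UTC date, so it changes daily; validation looks under today's name and then yesterday's, so that a form rendered shortly before midnight still posts successfully.

```php
<?php
// Hypothetical sketch; none of these names come from the original code.

// Derive a field name that changes daily and differs per form password and host.
// The leading letter keeps the name from starting with a digit.
function signatureFieldName($formPassword, $serverName, $timestamp) {
    $day = gmdate('Ymd', (int) $timestamp);
    return 's' . substr(hash_hmac('sha256', $serverName . '|' . $day, $formPassword), 0, 12);
}

// Look for the signature under today's name, then yesterday's.
// Returns the posted signature string, or null if neither field is present.
function signatureFromPost(array $post, $formPassword, $serverName, $now) {
    foreach (array(0, 86400) as $offset) {
        $name = signatureFieldName($formPassword, $serverName, $now - $offset);
        if (isset($post[$name]) && is_string($post[$name])) {
            return $post[$name];
        }
    }
    return null;
}
```

The form would print its hidden field with signatureFieldName('thisFormPassword', $_SERVER['SERVER_NAME'], time()) as the name, and the validation section would pass the result of signatureFromPost($_POST, ...) to formSignatureValidate(). This only defeats robots that are hard-coded to a fixed field name; a scraper that reads the page each time can still find the field.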