Stopping spambots with hashes and honeypots
Created 21 January 2007
Spam sucks. Any site which allows unauthenticated users to submit forms will have a problem with spamming software (spambots) submitting junk content. A common technique to prevent spambots is CAPTCHA, which requires people to perform a task (usually noisy text recognition) that would be extremely difficult to do with software. But CAPTCHAs annoy users, and are becoming more difficult even for people to get right.
Rather than stopping bots by having people identify themselves, we can stop the bots by making it difficult for them to make a successful post, or by having them inadvertently identify themselves as bots. This removes the burden from people, and leaves the comment form free of visible anti-spam measures.
This technique is how I prevent spambots on this site. It works. The method described here doesn’t look at the content at all. It can be augmented with content-based prevention such as Akismet, but I find it works very well all by itself.
Know thy enemy
From watching how spammers fail to create spam on my site, I’ve seen three different types of spam creators: playback spambots, form-filling spambots, and humans.
Playback bots
These are bots which have recorded POST data which they replay back to the form submission URL. A person visits the form the first time, and records the form data. Certain fields are marked as slots to be filled in with randomized spam later, but the structure of the form is played back verbatim each time. This includes the names of the fields, and the contents of hidden fields.
These bots don’t even bother looking at the form as served by the site, but blindly post their canned data to the submission URL. Using unusual field names to avoid these bots will only work for a week or so, after which they will record the new field names and begin posting with them.
A playback bot can be stopped by varying the hidden data on the form so that it will not be valid forever. A timestamp is a simple way to do this, making it possible to detect when old data is being replayed. The timestamp can be made tamper-proof by hashing it with a secret and including the hash in the hidden data of the form. Replaying can be further hindered by including the client’s IP address in the hash, so that data can’t even be immediately replayed across an army of spambots.
Form-filling bots
These bots read the form served by the site, and mechanically fill data into the fields. I don’t know if they understand common field names (email, name, subject) or not. On my site, I’ve observed bots that look at the type of the field, and fill in data based on the type. Single-line edit controls (type=text) get name, email, and subject, while textareas get the comment body. Some bots will fill the same data into all the fields of the same type, while others will enter (for example) different first names into each of the single-line fields.
Form-filling bots can be stopped by including editable fields on the form that are invisible to people. These fields are called honeypots and are validated when the form data is posted. If they contain any text, then the submitter must be a bot, and the submission is rejected.
Using randomized, obscured field names and strict validation can also stop these bots. If the email field must have an @-sign, and the name field must not, and the bot can’t tell which field is email and which is name, then the chances it will make a successful post are greatly reduced.
Humans
These are actual people using your form. There’s nothing you can do to stop them, other than to remove the incentive. They want link traffic. Use the rel="nofollow" attribute on all links, and be clear that you are doing it.
Building the bot-proof form
The comment form has four key components: timestamp, spinner, field names, and honeypots.
The timestamp is simply the number of seconds since some fixed point in time. For example, the PHP function time() follows the Unix convention of returning seconds since 1/1/1970.
The spinner is a hidden field used for a few things: it hashes together a number of values to prevent tampering and replays, and it is used to obscure field names. The spinner is an MD5 hash of the following (a short PHP sketch follows the list):
- The timestamp,
- The client’s IP address,
- The entry id of the blog entry being commented on, and
- A secret.
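Here is a minimal sketch, in PHP, of how such a spinner could be computed when the form is served. The variable names, the placeholder secret, and the exact concatenation order are illustrative assumptions, not the actual code used on this site.

<?php
// Sketch only: names, secret, and concatenation order are assumptions.
$timestamp = time();                       // seconds since 1/1/1970
$ip        = $_SERVER['REMOTE_ADDR'];      // client's IP address
$entry_id  = 1234;                         // id of the entry being commented on
$secret    = 'a-long-random-secret';       // never sent to the client

// Any tampering with the timestamp or entry id will break this hash.
$spinner = md5($timestamp . $ip . $entry_id . $secret);
?>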
The field names on the form are all randomized. They are hashes of the real field name, the spinner, and a secret. The spinner gets a fixed field name, but all other fields on the form, including the submission buttons, use hashed field names.
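A sketch of how a field name could be hashed, continuing the assumptions above; hashed_field_name() is a hypothetical helper written for illustration, not a function from this site.

// Hash of the real field name, the spinner, and the secret.
function hashed_field_name($real_name, $spinner, $secret) {
    return md5($real_name . $spinner . $secret);
}

$name_field    = hashed_field_name('name', $spinner, $secret);
$comment_field = hashed_field_name('comment', $spinner, $secret);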
Honeypot fields are invisible fields on the form. Invisible is different than hidden. Hidden is a type of field that is not displayed for editing. Bots understand hidden fields, because hidden fields often carry identifying information that has to be returned intact. Invisible fields are ordinary editable fields that have been made invisible in the browser.
The invisibility of the honeypot fields is a key way that bots reveal themselves. Because bots do not process the entirety of the HTML, CSS, and Javascript in the form, and because they do not build a visual representation of the page, and because they do not perceive the form as people do, they cannot distinguish invisible fields from visible ones. They will put data into honeypot fields because they don’t know any better.
The form is built as usual (a sketch of the resulting form follows the list), including:
- editable fields for all of the information we want to collect from the user,
- hidden fields for identifying information, including the timestamp, the spinner, and the entry id,
- invisible honeypot fields of all types, including submission buttons.
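Putting the pieces together, the served form might look roughly like this. This is a hedged sketch built on the helpers above; the action URL, labels, and the inline style used for the honeypot are illustrative choices, not the markup of this site's actual form.

<form action="/comment" method="post">
  <input type="hidden" name="spinner" value="<?php echo $spinner; ?>">
  <input type="hidden" name="<?php echo hashed_field_name('timestamp', $spinner, $secret); ?>"
         value="<?php echo $timestamp; ?>">
  <input type="hidden" name="<?php echo hashed_field_name('entry_id', $spinner, $secret); ?>"
         value="<?php echo $entry_id; ?>">

  <!-- Real fields, with hashed names -->
  Name: <input type="text" name="<?php echo hashed_field_name('name', $spinner, $secret); ?>">
  Comment: <textarea name="<?php echo hashed_field_name('comment', $spinner, $secret); ?>"></textarea>

  <!-- Honeypot: an ordinary editable field, made invisible -->
  <input type="text" name="<?php echo hashed_field_name('honeypot', $spinner, $secret); ?>"
         value="" style="display:none">

  <input type="submit" name="<?php echo hashed_field_name('post', $spinner, $secret); ?>" value="Post">
</form>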
Processing the post data
When the form is posted back to the server, a number of checks are made to determine if the form is valid. If any validation fails, the submission is rejected.
First the spinner field is read, and is used to hash all of the real field names into their hashed counterparts so that we can find data on the form.
The timestamp is checked. If it is too far in the past, or if it is in the future, the form is invalid. Of course a missing or non-integer timestamp is also a deal-breaker.
The value of the spinner is checked. The same hash that created it in the first place is re-computed to see that the spinner hasn’t been tampered with. (Note that this check isn’t actually necessary, since if the spinner had been modified, it wouldn’t have successfully hashed the timestamp field name and the timestamp verification would already have failed, but the extra check is harmless and reassuring.)
Check the honeypots. If any of them have any text in them, the submission is rejected.
Validate all the rest of the data as usual, for example, name, email, website, and so on.
At this point, if all of the validation succeeded, you know that you have a post from a human. You can also apply content-based spam prevention, but I have not found it to be necessary.
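A minimal PHP sketch of the whole check sequence. The posted() helper, the one-hour validity window, and the field names are illustrative assumptions rather than the site's actual code.

<?php
// Same hypothetical helper used when building the form.
function hashed_field_name($real_name, $spinner, $secret) {
    return md5($real_name . $spinner . $secret);
}

// Look up a posted value through its hashed field name.
function posted($real_name, $spinner, $secret) {
    $key = hashed_field_name($real_name, $spinner, $secret);
    return isset($_POST[$key]) ? $_POST[$key] : '';
}

$secret  = 'a-long-random-secret';
$spinner = isset($_POST['spinner']) ? $_POST['spinner'] : '';

// 1. The timestamp: must be an integer, not too old, and not in the future.
$timestamp = posted('timestamp', $spinner, $secret);
if (!ctype_digit($timestamp)) { die('rejected'); }
$age = time() - (int)$timestamp;
if ($age < 0 || $age > 60 * 60) { die('rejected'); }   // one-hour window is an assumption

// 2. The spinner: re-compute it to confirm nothing was tampered with.
$entry_id = posted('entry_id', $spinner, $secret);
$expected = md5($timestamp . $_SERVER['REMOTE_ADDR'] . $entry_id . $secret);
if ($spinner !== $expected) { die('rejected'); }

// 3. The honeypots: any text at all means the poster is a bot.
if (posted('honeypot', $spinner, $secret) !== '') { die('rejected'); }

// 4. Ordinary validation of name, email, comment body, and so on goes here.
?>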
Making honeypots invisible
This is the essence of catching the bots. The idea here is to do something to keep the honeypot fields from being visible (or tempting) to people, but that bots won’t be able to pick up on. There are many possibilities. As you can see from looking at my comment form, I’ve simply added a style attribute that sets display:none, but here are some other ideas (two of them are sketched after the list):
- Use CSS classes (randomized of course) to set the fields or a containing element to display:none.
- Color the fields the same as (or very similar to) the background of the page.
- Use positioning to move a field off of the visible area of the page.
- Make an element too small to show the contained honeypot field.
- Leave the fields visible, but use positioning to cover them with an obscuring element.
- Use Javascript to effect any of these changes, requiring a bot to have a full Javascript engine.
- Leave the honeypots displayed like the other fields, but tell people not to enter anything into them.
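Two of these options, sketched in HTML/CSS; the class name and field name are made up for illustration.

<!-- Inline style, as used on this site's comment form: -->
<input type="text" name="a1b2c3d4" value="" style="display:none">

<!-- Or a randomized CSS class that positions the field off-screen: -->
<style> .qx81kz { position: absolute; left: -5000px; } </style>
<div class="qx81kz">
  <input type="text" name="a1b2c3d4" value="">
</div>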
Criticisms
Let me address a few common criticisms.
Defeatability
In theory, it is possible for a spambot to defeat any of these measures. But in practice, bots are very stupid, and the simplest trick will confuse them. Spam prevention doesn’t have to make it theoretically impossible to post spam, it just has to make it more difficult than most of the interesting forms on the internet. Spammers don’t make software that can post to any form, they make software that can post to many forms.
A relevant joke:
Jim and Joe are out hiking in the forest, when in the distance, they see a huge bear. The bear notices them, and begins angrily running toward them. Jim calmly checks the knots of his shoes and stretches his legs.
Joe asks incredulously, “What are you doing? Do you think you can outrun that bear!?”
Jim replies, “I don’t have to outrun the bear, I just have to outrun you.”
In any case, yes, spammers may eventually write spambots sophisticated enough to navigate honeypots properly. If and when they do, we can switch back to CAPTCHAs. In the meantime, honeypots work really well, and there are lots of ways to make them invisible we haven’t even needed to use yet.
Accessibility
Users that don’t use CSS or Javascript will be exposed to all of the honeypot fields. A simple solution is to label the fields so that these users will leave them untouched. As long as no text is entered into them, the form will submit just fine.
It works
This technique has been keeping spam off my site for a year now, and works really well. I have not had problems with false positives, as Akismet has had. I have not had problems with false negatives, as keyword-based filtering has had. Spambots may get more sophisticated, but their software complexity will have to increase by orders of magnitude before they can break this method.
See also
- Negative CAPTCHA, Damien’s post about a similar technique. The comments on that post got me energized to write up my technique.
- My blog, where many software engineering topics are discussed.
Comments
One thing that you didn't mention and that I consider valid is checking that a minimum interval of time passes between the request of the page containing the form and its submission. Humans just won't (or shouldn't) spend less than, say, 15 seconds reading or browsing a page before submitting content via a form, while bots don't need that time at all.
So checking a 'form-generated-at' timestamp vs. a 'form-submitted-at' one and rejecting the post when those are too close makes a good bot detection method. What do you think?
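(A minimal PHP sketch of the check this comment describes; the 15-second threshold and the form_generated_at field name are the commenter's idea and my naming, and in practice the field would need the same tamper-proofing as the article's timestamp.)

// form_generated_at was written into a hidden field when the form was served
$served_at = isset($_POST['form_generated_at']) ? (int)$_POST['form_generated_at'] : 0;
if (time() - $served_at < 15) {
    die('rejected: submitted too quickly to be human');
}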
It's not as bad as a CAPTCHA either, since it actually relates to the content they are commenting on, and would require human spammers to at least read some of the post, slowing down the rate of spam.
I ended up turning off unregistered user posting, but I was really, really irritated by the experience.
Here's my own logic: If we can make it (near?) impossible for human spammers to succeed, we can happily forget about spambots, as the bot problem would be solved. So perhaps we need to focus on making sure that humans who enter comments do so out of legitimate motivation.
The multiple choice method mentioned above is a strong pointer in that direction.
A nice twist to mort's delay method would be to not let the user know how much delay is required in error messages. If a user knows it's 15 seconds he's got something systematic to work with (in a way we might not want). Also, I'd suggest randomly varying the delay time for each request.
Skip
(3) if the email address is invalid, switch to displaying a "hit the back button and check your email address" message; this stops bots and people who give a fictitious email address when they post...
The function should be randomized, e.g. using different constants, loops, math operations, so it would require the spambot to evaluate it every time.
The spambots based on e.g. Firefox+Greasemonkey would make it relatively easy to break this protection, but even then the spammers would need more resources (JS is not exactly fast) for spamming than they do now. Or so I hope.
In my opinion Javascript is best disabled - I use the Firefox NoScript plugin for this.
Also I think that your approach only works if the robot's screen is different from the user's. I'm really tempted to try a script in AutoHotkey (www.autohotkey.com) that does the following:
1. Search the site for the comment section "add a comment:", then "name:", "email:", "www:".
2. Then go back to "name:" and move the mouse pointer 300 pixels to the right and click.
3. Enter name, TAB, email, TAB, TAB, spam text, TAB, Enter
4. Reset DSL-Line for new IP
5. Reset Cookies
6. Reload your page
7. Wait 10 seconds
8. Goto #1
I wouldn't be surprised if the comment list would fill up fast.
First of all, congratulations for the strategy, I am using something similar, but quite simpler, and now will improve it.
One question: could you please explain what is the purpose of the PHPSESSID hidden field? I doubt that you are using it, because it would mean one cannot post if session cookies are disabled or if the session has expired. So what is the benefit of this field?
Thank you.
----
P.S. Well well well -
"You took a long time entering this post. Please preview it and submit it again."
This actually means that everyone who posts a comment after spending some time reading your article thoroughly will have to submit twice. Why not move the slider more to the usability side? After all, websites are made FOR the people, not AGAINST spam bots.
In addition, please make the page position itself on the submitted comment or on the unsubmitted form after posting.
Best regards and good luck!
Dimo
Advanced Textual Confirmation
http://bbantispam.com/atc/
It works very well, is easy to install, and there have been no complaints from the visitors.
1. when the form page is served a token is created together with a timestamp - these are stored in a text file. When a submission is received the system checks:
a) does the token exist
b) was the submission too slow or quick for a human
2. I reject submissions containing the string "http://" and ask the user to remove it from any web addresses.
Another thought I had was using javascript requiring very simple human interaction - focusing or clicking on the page would pull all of the form fields from a linked (but dynamically generated) javascript file using 'document.write' to output the code to the page. That would require that not only the spambot be capable of running javascript, but also that it would have to download all linked .js files to the page, and actually perform an interaction with the page to write out the randomly generated form fields.
"The timestamp is checked. If it is too far in the past, or if it is in the future, the form is invalid. Of course a missing or non-integer timestamp is also a deal-breaker."
If you have hashed the timestamp into the main hash, then if someone screws around with the hidden timestamp field, it will never match the created hash.
So while you can still check how long it has been since that hash was generated, it would be impossible for someone to generate a future hash without knowing your secret/hash parts.
Things I noticed on this blog are that honeypots of this blog do not seem to have:
- the value initialized, or
- an id.
So, if they have id='foo' and value='bar', writing a spambot to detect honeypots on this site would be even more difficult.
----
#!/bin/bash
#
# Anyone who requests this CGI gets added to the "Deny from" line in
# two files, .htaccess and .htaccess_recent.
#
# First put this CGI in your robots.txt file to prevent legitimate spiders from
# finding it. You should wait about a day since many spiders (such as Google)
# only request robots.txt once per day.
# Next, put empty links to it in various HTML pages to trick
# bad spiders into tripping it.
# After you've verified that it's working correctly, you can change
# the htaccess file below to point to a real .htaccess file that blocks
# sites.
#
# Use a cron job to run spamtrap_rotate_lists.sh to periodically move .htaccess_recent over .htaccess and recreate
# it as an empty htaccess file. This prevents .htaccess from filling up
# with old addresses.
# You must edit spamtrap_rotate_lists.sh to set the correct directory
# and file names.
#
file="htaccess" #change this to the real one.
function error() {
echo "
File error adding entry to $1.
"exit
}
echo "Content-type: text/html"
echo
echo ""
if grep -q "$REMOTE_ADDR" ${file}
then
echo "
Already in the list.
"exit
fi
sed -i "s/Deny from.*/& ${REMOTE_ADDR}/" ${file} 2>&1 || error "${file}"
sed -i "s/Deny from.*/& ${REMOTE_ADDR}/" ${file}_recent 2>&1 || error "${file}_recent"
echo "
Your address $REMOTE_ADDR will now be blocked from this site. This is a trap for automated spiders that do not honor the robots.txt file. Email "webmaster" at this site for details.
"----
file rotation script (run monthly as a cron job):
----
#!/bin/bash
dir=/usr/lib/cgi-bin/spamtrap
file=${dir}/htaccess
file_recent=${file}_recent
file_default=${file}_default
mv ${file_recent} ${file}
cp ${file_default} ${file_recent}
----
htaccess_default:
----
# htaccess_default
# 88.151.114.* is webbot.org (webbot.ru)
# you could add more known spamming domains if you want:
Deny from 88.151.114.*
http://recaptcha.net/resources.html
I've just created a new possibility to use its functionality:
http://code.google.com/p/mailhide-tag/
It is a JSP tag which helps developers to hide mail addresses from spambots.
I think the methods you suggest are valid and, for most sites, will work outstandingly.
Like with captcha, the main point is that we don't want to create traps into which existing bots fall, but traps which are completely unavoidable, even if you know the drill. That's a bit higher mathematics.
The problem is that IF the spammer is interested enough in your site, he/she WILL write a script to defeat all these ordinary methods. The point of a CAPTCHA is to force the user to do something that a computer just is not able to do, even if you tried to teach it.
Of course, the quest for a perfect CAPTCHA is still on. Google has done quite well, with almost 100% accuracy.
I am currently doing battle with these things and have them beat for now with custom CAPTCHAS but...
Based on the info here I will have to take further measures in the future.
As my studio manages a number of sites I want my next solution to be comprehensive and more robust.
After discovering that the bots are only registering at my site to promote a web site URL for search engine ranking, I have modified the 'register new user' script to barf an error message and fail the registration if the new user submits a web site URL. So far I have only had 2 people advise me of the problem when registering, so I just update their web site URL manually. Very simple and very effective.
I'm developing an alternative technique called Pictcha.
It is a protection in the form of an image retrieved from the UTYP engine, which can be embedded inside web forms, and which will filter out various spams in a more user-friendly way than the well-known captchas.
http://nthinking.net/miss/pictcha.html (simple JavaScript implementation)
http://nthinking.net/miss/pictcha-sample.php (server-side PHP implementation; there is also a PHP lib for the server)
And as a bonus, it is learning. Thus it recycles the tremendous waste of concentration which is the conventional Form Validation by text recognition.
You may check the lab page to get more details about it:
http://nthinking.net/miss/lab.html
I also find this "submitting twice" thing annoying, as Dimo said. But perhaps not too annoying. At least this is better than sites that outright discard your comments when you submit, when you just took one hour to write your comment and have not saved it somewhere.
In all, I see this (and the current status of email spam) as a sign that the spammers have already won. I believe the spammers' true intention is to destroy the web and email as viable means of communication, and we have, hopefully only reluctantly, in fact helped them achieve their real goals.
I find rel="nofollow" to be somewhat rude to the commenters.
I have an idea I haven't tried yet: use rel="nofollow" for all new comments by default and remove it when the comment is marked as non-spam. With all that being clearly indicated.
We in Web-APP will absolutely apply some of your tips.
Thanks
On
http://www.web-app.net
In particular, I LIKE text-based browsers BECAUSE they don't display graphics, unreadable colour schemes, or have an expensive scripting engine.
It's just too bad that the (latest) HTML standards now pretty much require the DOM.
Time-based analysis may fail in cases where the form does things like time out, and the user is forced to copy&paste their comment. (I have had to do that for one site)
You make a CSS-hidden field. To that you add a CSS-hidden label saying "Write the number eight".
Now, if the browser supports CSS, the field will not be seen, therefore you have to check for empty field.
But if the browser does not understand CSS, the field will be labeled Write the number eight.
On the server side, you then have to check for either an empty field or the number 8. This number should be random every time, or based on the IP. A CAPTCHA method like this basically works 100% of the time, and it is quite user-friendly.
To get the stupid bots (meaning in general Russians) even before this, you make a form and some fields inside an HTML comment. Then you check on the server side that these fields are not filled out, since no browser will render them.
Furthermore, you put a single entity in the submit button, and check that this has been translated into a letter on the server side. For example: S#entity#nd comment.
About the time limit: the new Israeli bots are quite clever in this matter. They will wait up to 30 seconds from GET to POST, according to our statistics. Also, before this they will act like they are in fact browsing the pages. So it will not stop them.
Counting hrefs: the new way of spamming goes like
"Hey, I am looking for |product|. Anyone know where I can find it?"
And either another bot or an (innocent) user on the forum will give the |answer|.
I do not consider blocking based on content a good idea.
Lastly.
Some stupid bots can be blocked in whole or from some content if they contain a suspicious user-agent string or an empty one.
Like
JAVA, meaning they are probably harvesters, or
ru, indicating it is Russian (this goes for the Accept-Language header as well; check there for cn, meaning China, too).
Also, if it doesn't understand GZIP, it is suspicious, since even Lynx does.
All of it can be combined with Javascript and cookies, which are hard for most bots to understand.
This is not all, of course. Just what I can remember from some analysis we made on our side. There are special ways of treating referer spammers and vulnerability scanners and injectors and so on.
I randomize form field names twice a day, a cron job updates a file with fields such as email=SKJEJFDLFKDLKFF, name=WIWIJDSLKSLSNBV, comment=SKJDSKJDKJSKD, etc. The tokens at the end are randomized. The program that outputs the form looks up the proper token and outputs the token instead of an obvious form field name. The process is reversed on the way back in.
I also set a cookie when the form is requested and check for the same cookie when the form is submitted. Amazing how much spam this eliminates on its own.
Never allow CC:, BCC:, or http: (or variations) in anything that may forward an e-mail. This also eliminates a lot of attempts, since without a link a lot of spam is kind of useless.
One of the form fields is also a timestamp, usually time() but doctored up so that it's not obvious that it's the number of seconds since 1970. The program processing the form gives an error if the timestamp is greater than the current time() or too far in the past. I also started inserting one letter (a random letter that changes every 12 hours) into the string; if that letter doesn't come back, or there are extra letters or no letters, again an error. So you are a bot and you see a hidden form field like name=HDJJWIJDJWIJDIWJ value=27738G84 - tell me, what would you put into it?
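(A rough PHP sketch of the letter trick this comment describes; the letter schedule, the splice position, and reusing the comment's example field name are illustrative guesses, not the commenter's actual code.)

// When serving the form: pick a letter that changes every 12 hours and splice it in.
$t        = (string)time();
$letter   = chr(ord('A') + ((int)(time() / (12 * 3600)) % 26));
$doctored = substr($t, 0, 4) . $letter . substr($t, 4);

// When processing the post: exactly one copy of the expected letter must be present.
$value = isset($_POST['HDJJWIJDJWIJDIWJ']) ? $_POST['HDJJWIJDJWIJDIWJ'] : '';
if (substr_count($value, $letter) !== 1) { die('rejected'); }
$timestamp = (int)str_replace($letter, '', $value);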
I'd be happy to clarify if I typed this too fast.
What techniques would you use for EMAIL submitted comments, such as mailing lists, to prevent spam? I run a mailing list and I want to prevent spammers from sending mail to the list. If I deny or kickout one email address, they come back with another.
I was considering making it so each time you post to the list you get a confirmation link that you need to click on to confirm the email is legit, and then do a CAPTCHA to make sure you are real. I was also considering trying to incorporate a SpamCop-like list to screen IP addresses. I don't want to make it difficult for people or annoying to use the list, but if it is riddled with spammers it is useless.
Thanks for your thoughts.
Best,
Josh
thanks
http://www.webdonuts.com/2009/10/spambot/
After the form is submitted, use Javascript to display a confirmation link, like "Please click the link to confirm you're human". That link would be inserted via Javascript, and thus would not be in the page source (therefore, inaccessible to scrapers). If that link is then clicked, the form is posted.
Obviously, hashing the field names inside Javascript gives away your salt. You could send an AJAX request to PHP to spit out the values for you, but this leaves a way for bots to grab your names AND adds an extra HTTP request.
I'm guessing I'm gonna end up using something like descendant selectors.
Name: unmrab
Phone: LrASqJYNJhod
Enquiry: 21vy8X href="http://kinqjyhxukwb.com/">kinqjyhxukwb>,
[url=http://ekhouqnuhrzp.com/]ekhouqnuhrzp[/url],
[link=http://dkwclumezmns.com/]dkwclumezmns[/link],
http://ayzwqhockubx.com/
Does this look like the work of a spambot, or do you think something else is at work here?
http://www.isegura.es/blog/stop-spam-your-site-being-invisible-honeytrap-drupal-comments-form
We're also working on a Wordpress implementation, as it has been a success.
Security Vs Usability Trade Off I guess....
Include a "i am human" button, once clicked it inserts a new (random selection) label+field into a div via a ajax call, then this has to be filled with the question in the label.
I currently use a simple way to fool spambots on my sites:
I made 6 images with a number range; one of the images gets randomly served when the page with the form is opened. Users must type this number into a field, and once they click submit it checks via AJAX whether the number is OK. If scripts are disabled then the number is shown instead of the field. In the back end the number is verified before further processing.
Here are some other approaches (too bad I can't use <LI> tags here):
* Create a form with unusual field names.
* Create a varying form, where field names and prefilled data is varied.
* Include a simple text CAPTCHA (such as "Type the answer of five plus three in digits" or "Enter your favorite sport (hint: I believe its name must start with 'g')").
* Use the "spinner" if you want to.
* Use the Preview/Submit order like you use here.
* Simply do not use HTTP! Make comments sending by telnet or gopher protocols. Spambots won't send anything using those protocols.
Like the form name field: create an image with "NAME" on the fly, and similarly for the others, then randomize the sequence of images each time a user wants to register or log in. The probability that a bot can fill the form correctly greatly decreases.
http://www.myjqueryplugins.com/QapTcha/demo
a very nice 'human' way of activating the form. The user doesn't have to do any math or read any of the annoying, hard-to-read CAPTCHA texts. This one gives by far an easy user experience. (Because it's jQuery, it'll also work on iPads etc., something that most of the Flash alternatives here don't offer. Maybe something to consider as well. :)
Currently I use the below, where email-confirmation is the hidden field.
if(!String.IsNullOrEmpty(Request.Form["email-confirmation"]))
IgnoreComment();
That doesn't work.
Thanks
You check if the string is NOT containing anything and do IgnoreComment().
It should be
if(String.IsNullOrEmpty(Request.Form["email-confirmation"]))
IgnoreComment();
Can't seem to make this work!
Microsoft VBScript compilation error '800a03ee'
Expected ')'
/xcontact.asp, line 425
if(String.IsNullOrEmpty(Request.Form["email-confirmation"]))
------------------------------------^
Sorry for being so thick!
If Not String.IsNullOrEmpty(Request.Form("Surname")) Then
'your code here...
End If
If String.IsNullOrEmpty(Request.Form("Surname")) Then
'Form value is empty
End If
Thanks
Honeypot fields might be filled somehow even without real users seeing them, thus producing flaws in the testing process.
Perhaps an autocomplete="off" attribute should be adopted on the whole form tag.
One other thought: This won't degrade gracefully, but if you are submitting the form via javascript, you can also leave the action out of the form, or put some dummy url in where the form will post. Then specify the url where you wish to post the form data via the javascript itself.
What do you think about using all of the methods above and incorporating a CAPTCHA as a honeypot? The user doesn't see the CAPTCHA or need to fill it in, but bots will attempt to do so and fail every time.
The action URL does not in fact submit to the page that commits the comment; it in fact submits to a page that only has an HTML meta redirection (HTTP 200, not 3**). A spam bot will not follow that redirect, and the comment will be silently lost. There are some implementation details that can make such a thing safer.
I assume spammers would not use a cookie that will be filtered. Until this week this worked well, but now I've started to receive spam e-mails again. I will replace the random number with a hashed version.
Implement that on my site! Thanks!