
Stopping Spambots: A Spambot Trap


Using Linux, Apache, mod_perl, Perl, MySQL, iptables and Embperl

Topic: Neil Gunton  
Categories: Perl utilities, Spam prevention

Permalink: http://www.neilgunton.com/doc/spambot_trap

Copyright © 2002-2014 By Neil Gunton

Last update: Friday May 14, 2010 10:32 (US/Pacific)


Table of Contents

The Spambot Trap

      Introduction
      The Problem: Spambots Ate My Website
      Overview of the Spambot Trap
      Banishing 'mailto:'
      MySQL
      BlockAgent.pm
      iptables
      badhosts_loop
      spambot_trap/ Directory
      robots.txt
      Your HTML Files
            Embperl
      httpd.conf
      Monitoring
      Conclusions
            Strengths
            Weaknesses
            Possible future enhancements
      Alternative Ideas

Appendices

      Update 1 April 26th 2002: Evolution in action - the Spambots Strike Back
      Update 2 May 1st 2002: The author of the new, scary spambot comes forward
      Update 3 June 29th 2002: A possible way to stop spambots that pretend to be browsers
      Update 4 October 15th 2005: Dealing with Googlebot 2.1, which pretends to be a browser
      Update 5 October 21st 2006: The Attack of the Botnet
      Update 6 November 30th 2006: Defending against botnets
      Update 7 August 2009: Spotting new botnets

Links

      Information Resources, Tips and How-to's
      Organizations and Blocklists
      Server-side tools
      Email obfuscation tools
      Email filters
      Anti-Spam email services
      Research projects
      Novel ideas

The Spambot Trap

        
        

Introduction

                
This document describes my experiences with spambots on my websites, and the techniques I have developed to stop them dead. I assume the reader has basic familiarity with Linux, Apache, mod_perl, Perl, MySQL and firewall rules using iptables - each of these topics could fill a book, so I won't talk about installation or basic configuration. I will, however, provide full scripts and instructions for using them in the context of the spambot trap. If you'd like some basic pointers on getting set up with these tools, then you could take a look at my short series of three Linux Network Howto articles.

Updates:

2002-04-12: I've had a lot of feedback since the original slashdot article - thanks to all those people who have written with some really great ideas for alternative ways to foil the spambots. I've tried to incorporate some of these into the document, and also to give credit to people by name where it's due. Thanks again!
2002-04-26: There's a new update on how the spambots seem to be "evolving" to avoid traps.
2002-05-01: Another update: The author of the new, scary spambot comes forward, and a few new links.
2002-06-29: One possible way to foil spambots that pretend to be browsers.
2002-09-05: I've added a new log snapshots appendix, which shows the badhosts_loop logs at various moments in time.
2005-10-15: New version of the badhosts_loop script, fixing a problem where spambots could connect to the website during the brief periods when the iptables chains are being regenerated. Also moved from the older ipchains to the newer iptables. Finally, added Update 4 on how to deal with Google's new bot, which pretends to be a browser.
        
        

The Problem: Spambots Ate My Website

                
Spambot: (noun) - A software program that browses websites looking for email addresses, which it then "harvests" and collects into large lists. These lists are then either used directly for marketing purposes, or else sold, often in the form of CD-ROMs packed with millions of addresses. To add insult to injury, you may then receive spam emails which ask you to buy one of these lists (or even the spambot itself). Spambots (and spam) are a pestilence which needs to be stamped out wherever it is found.

I have a website, crazyguyonabike.com, which has bicycle tour journals, message boards and guestbooks. I started noticing around the end of 2001 that the site was getting hit a lot by spambots. You can spot this sort of activity by looking for very rapid surfing, strange request patterns, and non-browser User-Agents.

After looking at the server logs, I realized a couple of things: Firstly, the spambots came from many different IP addresses, so this precluded the simple option of adding the source IP to my firewall blocks list. Secondly, there seemed to be a common behavior between the bots - even if this was the first visit from a particular IP address (or even a particular network, so no chance of just being a different proxy) they would come straight into the middle of my website, at a specific page rather than the root. This means that the spambots obviously had some kind of database of pages, which had presumably been built up from previous visits (before I'd noticed the activity), and this database was being shared between a large number of different hosts, each of which was apparently running the same software. Of course, this "database" could also simply be search engine results - apparently some spambots actually use search engines such as Google and Yahoo! to look for promising pages.

Another distinctive behavior was that the spambots followed only those links containing certain keywords that would seem promising if you're looking for email addresses: "guestbook", "journal", "message", "post" and so on. On each of the pages in my site there were many other links in the navbars, but only links with these keywords were being followed. Also, robots.txt was never even being read, let alone followed. Moreover, the bot would come in, scan pages rapidly for maybe a few seconds, and then stop for a while. So it was obviously making at least some attempt to circumvent blocks based on frequency/quantity of requests.

This was very annoying. For one thing, these things were picking off email addresses from my website (at that point, I was letting people who posted on my message boards decide for themselves whether they wanted their email addresses to be visible or not). But quite apart from that, it was taking up resources, and was just plain rude. I hate spam. I resent my webserver having to play host to people whose obvious goal is to cynically exploit the co-operative protocols of the internet to their own selfish, antisocial gain. So, I decided to do something about it.

The first thing I did was to look at the User-Agent fields which were being used by the bots. There were a variety, including variations on the following (note: these particular user agents have long since disappeared - check the log snapshots for recent logs of the user agents that current spambots are using):

  • DSurf15a 01
  • PSurf15a VA
  • SSurf15a 11
  • DBrowse 1.4b
  • PBrowse 1.4b
  • UJTBYFWGYA (and other strings of random capital letters)

I searched the internet for references to these strings, but all I found was a slew of website statistics analysis logs. This meant that these particular spambots obviously got around. It was also discouraging, because there was no mention anywhere of what these things actually were. I was surprised that there seemed to be no discussion whatsoever of something that seemed to be pandemic. Then I found a couple of other websites with guestbooks that had actually been defiled by these spambots: (if you follow these links and you don't see a lot of empty messages left by the above user agents, then that means the webmaster of the site has finally found a way to stop it, so good for them...)

I reckon the spambots didn't really intend to leave empty messages. They just tend to want to follow links with the keyword 'post'. So if the guestbook posting form has no preview or confirmation page, then the spambot would leave a message simply by following this link! My guestbooks and message boards have a preview page, which is probably why I hadn't had any of this.

Anyway, I started thinking about what kind of program this thing was. First of all, it comes from all kinds of different IP addresses. I couldn't quite believe that this many different IP addresses were all intentionally using the same software, of which I could find absolutely no mention anywhere on the Web. This made me think it might be some kind of virus/trojan/worm or whatever that silently installed itself on people's computers, and then used the CPU and bandwidth to surf the Web without the owner being aware of it. I thought that if this was the case, then it must be sending the results somewhere - and if we could find out where, then we could go about shutting the operation down.

The only website which I have found associated with any of the IP addresses which have hit me is http://cheap-shells.com/. Looking at the site, this seems to be a company that supplies, uh, cheap shells. Four separate IP addresses hosted by this outfit seemed to have been sources of spambot visits at different times. (Update: Since mentioning this in the original version of the article, the people who run cheap-shells.com have assured me that this was an errant customer (who's since been dropped), running some kind of freely available spider, rather than being a virus. That's big progress! So, it seems that this thing is not a virus or trojan after all... no word yet however on the real identity of the spider program...)

Regardless of what the thing is, I have had no luck at all in getting any help from the sysadmins at ISPs I have contacted. A typical exchange was with a guy at Cox Internet, which was where a persistent offending IP address was sourced. He just couldn't be bothered, and eventually told me that spidering was not against the law, or their terms of service. I asked whether actions which were blatantly obviously geared toward the generation of spam were against their terms of use, but he never replied to that. I had no more luck anywhere else: nobody had heard of this thing. I even sent an email to CERT, but got no response. If you have any ideas, I'd love to hear them.

So anyway, I turned instead to thinking about how I could erase these pests from my life as much as possible. This document is about my quest to stop spambots (not just this one, but ALL spambots) from abusing my website. Hopefully it will be useful to you.

        
        

Overview of the Spambot Trap

                
There are three main parts to the technique which I outline here:

  1. Banish visible email addresses from your websites altogether, or at the very least obfuscate them so they can't be harvested. Examples of how to do this are given. This is your fail-safe, in case the spambots figure out a way around your other defences. Even if they manage to cruise your website on their very best behavior, they still should not be able to harvest email addresses!

  2. Block known spambots: Certain User-Agents are just known to be bad, so there's no reason to let them come on your site at all. True, spambots could in theory spoof the User-Agent, but the simple reality is that a lot of them don't. We use an enhanced version of the BlockAgent.pm module from the O'Reilly mod_perl book. This extension adds offending IP addresses to a MySQL (or other relational) database, which is picked up by the third part of our cunning system...

  3. Set a Spambot Trap, which blocks hosts based on behavior. We set a trap for spambots, which normal users with browsers and well-behaved spiders should not fall into. If the bot falls in the trap, then its IP address is quickly blocked from all further connections to the webserver.

    This works using a persistent, looping Perl script called badhosts_loop, which checks every few seconds for additions to a 'badhosts' database. This script then adds DROP rules for each bad host to the iptables firewall. Blocks have an expiry, which is initially set to one minute. If a host falls in the trap again after the block expires, then that IP is blocked again - and the expiration time is doubled to 2 minutes. And so on. However, the minimum block period in practice is more like 10 minutes, since this is the cycle time of the badhosts_loop script if nothing else is going on. This algorithm ensures that the worst offenders get progressively more blocked, while one-time offenders don't stick around in our firewall rules eating up resources.

There are various components to the Spambot Trap, including the badhosts_loop Perl script, the BlockAgent.pm module, iptables config, MySQL database, httpd.conf, robots.txt, and your HTML files. These are all covered in the sections below.

        
        

Banishing 'mailto:'

                
The first and most urgent thing you need to do is to get email addresses off your website altogether. This means, unfortunately, banishing the venerable mailto: link. It's a real shame that perfectly good mechanisms should be removed because of abuse, but that's just the way the world is these days. You need to be defensive, and assume that the spammers will try to take advantage of your resources as much as possible.

It's an arms race

The important thing that you need to realize is that no matter what blocks we put in place, this game is an arms race. Eventually the spambot writers will develop smarter bots which circumvent our techniques. Therefore you want to have a failsafe, which will prevent email addresses from getting into the hands of the spambots even if all else fails. The only real way to do that is to completely remove all email addresses from your website.

Contact forms

You should replace the mailto: links with links to a special form where people can type their name, email address and message. A CGI can then deliver the email, and your email address never has to be disclosed. There are a number of different mailer scripts out there - just be careful to check for vulnerabilities which could allow malicious users to use the form to send email to third parties (i.e. spam, ironically enough) using your server. The formmail script is popular, but it apparently still has vulnerabilities (thanks to Christopher Fisk for pointing this out, and Nicholas Clark for pointing me to a good replacement for formmail). Another alternative is Soupermail (thanks to Emery Jeffreys). The Embperl package has a simple MailFormTo command to send an email from a form.

Since I have seen guestbooks out there which have been extensively defiled by spambots, I would add that you should have a preview screen on your contact forms. This will ensure that an email doesn't get fired off simply by a spambot following the 'post' or 'contact' link (which it will likely try to do).
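
To make the idea concrete, here is a minimal sketch of what such a contact form handler might look like in Perl. The script name, form fields and addresses are hypothetical, and a real version should add proper input validation and the preview/confirmation step discussed above:

	#!/usr/bin/perl
	# contact.cgi - minimal sketch of a mailto: replacement (hypothetical names throughout)
	use strict;
	use warnings;
	use CGI;

	my $q  = CGI->new;
	my $to = 'webmaster@example.com';   # your real address never appears in the HTML

	if (defined $q->param('message')) {
	    # Strip newlines from the sender address to prevent mail header injection
	    (my $from = scalar($q->param('email')) || 'unknown') =~ s/[\r\n]//g;

	    # Deliver the message via the local sendmail binary
	    open my $mail, '|-', '/usr/sbin/sendmail -t' or die "can't run sendmail: $!";
	    printf $mail "To: %s\nReply-To: %s\nSubject: Website contact form\n\n%s\n",
	        $to, $from, scalar $q->param('message');
	    close $mail;

	    print $q->header, "<p>Thanks, your message has been sent.</p>\n";
	}
	else {
	    # Show the form; a real version should also add a preview/confirmation step
	    print $q->header,
	        '<form method="post" action="contact.cgi">',
	        'Name: <input name="name"><br>',
	        'Email: <input name="email"><br>',
	        '<textarea name="message" rows="5" cols="40"></textarea><br>',
	        '<input type="submit" value="Preview">',
	        '</form>';
	}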

Alternatives to totally banishing mailto:

There are alternatives to completely removing email addresses, but they all depend on the stupidity of the spambot, and so could be compromised by a new generation of pest. These include:

  • Write out email addresses in a non-email format, e.g. instead of writing 'user@example.com' you would write 'user at example dot com', or something similar. It would only take some spambot with a little more intelligence to be able to scan these patterns and pick up "likely" addresses, so this strategy is a little risky. Any consistent method you choose to write out email addresses could in theory be analyzed and decoded by a savvy bot.

  • URL-encode email addresses (suggested by Anthony Martin). Most browsers allow the mailto: URL to contain URL-encoded values: the string "%40" equals the "@" symbol, while "%2E" equals the period. For that matter, you could URL-encode the entire address - name, host and domain - so it's one long encoded string. This might work in the short term, but it would be relatively easy for spambots to get smart enough to decode it.

  • Use HTML character entities to encode email addresses (suggested by Seann Herdejurgen). Similar to the previous method. For example, <a href="&#109;ailto&#58;user&#64;example&#46;com">user&#64;example&#46;com</a>

  • Add stuff to the email address to make it invalid, but so that a human could easily know what to do to make it work. An example of this is writing 'username@_NO_SPAM_example.com'. You need to remove the "_NO_SPAM_" part to make the email address valid. You can have some kind of explanation to make it clear what people have to do to use the address. Personally, I don't like this - you're depending on a level of sophistication on the part of your users which is risky. In my experience, there are a lot of very 'novice' level users out there, who only know how to click on a link. They don't know how to edit an email address. Heck, I've had people come to my site by typing the URL into Google, rather than the 'Location' box of their browser. Also, people don't read instructions.

  • Make graphics images which contain the email address. Spambots usually don't download graphics, and even if they did, they probably couldn't decode the bits to get the text. However, they could do it in theory, since software for doing OCR (optical character recognition, getting text from scanned documents) has been around for a while. A downside to this approach is that the user has to manually copy down the email address, since it can't be cut'n'pasted. Also, you can't put a mailto: link on the image, otherwise you're back to square one. Finally, blind people (who use braille browsers) will have a BIG problem with pure graphics (unless you put in some kind of ALT text, using the techniques previously mentioned to obfuscate the email address). You could also put a link to a contact form, with an argument in the link telling your server internally what email address to use. For example, the link could say "contact.cgi?to=23", where '23' is some database key to the actual email address. But the downside here is that you still need to generate the image, which is a bit of a pain in the ass if you have a lot of them. You can do it automatically, if you're willing to put the work in and write the scripts. There are some very nice graphics generation packages out there on CPAN for Perl. Here's an example of how such an image can be generated:

    Robert Logan tells me that the PBM package (which seems to be packaged with Linux) is a great way to generate these graphics, for example:

    	    shell> echo user@example.com | pbmtext | pnmcrop | pnmpad -white -l2 -r2 -t2 -b2 > email.pnm
    	    shell> convert email.pnm email.gif
    	    

    This produces a small image of the address which looks pretty neat and tidy.

    An alternative to this (suggested by Andrew Park) is to just make certain characters into graphics, which can then be used again and again for all kinds of email addresses. For example, you could make a GIF of the '@' symbol, and possibly other common parts such as ".com" and ".org". If you have code on the server side that can then automatically convert email addresses into the appropriate HTML, then this will fool most spambots (for now!).

  • Use JavaScript to make your email links hard to recognise for spambots. I personally don't like my site to be dependent on JavaScript, since I turn it off in my own browser (mostly for security reasons and to avoid the popup and popunder ads). But, there have been a number of methods suggested for doing this, for example:

    • From Marcell Toth:
      <html>
      <script language="javascript">
      function SendMail(Login, Server)
      {
      	window.navigate("mailto:" + Login + "@" + Server);
      }
      </script>
      <body>
      	<a href="javascript::SendMail('marcell.toth', 'nextra.hu')">Mail me</a>
      </body>
      </html>
      		

    • A JavaScript email encryptor (thanks to Joe Tucek for the link)

    • From Brandon Gillespie:
      There is a fourth means of dealing with the mailto: link I didn't see mentioned,
      but which I have had good success with. Instead of doing href="mailto:foo@bar" you
      create an obfuscated javascript function for each domain (for me they are all mailed
      to the same domain, so its easy), like:
      
      function m_sfcon (u) {
          pre = "mail";
          url = pre + "to:" + u;
          document.location.href = url + "@sfcon.org";
      }
      
      Then use:
      
      href="javascript:m_sfcon('myusername')"
      		

  • Some other interesting ideas:

    • From Thomas "Balu" Walter:
      While working on my new homepage I found myself asking how to defend against those bots.
      I didn't want to break my email address or to hide it using javascript or images -
      especially because my visitors should be able to use mailto: links as expected. 
      
      My provider set up a "catchall" mailbox where all mails are stored that are sent
      to my domain @example.com. Since I am developing my pages using PHP I thought of
      a way to make them unique for each visitor. The result was the following small function:
      
      function generateMail(){
          global $HTTP_SERVER_VARS;
      
          // is a proxy in use?
          if ($HTTP_SERVER_VARS["HTTP_X_FORWARDED_FOR"]) {
              $ip = $HTTP_SERVER_VARS["HTTP_X_FORWARDED_FOR"];
          } else {
              $ip = $HTTP_SERVER_VARS["REMOTE_ADDR"];
          }
      
          return "web-".sprintf("%u", ip2long($ip)).".".time()."@example.com";
      }
      
      This generates an address in the form
      
         web-32bitIP.timestamp@example.com
      
      This way I can easily reject addresses that were found by bots and are used for SPAMming.
      I even know where the bot came from and when. I can even find them in the webserver-logfiles
      and analyze their activity.
      		

    • From Ilmari Karonen (in response to the update regarding newer versions of spambots which use google to find pages, and then follow no links on my website, thus foiling any link traps):
      If the spambot is indeed not following links, an obvious solution is to
      feed all mailto: links through a redirector script.
      
      On a site I'm currently building, I'm doing the following:
      
       1. All email links are given as "/email/?h=host&u=user".
       2. The directory /email is disallowed in robots.txt.
       3. Any URL under /email which is _not_ in the above format acts as a
          spambot trap.
       4. All pages contain links to "/email/something_random.html".
      
      This works great as long as there are no e-mail addresses visible on the
      page.  I'm currently obfuscating those by inserting the HTML code
      
        <font size="1" color="white" style="font-size: 1px;">X</font>
      
      on either side of the @ sign.  I figure a bot has to be pretty damn
      clever to de-obfuscate that, while it's pretty obvious to a human even
      if the CSS hiding trick fails.
      		

    • From Kamil Prusko:
      The code below works even without JavaScript enabled - in that case the link simply goes to the page with a contact form.
      
      RULES
      
      - If the 'name' attribute begins with '+', the content of the tag will be replaced by the email address.
      - id="ghostaddr" is required; the id is removed after the ghostaddr() call, so don't bind CSS to this id.
      
      
      I don't think spambots are smart enough to treat a ' ' as '@' :-)
      
      ----------------------------------------------------------------
      
      <html>
      <body>
      
        <a href="contact.php?23" name="+me domain.net" id="ghostaddr">email me</a><br>
        <a href="contact.php?53" name="judy domain.net" id="ghostaddr">Judy</a>
      
      </body>
      
      
      <script type="application/x-javascript">
      <!--
      
      function ghostaddr()
      {
        var at, pos;
      
        while ( obj = document.getElementById( 'ghostaddr' ) )
        {
            pos = obj.name.indexOf(' ', 0);
      
            if ( pos <= 0 ) {
                 obj.id = null;
                 continue;
            }
      
            if ( obj.name.charAt(0) == '+' ) {
                 at = obj.name.substring( 1, pos ) + '@' + obj.name.substring( pos+1, obj.name.length );
                 obj.innerHTML = at;
            } else {
                 at = obj.name.substring( 0, pos ) + '@' + obj.name.substring( pos+1, obj.name.length );
            }
      
            obj.href = 'mail' + /*Die you spambot; to:*/ 'to:' + at;
            obj.id = null;
        }
      }
      
      ghostaddr();
      
      -->
      </script>
      </html>
      		

As you can see, there are many ways you can make email addresses harder for spambots to recognise. It all depends on your own expertise and preferences. Still, in my opinion the only totally safe way to ensure spambots can't harvest email addresses is to totally remove them from your website! Can't get around that one, no matter how smart they get...

Next, we look at setting up the tools required to make our spambot trap work.

        
        

MySQL

                

File: badhosts.dump
Type: ASCII text
Size: 743 bytes

To implement our spambot trap, we first need to set up a MySQL database, where we store records of the hosts which are to be blocked. This doesn't have to be MySQL, but I use it because it's extremely fast, and very appropriate for this kind of application. You need to create a new database, called 'badhosts'. You then create a table, again called 'badhosts', with the following structure:

  • id - tinyint unsigned not null, primary key - Unique key for the record
  • ip_address - char(16) not null, indexed - The IP address of the host to be blocked
  • type - enum('agent','trap','manual') not null - How the agent was blocked: because it was a known bad agent, or it fell into the trap, or it was added manually using the block_ip script (below)
  • reason - char(200) not null - The reason for the block: usually the HTTP User-Agent of the spambot, or else a text explanation for a manual block
  • power - tinyint unsigned not null - The power of two to use when calculating the number of minutes to block for. For example, 2 to the power of 3 is eight minutes. The power is incremented every time a new block has to be created for a particular IP address (i.e. the block time is doubled)
  • created - datetime not null, indexed - When this block was created
  • expiry - datetime not null, indexed - When this block expires (calculated by adding 2^power minutes onto the created time)

You could use the dump provided above to load directly into your database:

	shell> mysqladmin create badhosts
	shell> mysql badhosts < badhosts.dump
That's about it! The fields which are marked as 'indexed' are the only ones which need indexes, because they are searched on to see if a particular IP address has been previously blocked, and also to see which blocks should be removed because they've expired.

If you have access privileges set on your MySQL databases, then you need to allow the Apache user (usually 'nobody') access. The other script that will require access is badhosts_loop, which runs as root. Next, we look at the script that populates this database.
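
If you would rather create the table by hand than load the dump, the schema reconstructed from the field list above looks roughly like the following, driven through Perl DBI. Treat it as a sketch: the AUTO_INCREMENT on 'id', the choice of indexes and the connection credentials are assumptions, and badhosts.dump remains the authoritative definition.

	#!/usr/bin/perl
	# create_badhosts - recreate the badhosts table (badhosts.dump is the real reference)
	use strict;
	use warnings;
	use DBI;

	# Adjust the user/password to suit your MySQL setup
	my $dbh = DBI->connect('DBI:mysql:database=badhosts', 'root', '', { RaiseError => 1 });

	$dbh->do(q{
	    CREATE TABLE IF NOT EXISTS badhosts (
	        id         TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
	        ip_address CHAR(16)         NOT NULL,
	        type       ENUM('agent','trap','manual') NOT NULL,
	        reason     CHAR(200)        NOT NULL,
	        power      TINYINT UNSIGNED NOT NULL,
	        created    DATETIME         NOT NULL,
	        expiry     DATETIME         NOT NULL,
	        INDEX (ip_address),
	        INDEX (created),
	        INDEX (expiry)
	    )
	});

	$dbh->disconnect;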

        
        

BlockAgent.pm

                

File: BlockAgent.pm
Type: Perl5 module source text
Size: 8 KB

Perl module for using from Apache/mod_perl

File: bad_agents.txt
Type: ASCII text
Size: 1 KB

Example config file for specifying bad user agents

Note: This module currently only works with Apache 1.3.x, not the newer Apache 2.x. This is because the Apache team changed the module API for Version 2, and I haven't had the time to figure out exactly what needs to be done to make it work with both versions (if that's even possible). If anyone has any clues on this or wants to take a stab at it, then let me know... meanwhile, apologies to those of you running Apache 2.0.


The BlockAgent.pm Apache/mod_perl module is taken from the excellent book "Writing Apache Modules with Perl and C" by Lincoln Stein & Doug MacEachern (O'Reilly). It acts as an Apache access control handler which checks the HTTP User-Agent header against a list of known bad agents. If there's a match, then a 403 'Forbidden' code is returned. The module compiles and caches a list of subroutines for doing the matches, and automatically detects when the 'bad_agents.txt' file has changed. I have found that it has no noticeable impact on the performance of the webserver. This module is useful in the case where you know for certain that a given User-Agent is bad; there's no point in letting it go anywhere on your site, so it's a good first line of defense. We'll cover how to add this module to your website a little later, along with the rest of the configuration settings in the section on httpd.conf.

Of course, one of the first arguments you'll see with regard to this method of blocking spambots is that it's easy to circumvent, by simply passing in a User-Agent string which is identical to the major browsers out there. This is perfectly true, but don't ask me why the spambot writers haven't done this - maybe it's a question of pride or ego, they want to see their baby out there on record in Web server logs. I honestly don't know. The main point is that at present, the User-Agent header CAN be used very effectively to block most bad agents. But, I have added more features so that we can also block agents which look ok, but behave badly by going somewhere they shouldn't - the Spambot Trap. More on that soon.

You'll notice that the bad_agents.txt file which I have supplied here is very comprehensive. A good strategy here is probably to save the full version somewhere (perhaps as bad_agents.txt.all), and just keep the ones you actually encounter in the bad_agents.txt file. Then you keep the list shorter, and more relevant to what actually hits you. For example, my bad_agents.txt file currently has the following lines in it, because these are the spambots that I see most frequently (current as of 2003-06-22 - I'll update it occasionally):

	^[a-zA-Z0-9]+$
	^Baiduspider
	^Franklin Locator
	^IUFW Web
	^Mac Finder
	^Missigua Locate
	^Missigua Locator
	^Missouri College Browse
	^Program Shareware
	^Ram Finder
	^Under the Rainbow
	^WebFilter
	^WEP Search
	^Xenu Link Sleuth
	^Zeus
First of all, I know that some people will probably complain that some of these agents are not technically 'spambots'. I have blocked them based on the behavior which I have actually observed on my sites. Some agents will often ignore robots.txt, and moreover will demand many, many pages extremely quickly. This is not a well-behaved robot, so it's not welcome on my site. Some so-called legitimate agents have so frequently fallen into the trap that I decided to treat them as hostile (or, at the very least, incompetent). You can make your own decisions based on the behavior which you observe. For me, I tend to only add agents to my blacklist when I have seen a lot of bad activity over a period of more than a week or so.

You'll also notice from this that BlockAgent.pm is very flexible, being able to take full advantage of the excellent regular expression capabilities of Perl. This means you can capture a lot of different agents with just one line. For example, the very first line catches all the variations of the agent which passes in random strings of capital letters, e.g. FHASFJDDJKHG or UYTWHJVJ. The spambot obviously thinks it's being pretty smart by looking different each time, but by using an easily identifiable pattern, it shoots itself in the foot. Hah.
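
To illustrate the idea (this is a sketch of the technique, not the module's actual code), the pattern list can be compiled once into a set of anonymous matching subroutines, which are then cheap to run against each request's User-Agent:

	# Compile each pattern from bad_agents.txt into an anonymous matching sub
	use strict;
	use warnings;

	my @matchers;
	open my $fh, '<', '/www/conf/bad_agents.txt' or die "can't open bad_agents.txt: $!";
	while (my $pattern = <$fh>) {
	    chomp $pattern;
	    next if $pattern =~ /^\s*(#|$)/;        # skip blank lines and comments
	    my $re = qr/$pattern/;
	    push @matchers, sub { $_[0] =~ $re };
	}
	close $fh;

	# Later, for each request (the agent string would come from the User-Agent header):
	my $agent = 'DSurf15a 01';
	print "blocked\n" if grep { $_->($agent) } @matchers;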

The original version of the BlockAgent.pm script is well explained in the O'Reilly book, but I've added an extra hook that checks to see whether the client is accessing any of the spambot trap directories. If it is, then we add an entry to the MySQL database (you could use another relational database if you want, as long as it's accessible from Perl DBI).

The first time an IP address is blocked, an expiry of one minute is set. If the same host subsequently comes in and falls into the trap again, then the expiry time is doubled. And so on. This way, the block gets longer and longer, in proportion to how persistently the spambot revisits our website. Also, the initial block expires very quickly, so that if the spambot is coming from a large network such as AOL (which works through multiple IP address proxies), you won't be blocking everyone on that network. The final benefit of quick expiry is that you probably won't build up a very large list of blocked IP addresses in your iptables list, thus saving resources.

The expiry time is a hint for the minimum amount of time the block should exist, rather than being an exact measure. This is because the badhosts_loop script (see below) has a cycle time of 10 minutes, which is the amount of time the script waits before regenerating the blocks (if no other hosts get blocked). This means that a block which has an expiry of one minute could potentially be in place for up to ten minutes. This is actually a benefit, since ten minutes is a pretty good period of time to block a host in the initial case. If the host keeps re-offending, then its expiry time will gradually increase, the number of minutes being doubled each time - 2, 4, 8, 16, 32 and so on.

Once an IP address is blocked, the spambot can't even connect to our web server, since the iptables rule uses the DROP target. This means that no acknowledgement is given to any packets coming in from the bad host, and as far as they know, our server has just gone away. Hopefully, after this happens for long enough, our server will be taken off the spambot's "visit" list. Another nice little side-effect of this is that the spambot will probably have to wait for a while before giving up each connection attempt. Anything that makes them waste more time is ok by me!

BlockAgent.pm notifies the badhosts_loop script that something has happened by touching a file called /tmp/badhost.new. The badhosts_loop script checks this file every few seconds, and if it has changed then it knows that a new record has been added to the database and it needs to re-generate the blocks list.
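
In outline, the database side of this logic looks something like the following sketch (not the module's exact code; the column names follow the badhosts table, while the connection credentials and the example IP and agent are placeholders):

	use strict;
	use warnings;
	use DBI;

	# Record a block for an offending IP, doubling the expiry each time it re-offends
	sub add_block {
	    my ($ip, $type, $reason) = @_;
	    my $dbh = DBI->connect('DBI:mysql:database=badhosts', 'nobody', '', { RaiseError => 1 });

	    # Find the highest power used so far for this IP, and go one higher
	    my ($last_power) = $dbh->selectrow_array(
	        'SELECT MAX(power) FROM badhosts WHERE ip_address = ?', undef, $ip);
	    my $power = defined $last_power ? $last_power + 1 : 0;    # 2^0 = 1 minute initially

	    # expiry = created + 2^power minutes
	    $dbh->do('INSERT INTO badhosts (ip_address, type, reason, power, created, expiry)
	              VALUES (?, ?, ?, ?, NOW(), NOW() + INTERVAL ? MINUTE)',
	             undef, $ip, $type, $reason, $power, 2 ** $power);
	    $dbh->disconnect;

	    # Tell badhosts_loop that something has changed
	    open my $fh, '>', '/tmp/badhost.new' or die "can't touch /tmp/badhost.new: $!";
	    close $fh;
	}

	add_block('192.0.2.1', 'trap', 'Example-Agent/1.0');    # hypothetical example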

The BlockAgent.pm script is our alarm system. It's what tells us that something happened. In order to act on this information, we need to be able to add rules to the iptables firewall. We'll cover this next.

        
        

iptables

                

File: iptables.conf
Type: Bourne shell script text executable
Size: 3 KB

Sample iptables config script

iptables is a very nice way of providing a good level of basic network security for your server. It gives you a very easy way to configure who can and cannot have access to your machine at the most basic level - network packets, ports and protocols.

The example iptables config file given here is complete, but you should customize it to your own needs in terms of what services you need to be visible to the world. Remember: If you are not really using a service, then turn it off, and block that port. The best security policy is one where you say "Block everything by default" and then only explicitly allow those services that you know you need. This is much better than allowing everything by default and then attempting to figure out everything that you should block - you'll always miss something. Incidentally, be very careful experimenting with iptables, especially if you are working on a remotely hosted server which you don't have physical access to. It's quite easy to block yourself completely, in which case you'll find it very difficult to login to fix the problem!

The bit of this script which is most relevant to the spambot trap is that we create two chains, called 'blocks0' and 'blocks1'. These are our own custom chains, which we can then add rules to. The badhosts_loop script will flush these chains and build them back up whenever a spambot falls in your trap. Once the spambot's IP address is on the blocks list, that host cannot connect to your server at all (at least, via HTTP - other protocols such as ssh are left open for safety, in case you manage to get yourself blocked).

The iptables.conf file is an executable script which should be run at bootup along with all the other services in your /etc/init.d/ or /etc/rc.d/ directory. You'll need to consult your Linux distribution documentation to see how to set this up (they all seem to do it a little differently - for example, Debian uses update-rc.d, RedHat uses chkconfig, and so on). You only need to run iptables.conf once after every boot, to set up the blocks chains. You could also just add the following line to your /etc/rc.local file (or equivalent), so that the script is automatically run on reboot:

	/path/to/iptables.conf

The reason why we need two blocks chains (blocks0 and blocks1) is covered in the next section on badhosts_loop - this is the script that actually adds the firewall rules.

        
        

badhosts_loop

                

File: badhosts_loop
Type: perl script text executable
Size: 4 KB

Script run in the background to actively block spambots using iptables

File: block_ip
Type: perl script text executable
Size: 572 bytes

Script to add bad hosts manually at the command line

You run this script in the background, as root. It has to be run as root, because only root has the ability to add rules to the firewall. The script spends most of its time sleeping. It wakes up every five seconds or so and does a quick check on /tmp/badhost.new. If this file has been changed since the last time it looked, then it goes and re-generates the firewall blocks list with all the current (non-expired) blocks. If nothing else happens, then the script will automatically cycle every ten minutes (by default - you could adjust this for your situation, by changing the value of the $seconds_in_cycle variable), to ensure that blocks really do expire even if there is no new activity.

The first version of this script used a single blocks chain, and it was discovered that spambots could connect to the webserver during the brief period when that chain had been flushed and was being rebuilt. The new version uses two chains (blocks0 and blocks1) round-robin style: if blocks0 was used the previous time the rules were updated, then this time blocks1 will be used, and so on. First we add the currently active block rules to the new chain, and only then flush the previously used chain. The two chains briefly overlap during the rebuild, so there is never a period with no block rules in place, and no gap for the spambots to sneak through.
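
The chain-swapping logic is roughly as follows (a sketch only, not the actual badhosts_loop code; it assumes that iptables.conf has already created blocks0 and blocks1 and hooked them into the INPUT chain, and that only port 80 is being blocked, as described above):

	use strict;
	use warnings;

	# Toggles between 0 and 1 on every rebuild
	my $current = 0;

	sub rebuild_blocks {
	    my (@bad_ips) = @_;                 # currently active (non-expired) blocks
	    my $new = $current ? 0 : 1;         # the chain we are about to fill
	    my $old = $current;                 # the chain currently holding the rules

	    # Fill the new chain first, so that blocking is never interrupted
	    system("iptables -F blocks$new");
	    for my $ip (@bad_ips) {
	        system("iptables -A blocks$new -s $ip -p tcp --dport 80 -j DROP");
	    }

	    # Only now flush the previously used chain
	    system("iptables -F blocks$old");

	    $current = $new;
	}

	rebuild_blocks('192.0.2.1', '198.51.100.0/24');    # hypothetical blocked hosts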

You need to add badhosts_loop to your startup scripts, so that it is started on every boot. As with the iptables.conf script, you will need to consult your Linux distribution documentation to see how to set this up (they all seem to do it a little differently - for example, Debian uses update-rc.d, RedHat uses chkconfig, and so on). You could also just add the following line to your /etc/rc.local file (or equivalent), so that the script is automatically started up on reboot:

	/path/to/badhosts_loop --loop &
This will start the script looping in the background. The script automatically checks to see if it is already running, by attempting to lock /var/lock/badhosts_loop.lock. If the file is already locked then the script will exit with an error message. If you want to just run the script once, without looping, then just omit the '--loop' option. This can be useful for testing. Also, if you want to just prod the script to make it update the iptables blocks, you just need to touch the alert file, /tmp/badhost.new.

Logging is done to /var/log/badhosts_loop.log by default. Every time the script generates the blocks list, it writes a list of all the blocks to the log. This is a good place to monitor if you're interested in what hosts are being blocked. You can see examples of the log output in the Spambot Trap Log Snapshots article. The log shows the IP address which is being added, then, in parentheses, the power of 2 which is being used to calculate the expiration time in minutes. For example, (3) means an expiration of 2^3 = 8 minutes. The power is increased by one every time the same IP is blocked. Next we have the start and end dates/times for this block, the general reason for the block (agent, trap or manual - see below) and finally the name of the User-Agent which committed the crime. This can be useful for quickly seeing whether you need to add a new one to the bad_agents.txt file.

The general reason for a block (the 'type' column in the database) has three possibilities: either the bot was recognised as a bad agent immediately through its User-Agent HTTP header ('agent'), or it fell into the trap directory ('trap'), or else it was added manually ('manual'). The manual option is there so that you can use the block_ip script to add bad hosts yourself explicitly, from the command line. For example, if you notice a lot of blocks occurring from the same subnet, then you may want to block that subnet completely for efficiency, rather than having each individual IP address needlessly clogging up your filters. The syntax is block_ip <ip address> <power> <reason>. You can use the standard iptables notation for denoting IP ranges. For example:

	block_ip '207.150.173.0/24' '20' 'Abusive subnet'
The script will automatically touch the /tmp/badhost.new file, so that the block will be enabled immediately (or within seconds).

The badhosts_loop script is a pretty stable program that should just sit there and chug quietly, not taking up much in the way of resources. Checking for a file being changed every five seconds is not a big deal in Unix, so you shouldn't even notice it.

Now you have to create the trap itself - the spambot_trap directory.

        
        

spambot_trap/ Directory

                

File: spambot_trap.tar.gz
Type: gzip compressed data, from Unix
Size: 4 KB

Sample spambot_trap directory

You can create this directory anywhere on your server. We will create an alias in httpd.conf to access it. I put mine in /www/spambot_trap/. The point is, this doesn't have to be a real directory under your webserver directory root. If you use the Alias directive, then multiple websites can access the same spambot_trap directory, potentially through different aliases. You can use the sample tarball as a starting point; it has subdirectories and links which the spambots I have seen find irresistible. You should create your own image for the unblock_email.gif file, so that it shows a valid email address of your own.

The spambot_trap and spambot_trap/guestbook/ directories are not used directly to spring the trap. This is because I wanted to have a warning level, a lead-in, where real users would be able to realize they are getting into dangerous waters and could then back out. You're going to be placing hard-to-click links on your web pages which lead into the real trap, and there's always a chance that a real user will accidentally click on one of these. So, some of the links will point into the warning level. I have made a GIF image which contains a warning text. Why an image? Mainly because spambots can't understand images, and I didn't want to give big clues like "WARNING!!! DO NOT ENTER" in plain text. So, the user sees the warning, the spambots don't. If the spambot proceeds into any of the subdirectories (email, contact, post, message), then the trap is sprung and the host is blocked.
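
To make this concrete, the unpacked trap might be laid out roughly like this (the file names here are illustrative; the sample tarball above is the real reference):

	/www/spambot_trap/
		index.html            <- warning level (shows the warning image)
		unblock_email.gif     <- your contact address, rendered as an image
		guestbook/
			index.html    <- still the warning level
			email/        <- entering any of these four springs the trap
			contact/
			post/
			message/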

You also need to try to stop good spiders (e.g. google) from falling into the spambot trap and being blocked. To do this, we utilize the robots.txt file.

        
        

robots.txt

                

File: robots.txt.static
Type: ASCII text
Size: 217 bytes

Sample robots.txt

This should allow good robots (such as google) to surf your site without falling into the spambot trap. Most bad spambots don't even check the robots.txt file, so this is mainly for protection of the good bots.

You'll see that we list a bunch of directories under '/squirrel'. This could be anything; you'll set an alias later in httpd.conf. In fact, you may even want this to be dynamically generated (see later, under Embperl), so that you can quickly change the name of the spambot trap directory if the spambots adapt and start avoiding it. At present, a static setup should work just fine, however.
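
The static version looks something like this (the exact directory names follow the sample trap layout above, so adjust them to match your own alias):

	User-agent: *
	Disallow: /squirrel/
	Disallow: /squirrel/guestbook/
	Disallow: /squirrel/guestbook/email/
	Disallow: /squirrel/guestbook/contact/
	Disallow: /squirrel/guestbook/post/
	Disallow: /squirrel/guestbook/message/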

Next, we need to look at the bait - links within your HTML files which lead the spambot into the trap.

        
        

Your HTML Files

                

File: sample_html.txt
Type: HTML document text
Size: 366 bytes

Sample HTML code for trap

File: guestbook.gif
Type: GIF image data, version 89a, 1 x 1
Size: 43 bytes

Sample transparent 1 pixel image for hiding the trap

Here's an example of HTML with links into the spambot trap:

	<HTML>
	
	<BODY BGCOLOR="beige">
	<A HREF="/squirrel/guestbook/message/"></A>
	<A HREF="/squirrel/guestbook/post/"><IMG SRC="/images/guestbook.gif" 
             WIDTH=1 HEIGHT=1 BORDER=0 ALT="Warning text"></A>
	
	Body of the page here
	
	<TABLE WIDTH=100%>
	       <TR>
	           <TD ALIGN=RIGHT>
			  <A HREF="/squirrel/guestbook/">
				<SMALL><FONT COLOR="beige">guestbook</FONT></SMALL>
			  </A>
		   </TD>
	       </TR>
	</TABLE>
	
	</BODY>
	
	</HTML>
Spambots tend to be stupid. You'd think they would check for empty links (which don't show up in a real browser), but they don't seem to. Sure, they may get smarter, but meantime you might as well pick the low hanging fruit. So, the very first thing in the body of your HTML should be an empty link which goes straight into the trap proper - not the warning level, but the actual trap itself. This is because there is no way for someone using a real browser to click on this link, and good spiders will ignore it anyway because it's in the robots.txt file.

You could also use fake HTML tags in between the link anchors, something like this:

	<A HREF="/squirrel/guestbook/message/"><faketag></faketag></A>
(Thanks to Paul Williams for that one.) Incidentally, Joseph Wasson points out that users can still accidentally find these "hidden" links if they use the TAB key to cycle through links. One more thing to consider. Still, this will happen relatively infrequently, and those unfortunates who do fall in the trap by mistake should at least have an email address to fall back on - the one displayed on the "You've been blocked" page. Come to think of it, you might want to present that email address as more than just an image, so that text-based browser users can get themselves unblocked too!

We also use a one pixel transparent GIF (a favorite web bug technique) to anchor a link to the trap, just in case the spambot is smart enough to avoid empty links. If we put this as the very first thing in the body, then it'll be pretty hard for a real user to click on, since it's only one pixel in size. But a spambot will quite happily go there!

Finally, there is an example of a non-graphic, text based link. This will be placed on the right side of the screen by the table, and the text will appear in the same color as the background (in this example, beige). The link does not go straight into the trap, but into the warning level, because with this one there is a bigger chance that real people could click on it accidentally. The link may be invisible, but it's still there, and someone could find it. So, they get to see a nice warning, and they should back off from there. But the spambot won't. By the way, we have the link going to /squirrel/guestbook/ rather than just /squirrel/ because some of the spambots seem to specifically follow links with certain keywords, e.g. 'guestbook', 'message', 'post', etc.

One caveat: These single-pixel images and "invisible" links will show up on browsers for the blind, and other text-to-speech browsers. Moreover, they won't be able to read the warning image! So, you might be more comfortable just using the empty link option (not sure if braille browsers follow those too...). Something to think about. You could also make the warning text plain text rather than an image; in reality, I doubt the spambots parse any meaning from text. Another idea: Try putting some warning on the front page of your site, to the effect that the spambot trap is there, perhaps with a link to a page where they can find out more. Finally, you might put ALT text in the IMG tag so that people with text browsers can at least get a clue. Try to use non-obvious text, not stuff like "SPAMTRAP WARNING DO NOT CLICK", which is the sort of thing a spambot might be programmed to recognise... perhaps a haiku:

The one who clicks here
shall see much time pass before
he is forgiven

Walter Loscutoff suggests putting the hidden links in a DIV, which is set to be invisible. Then the trap code could look perfectly normal to the bot, but be invisible to a normal user. For example (the following code goes between the BODY tags):

	<DIV ID="SpamTrap1DIV" STYLE="position:absolute; left:0; top:0; width:50; height:50;
           clip:rect(0,50,50,0); z-index:1; visibility:hidden;">
	     <A HREF="/squirrel/guestbook/email/">Click here for emails</A>
	</DIV>
And a second option is to simply make the DIV visible, but place it off page ...
	<DIV ID="SpamTrap1DIV" STYLE="position:absolute; left:-100; top:-100; width:50; height:50;
           clip:rect(0,50,50,0); z-index:1;">
	     <A HREF="/squirrel/guestbook/email/">Click here for emails</A>
	</DIV>
Both of these options would be a headache for a spambot coder, because many sites use DIVs that might start invisible or off-page - thus the spambot has no way of knowing what might be a trap and what's regular HTML.

You can sprinkle these hidden links all around your HTML files. I put them in every single one, since I use Embperl templates which make that sort of thing very easy.

Finally, it's possible for normal browsers to fall into the trap via keyboard shortcuts - in Internet Explorer, the Tab key cycles through links on the page. If the trap is one of the first links (as it should be) then hitting Tab followed by Enter might get the unwary visitor blocked by accident. One way to prevent this is to add an onClick javascript handler to the link, which tells the browser not to follow the link by returning 'false'. Unfortunately IE does not honor this convention, so a little extra is needed to make it happen for most Windows users:

        <A HREF="/your/trap/dir/" onClick="event.returnValue = false; return false;"></A>
                
                

Embperl

                        

File: robots.txt.embperl
Type: ASCII text
Size: 337 bytes

Sample dynamic HTML code using Embperl

Embperl is a very nice templating solution for embedding Perl in your HTML pages, making it all very dynamic. I use it for all my web development. It also has features which make it easy to construct your websites in a modular, object-oriented manner (I wrote a tutorial for EmbperlObject).

The point of this is to make it easier to change the spambot trap directory without having to edit a whole bunch of files. We pass an environment variable to Perl from httpd.conf (see below), which says what the trap directory is called. We then use this in Embperl to substitute into the HTML and robots.txt files at request time. Thus if we wanted to change the name of the trap from 'squirrel' to 'badger', then we only need to change httpd.conf, restart apache, and we're done. All the links in the HTML are dynamic, as is robots.txt (see the samples above).
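
For example, a hidden trap link in an Embperl page might be written something like this (a sketch; it relies on the SPAMBOT_TRAP_DIR environment variable set via PerlSetEnv in the httpd.conf section below):

	<A HREF="/[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/message/"></A>

The dynamic robots.txt can build its Disallow lines in exactly the same way, so renaming the trap only requires the httpd.conf change described above.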

Now, we bring it all together in the Apache configuration file.

        
        

httpd.conf

                

File: httpd.conf
Type: ASCII English text
Size: 2 KB

Sample httpd.conf directives

File: startup.pl
Type: perl script text
Size: 491 bytes

Sample startup.pl script (used in httpd.conf)

You need to have mod_perl installed before you can use BlockAgent.pm. You should take a look at the sample given above, and integrate these directives into your own virtual hosts. The most important lines are:

	Alias /squirrel /www/spambot_trap
	PerlSetEnv SPAMBOT_TRAP_DIR squirrel
You should set the 'squirrel' name to whatever you'd like for your website; you'll then access the trap using a URL something like http://www.example.com/squirrel/guestbook/message/. This will spring the trap. You also need to set up the BlockAgent.pm access handler:
	<Location />
		  PerlAccessHandler Apache::BlockAgent
		  PerlSetVar BlockAgentFile /www/conf/bad_agents.txt
	</Location>
This ensures that all accesses to your website will go through BlockAgent.pm first. You should choose your own location for the bad_agents.txt file.

Finally, you might want to install Embperl so that you can embed Perl into your HTML code (always executed on the server side, never seen on the client side):

	# Set EmbPerl handler for main directory
	<Directory "/www/vhosts/www.example.com/htdocs/">

		# Handle HTML files with Embperl
		<FilesMatch ".*\.html$">
			SetHandler  perl-script
			PerlHandler HTML::Embperl
			Options     ExecCGI
		</FilesMatch>

		# Handle robots.txt with Embperl
		<FilesMatch "^robots.txt$">
			SetHandler  perl-script
			PerlHandler HTML::Embperl
			Options     ExecCGI
		</FilesMatch>

	</Directory>
That about does it. You should now have the setup which will allow you to block spambots. You'll probably be interested in monitoring what happens...
        
        

Monitoring

                

File: monitor_logs
Type: Bourne-Again shell script text executable
Size: 161 bytes

Sample script for monitoring web server logs

This simple script just tails the badhosts_loop log. You'll have fun (I do) seeing what comes on your site and promptly falls into the trap, and then SPLAT. No more spambot. Heh heh heh.

        
        

Conclusions

                
This setup works pretty well for me at the moment. I've no doubt there are flaws in my design, but it seems stable and is "good enough" for the time being. If you can see any improvements then I'd love to hear about them. To finish up, here's a summary of the strengths and potential weaknesses of the Spambot Trap system.
                
                

Strengths

                        
  • Does not rely exclusively on the HTTP User-Agent header, but at the same time allows us to block agents which we know to be bad.

  • Does not rely on the spambot abusing the robots.txt file. Many spambots don't even load it. But the robots.txt file will protect "good" robots from falling into the spambot trap. So, for example, googlebot will be just fine.

  • The blocks happen based on behavior, rather than trusting anything the spambot tells us about itself (e.g. User-Agent). Thus we don't rely on any prior knowledge of the spambots in order to block them; an entirely new one that we've never seen before will still fall in the trap and be duly blocked.

  • Once a spambot is blocked, then it cannot connect to your server again at all for the duration of the block. If it tries to connect, it won't even get a 'connection refused' error, because the firewall rule just quietly drops all the packets from the bad hosts. The iptables firewall is very effective, and more efficient at blocking hosts than anything you could put together with Apache. So, you save on server resources. If you're wondering whether the block lists might get large, I have found that with the constant expiring of blocks, the active block list has never been more than about 20 IP addresses at a time, out of a list (so far) of over 200 distinct hosts.

  • The blocks initially expire after one minute. This means that one-off offenders are quickly removed from the firewall rules. On the other hand, repeat offenders get progressively longer and longer blocks (doubled each time). This means that the more abusive a host is, the longer it will be blocked each time. It also means that if a bot is coming in from multiple IP addresses (through a proxy), then each of the individual IP addresses will probably not go on to be blocked for too long. Thus you won't be blocking everyone in AOL. On the other hand, if you continue to get hit from the same network, then it's obviously a source of trouble and should be blocked. If it's a major network like AOL, which you really don't want to block, then you need to take the IP addresses and times of the abuse, and send it to the sysadmin at the ISP concerned. There's really not a lot else you can do. I haven't seen this in reality, though. In my experience, the spambots come in from all sorts of different IP addresses, and the ones that are very persistent over time are mostly static IPs from DSL and small ranges of IPs from cable modems. These are the people with the always-on, high bandwidth capabilities which are needed for large scale email harvesting.

  • The system uses a relational database to manage the blocks, so it is very scalable, and you could potentially share the database between multiple servers. If any one server catches a spambot, the offending IP address can automatically be blocked at all the other servers as well. Also, because we don't delete expired blocks, we keep a history of the blocks, which could be analysed to justify more permanent iptables blocks of entire subnets, if desired.
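To make the doubling scheme in the expiry bullet above concrete, here's a minimal sketch of how a repeat offender's next block length might be computed. This is not the actual BlockAgent.pm or badhosts_loop code - the table name (badhosts) and columns (ip, num_blocks, expires) are just assumptions for illustration:

# Sketch only: doubling the block expiry for repeat offenders.
# NOT the real BlockAgent.pm; table and column names are assumed.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=badhosts_db', 'user', 'password',
                       { RaiseError => 1 });

sub block_host
{
    my ($ip) = @_;

    # How many times has this host been blocked before?
    my ($prior) = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM badhosts WHERE ip = ?', undef, $ip);

    # First offence: 1 minute. Each repeat doubles it: 1, 2, 4, 8, ...
    my $minutes = 2 ** $prior;

    # $minutes is computed numerically above, so interpolating it is safe.
    $dbh->do("INSERT INTO badhosts (ip, num_blocks, expires)
              VALUES (?, ?, NOW() + INTERVAL $minutes MINUTE)",
             undef, $ip, $prior + 1);
}

The real system then regenerates the iptables chain from whatever blocks are currently unexpired, as described in the badhosts_loop section.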
                
                

Weaknesses

                        
  • It would be possible for the spambots to get wise, and start following the robots.txt file rules. Then the spambot could in theory surf your entire site (or at least the bits allowed by robots.txt) without falling into the trap. However this also means that you can control where the spambot goes, which is the whole point of robots.txt. If you want, you can allow google into one part of the site, but exclude all others. Still, you should remove all email addresses from your site as the fail-safe.

  • It's possible that a spambot could come in through a proxy such as AOL, which means you'll be blocking multiple AOL IP addresses. This is not very nice, and I'm not sure what the solution is at the moment. All I can say is that it hasn't happened yet, and the worst offenders on my site all have static IPs. They seem to come in from cable and DSL connections mostly.

  • I don't know how feasible this would be, but it may be possible to mount a "denial of service" style attack on your webserver by making many requests to the spambot trap directory from different IP addresses. The attacker would, however, need to actually control those IP addresses rather than spoof them, since completing the TCP handshake for a web request takes more than one packet. In any case this falls more under the "attack" category than spambots - if someone tries it on your site then it's no longer a petty annoyance but a hostile action that can be pursued by legal means. The motivation is also totally different: spammers don't want to do this kind of thing, they just want their email addresses. DDoS attacks are notoriously difficult to track, but I think some progress has been made in the couple of years since the first ones brought down Amazon and Yahoo!. Anyhow, I just wanted to bring the idea into the light of day. If anyone has any clues about it then I'd be glad to know.

                
                

Possible future enhancements

                        
  • Spot large numbers of blocks occurring on a particular subnet, and automatically consolidate them into a single block covering the entire subnet (e.g. 128.123.31.0/24) - see the sketch after this list.

  • More interactive tools to allow removal of blocks

  • Analysis tools which can tell us something about patterns of abuse from particular networks.

  • Add the ability to generate pages filled with useless email addresses and tarpits (see below).

  • You could fairly easily change the expiry units to 'hours' or 'days' (I originally had days). You would need to change BlockAgent.pm and badhosts_loop, in particular the MySQL queries and the number of seconds between automatic regenerations of the blocks list (to make sure that blocks expire even if there are no new ones). With 'days', a host reaches a substantial block time more quickly, but on the other hand you would be blocking first-time offenders for quite a long time (1 day initially). In a nutshell, you can tune the expiry times so that they suit the kind of activity you observe. In my experience, units of days or minutes both work just fine; it's a matter of taste.
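Here's a rough sketch of the subnet consolidation idea from the first item above - it groups blocked IPs by /24 and emits one subnet-wide rule once a threshold is reached. The threshold of 5 and the 'blocks' chain name are just illustrative choices:

#!/usr/bin/perl
# Sketch: consolidate many individual blocks on one /24 into a single
# subnet-wide iptables rule. Threshold and chain name are assumptions.
use strict;
use warnings;

my @blocked_ips = @ARGV;            # e.g. pulled from the badhosts database
my %hosts_in_subnet;

foreach my $ip (@blocked_ips)
{
    (my $subnet = $ip) =~ s/\.\d+$/.0/;    # crude /24: 1.2.3.4 -> 1.2.3.0
    push @{ $hosts_in_subnet{$subnet} }, $ip;
}

foreach my $subnet (sort keys %hosts_in_subnet)
{
    my @hosts = @{ $hosts_in_subnet{$subnet} };
    if (@hosts >= 5)
    {
        print "iptables -A blocks -s $subnet/24 -j DROP\n";
    }
    else
    {
        print "iptables -A blocks -s $_ -j DROP\n" for @hosts;
    }
}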

If you can think of any more potential problems (or unrecognised strengths!) then I'd be happy to hear about them. I'd also welcome any comments on this document.
        
        

Alternative Ideas

                
If you find the concept of blocking IP addresses distasteful (and I would understand that), then there are other ideas which have been put forward by a number of people. These include feeding the spambot lots and lots of fake garbage email addresses (e.g. Wpoison), which "poisons" the spambot's harvest. Or slow it down: feed the spambot very slow-loading pages. Or even try to crash it with badly formatted pages (I love that one - exploit any possible buffer overflow bugs in the spambot by feeding it an email address thousands of characters in length)...

These are interesting, more pro-active ideas, but they also involve more in the way of resources on your server - every slow-loading connection means one Apache process tied up for the duration of the request. Still, it's a worthy goal: I think if people try out a variety of methods, then the biodiversity of it all will also be harmful to the spambots. The more things they have to guard against, the better, in my opinion...

I may try to incorporate these concepts in future versions of the spambot trap. A choice of methods can only be a good thing...


Appendices

        
        

Update 1: April 26th 2002: Evolution in action - the Spambots Strike Back

                

Changing expiry times to minutes rather than days

Since the original publication of the article I decided to change the units of the block expirations from 'days' to 'minutes'. Having thought about the problem of accidental and 'one-off' offenders, it seemed like a good idea, and seems to work just fine. Now, the badhosts_loop cycles every 10 minutes (rather than every day). Initial blocks expire in 1 minute, but may be in force for up to 10 minutes (until badhosts_loop gets around to regenerating the blocks). This seems to be a good compromise between measuring expirations in minutes, and not having TOO small a block expiry time - being blocked for 10 minutes is no big deal, and is enough to make most spambots give up for a while. And if they come back, the blocks just double each time.

Evolution in action - the Spambots Strike Back

I thought all was well until last night, when I noticed some suspicious activity on my crazyguyonabike.com server logs. There seemed to be a lot of activity in the guestbook, which was the behavior I had seen before from the DSurf et al spambots. But this thing wasn't falling into the trap. I quickly checked the User-Agent, and found it was "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)", which is quite a common browser. On looking at the logs more closely, I quickly realized that the spambots seemed to have gotten much, much smarter. Here are the relevant access, agent and referer log entries. I could be wrong about this, in which case I apologize in advance to the person who owns this IP address (apparently it resolves to tmp002048081386.STUDENT.CWRU.Edu - but naahhh, a student would NEVER be a source of this kinda thing, right???). There's a new twist to this story - see the next update... but based on what I've seen so far, I'm thinking the spambot is doing the following:

  1. Using Google to find pages. The requests come straight into the middle of the site, and sequential requests bear no relationship to each other (e.g. two requests for pages within journals are for totally unrelated journals). Also, I recently changed the inner workings of my site so that the queries use "doc_id" instead of "tour_id" to select a document. But these requests use the old "tour_id", which is what Google probably still has in its database. I only assume Google is the search engine because it is the most comprehensive search database on the planet, and if I were making a spambot to do this, it's what I'd use, wouldn't you? Finally, various people have independently informed me of the existence of this kind of spambot.

  2. Loading very specific pages. As before, the spambot is only requesting pages which look promising for email addresses - journals and guestbooks. This is what makes me think that this is a "new and improved" version of the earlier beast.

  3. Not making rapid requests. This gets around the old spambot error of loading too many pages too quickly, thus removing one more behavior pattern that we could spot.

  4. Following no links within the target site. This is the crucial evolution. The writer of the spambot made an important realization, which is that if you're using Google to find the pages, then you really don't HAVE to follow any links within the websites you're harvesting, since Google will eventually probably give you the links anyway, and you can go there directly. The spambot author must have also put this together with the realization that spambot traps are getting more and more risky, and robots.txt shouldn't even be touched. So, the spambot basically can't trust any links on the websites it visits. The websites are getting more prickly, just like in the wild when plants evolve defenses against being eaten. Now, the spambots have evolved to get around the prickly websites.

  5. Spoofing the User-Agent. The spambot authors finally realized that you might as well pass in a very common string for the User-Agent, thus invalidating any checks which might be made based on that. It had to happen eventually. Looks like it finally has.

  6. Not passing in Referer headers. The spambot never passes any referer headers, which may be one way of detecting it. But, all it has to do is evolve again to start passing these in, so this isn't a very compelling feature.

  7. Not loading images. The spambot never loads any images. This too could be a clue to deducing that a particular agent is a spambot. But again, some people turn off images in their browsers, and braille browsers (and Lynx) never load images anyway. Of course, you could always check for the User-Agent to see if it's a blind user, but then the spambot would just start passing in whatever was acceptable.

  8. Not detecting my Spambot Trap specifically. I thought initially that perhaps the spambot authors put in something to recognize my trap. So, I changed the look of the trap to see if the spambot would then follow links - to no avail. I think it just has a policy now of following no links at all.

In a nutshell, this particular spambot seems to have "evolved" to the point it is able to come into your website, one page at a time, working from search engine results. It looks just like a standard browser. It doesn't have to follow any links from your pages. So, how on earth do you block a spambot that looks exactly like a person? (It should be noted that not all spambots are doing this - I have just observed the one instance. But you can be sure that if it works, then it will spread. However, the Spambot Trap is still very useful for stopping badly behaved robots and other, more stupid spambots - of which there are many out there.)

The long term answer, I think, is that you just can't easily block a spambot that behaves exactly like a real browser. You could do all kinds of tricks, like requiring JavaScript, or cookies, or analysing behavior such as the non-loading of images or lack of a referer field. But in the end, all these will be smoothed over as the spambot writers add features to the toolkit. We just have to accept that the things will eventually look exactly like people browsing our site. I for one am not too keen on requiring JavaScript or cookies on my site, since I personally turn JavaScript off for security reasons, and also to stop those annoying popup ads. And a lot of people have privacy concerns over cookies. And anyway, all the spambot authors have to do is incorporate cookies (can't be all that hard). The JavaScript engine would be more tricky, but again - I'm not too keen on requiring JavaScript as a fundamental foundation of the internet. It's introducing just another level of complexity on something that was beautifully simple - HTML and HTTP.

In the short term, I believe I've found a way to block the new spambot. I'll experiment to see if the technique is effective - but would it be good to publish it here? Maybe, maybe not. The argument for total openness says that everyone should know about a technique that successfully blocks a certain spambot. But then the spambot authors also get to hear about it, and promptly plug the hole. So - better to keep our little weapons to ourselves, and keep the spambot writers in the dark, or better to be totally open, and then have the block neutralised quickly? That's a tough one...

So, why give the spambot publicity at all? Because I don't believe that we are best served by silence on these issues, in the wider context. The open source community has proved again and again that openness is the best policy when dealing with bugs, security issues and other threats. I am trying to bring this issue to the attention of a larger audience in the hope that some smart person out there will be able to figure out the next step in the evolutionary process.

Of course, the final answer is to remove email addresses from your website altogether, or else obfuscate them using one of the techniques mentioned earlier in this document. That's fine. But I am still incensed that I am required to play host to these things, especially given how repulsive I find spam in general. The spambots continue to use up my server resources, and that makes me mad...

I am thinking about ways to automatically analyse the behavior of the spambots so that I can block them based on their actions - perhaps keep a database of the IP addresses of hosts requesting documents. Using a fast, simple database like MySQL, that shouldn't be a problem; just about every page on my website is already dynamically generated anyway, so it's no big deal. If we keep track of the activity, we could note that a) no images have been loaded, b) the User-Agent claims to be a browser rather than a spider, and c) no referer fields are being passed - the current behavior of the spambot would betray it. But, as I mentioned earlier, it would be relatively easy for the spambot to start loading images (and just discarding them), and to start passing in referer headers.
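To make that a little more concrete, here's a rough sketch of the kind of bookkeeping I have in mind. It's illustrative only, not something I'm actually running - the 'requests' table, the one-hour window and the thresholds are all arbitrary assumptions:

# Sketch: per-IP request tracking to spot browser-impersonating bots.
# The 'requests' table and all thresholds here are assumptions.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=tracking', 'user', 'password',
                       { RaiseError => 1 });

sub record_request
{
    my ($ip, $uri, $agent, $referer) = @_;
    my $is_image = ($uri =~ /\.(gif|jpe?g|png)$/i) ? 1 : 0;
    $dbh->do('INSERT INTO requests (ip, is_image, agent, referer, stamp)
              VALUES (?, ?, ?, ?, NOW())',
             undef, $ip, $is_image, $agent, $referer);
}

sub looks_like_stealth_spambot
{
    my ($ip) = @_;
    my ($pages, $images, $no_referer, $claims_browser) = $dbh->selectrow_array(
        'SELECT SUM(is_image = 0), SUM(is_image = 1),
                SUM(referer IS NULL OR referer = ""),
                SUM(agent LIKE "Mozilla%")
           FROM requests
          WHERE ip = ? AND stamp > NOW() - INTERVAL 1 HOUR', undef, $ip);

    # Lots of page requests, no images, no referers, claims to be a browser.
    return ($pages && $pages > 20 && !$images
            && $no_referer >= $pages && $claims_browser);
}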

I also thought about using the behavior to detect (after a few requests) some suspicious activity, and then redirecting the browser concerned to a page which requires a human to answer some kind of question in order to validate themselves as a non-spambot. For example, if a particular IP address loads a lot of pages, but no images, and claims to be Mozilla compatible, AND provides no referer fields, then those might be good criteria for a quick checkup. It's just a vague idea at this point. It could be some kind of basic multiple choice form, just enough to require a real person. Obviously there should be a database of questions, which needs to be updated regularly, to stop the spambots being "taught" the correct answers. After submitting the form, the user is returned to the page where they were before, and can continue. I don't know how feasible this would be.

It even crossed my mind to have some kind of third party, non-profit website, something like "personvalidator.org", which could be used to validate people from multiple websites using this method. The questions and answers could then be centralized on a site which is specifically designed to root out non-humans. The website could have some kind of very simple CGI API for passing in the original web page, so that the user can be redirected back again after validation. This may be a silly idea, but it's worth at least thinking about... An alternative to the multiple choice questions is to have an image which only a human can read, and they then have to type the text or number into a form before continuing.

This has already been implemented on some sites, including Go Daddy Software's DNS lookup page (here's an example). Another famous example is My Yahoo!, where you have to read an image while registering. They have a special workaround for blind people, who are instructed to call Yahoo!'s customer care department, and it all happens over the phone. I wonder if the spambots will eventually develop the ability to converse with people over a telephone? Perhaps they will then be talking with a Customer Care bot on the other end. Shudder...

An interesting research example is Captcha, a project run by Carnegie Mellon University school of Computer Science. According to their site, "CAPTCHA stands for 'Completely Automated Public Turing Test to Tell Computers and Humans Apart'. The P for Public means that the code and the data used by a captcha should be publicly available. Thus a program that can generate and grade tests that distinguish humans from computers, but whose code or data are private, is not a captcha." - Very interesting research.

In Summary

Of course, I realize that I could just sit back and relax and let these things try their darndest to get email addresses, because they won't succeed. After all, it's not exactly going to kill my server. But that's not the point. Let me be clear: I REALLY HATE SPAM. I detest the way that the open, co-operative, sharing nature of the internet is being co-opted by these people. They are forcing everyone to be more closed, more fearful of doing anything on the Web. I resent being forced to accept these things to surf my website freely.

So. What will the next step in the evolutionary process be? Will we be forced to just live with these things, these software versions of mosquitoes, as being part of the natural ecology of the internet? Or can there be a technical solution to the spambot pest? Ideas most welcome...

        
        

Update 2: May 1st 2002: The author of the new, scary spambot comes forward

                

The author of the new, scary spambot comes forward

Shortly after the second slashdot article, I got an email from someone who claimed to be the author of the new, improved spambot. Checking the IP address of the sender against my server logs, I was able to confirm that this was indeed the same as the source of the spambot. He explained in some detail that this was just an experimental academic project of his, and it wasn't being done in any kind of commercial context. This is not related to DSurf, PBrowse, QYTRWYTR et al. He was quite embarrassed to have the thing exposed in this way, and he apologised for the inconvenience... I am inclined to believe him, for what it's worth. It's nice to have some openness for a change in this sordid arena... to tell you the truth, I am both gratified and mortified. Gratified that he should have come forward so quickly to assure me that this was an innocent experiment in Web spidering, and mortified that I was perhaps giving the spambot writers a nice little template on how to write their next generation of spambots. D'OH!

I know, some of you will reply, very cynically, that the guy is just trying to cover his ass in the wake of being exposed - and who knows, you may be right. But, I am inclined to believe my intuition, and given the tone of this guy's email and the sheer amount of detail/context he has provided in a very short space of time, I really think he's telling the truth. It's an interesting project, after all, spidering. He has written me quite detailed emails about his project, and I for one am reasonably satisfied that he is for real.

However, regardless of the source of this particular spambot, it doesn't change the basic message - the spambots can (and will) evolve beyond their current rather basic state. The question is, what do we do about it? In an interesting way (to continue the nature analogy), we could look at this particular spambot as being like a harmless version of a virus which we use to vaccinate people against the real thing. It may have done us a service, by demonstrating what can be done, without actually doing it for real. But still, what to do about the real thing?

One solution which has been suggested by a few people is to just block all spiders (including google) from the parts of the website that include stuff like guestbooks and message boards. To me this is an overkill solution that essentially means taking apart the World Wide Web as we know it, fragmenting and segmenting it so that it is no longer a comprehensively connected network of nodes. The vast majority of people these days surf via the search engines, so this is just too drastic for me. In any case, my community site is all about bicycle touring, and the guestbooks and message boards contain lots of tips and suggestions which could be interesting to other people. It's part of the whole reason for having community websites. So the idea of hiding this stuff seems counter productive. However, this approach could work for someone who isn't all that concerned about being found on google - so it's worth at least considering.

Another idea involves generating dynamic URLs within the site, so that the structure is constantly changing - in other words (assuming the website is totally dynamically generated) all the links from page to page have some component that "decays" and becomes invalid after a while. So, you allow google to surf the site, but all the links it gathers are effectively useless. When people find stuff via google they are redirected to the front page of your site, where they navigate manually to find what they were after. Again, to me this seems too extreme. I want people to find stuff on my site using Google! I don't want to throw the baby out with the bathwater...

So, there are lots of good ideas out there, all of which are worth considering, even if you eventually decide that they are not for you. Chances are, people will come up with all kinds of clever tricks to counter these beasts.

And so we come to one of the classic conundrums - do you stay quiet about these things, develop your own little defenses, keep your head down and hope that it doesn't get any worse? Or do you expose the beasts to the cold light of day and examination by many thousands of eyes (and over 20,000 people viewed the article the first time around), thus ensuring that other webmasters are at least aware of the kind of things that are out there, prowling their sites? It's very true that openness allows the spambot writers to hone their tools. But I think that the "open source" model of co-operation in the webmaster community calls for full disclosure and open discussion of these practices. To hope that these tools will somehow not proliferate is wishful thinking - and futile, in my opinion. Better to just assume that these techniques will become common knowledge among those writing spambots. So why not have it become common knowledge among the victims?

This has at least crystallised a thesis that has haunted me for some time now: that our current ability to foil the spambots depends mostly on foibles and flaws in these programs which are actually very simple for the spambot developers to correct, if they put their minds to it. I think it's just that, up to now, they really haven't had to try all that hard. But as websites get more "prickly" and develop defenses against "hostile" spidering, it's inevitable that, so long as it remains profitable to scrape web pages looking for email addresses, the spambots will eventually "evolve" to look just like standard browsers coming onto your site. It then becomes even more urgent that we respond in the only way we can: remove all email addresses and other personal information that can be machine-read from our websites. Use contact forms, image files, JavaScript, whatever it takes - but just ensure that these spambots cannot freely harvest our personal information. In the meantime, the Spambot Trap (and other tools like it - see the links below) can help to stem the rising tide of website abuse! Good luck...

        
        

Update 3: June 29th 2002: A possible way to stop spambots that pretend to be browsers

                

One possible way to stop spambots that pretend to be browsers

The "doomsday scenario" was that eventually spambots would evolve so that they look just like ordinary browsers, whereupon it would become very difficult to distinguish them from real people. However, it turns out that it is in fact possible to corner these beasts too. Here's how it works: We take advantage of the fact that the spambot is pretending to be a browser, by using the User-Agent header.

I assume you are using the spambot trap as described above. If so, then you have hidden links in your HTML which go to the trap directory. Ordinary users should not follow these links, because they are effectively hidden from normal browsers, and good robots will avoid the trap because it is included in robots.txt.

What we do to fool the "stealth" spambots (that pretend to be browsers) is to take advantage of the fact that all browsers (except for Lynx, which hardly anyone uses for actual browsing anymore) have a User-Agent string which begins with "Mozilla". Internet Explorer, Netscape and Opera all follow this convention. None of these browsers will ever request robots.txt in the normal course of their operation; of course, a user could explicitly request the file, but we can safely assume that normal browsers don't need to ask for it. Therefore, we use the User-Agent string to determine what is provided in robots.txt. To non-browsers (e.g. googlebot) we give the full file, including the warnings to avoid the spambot trap. To User-Agents which start with "Mozilla", we dynamically remove those warnings. This won't make any difference to everyday users, but a spambot which is attempting to masquerade as a real person now has no way of avoiding the trap - if it follows any links on the page at all then it will soon end up there.

To accomplish this, I use Embperl to handle robots.txt, so that I can embed conditional code. Here's my new robots.txt:

	User-agent: *
	Disallow: /somedir/
	Disallow: /some_other_dir/
	[$ if $ENV{HTTP_USER_AGENT} !~ /^mozilla/i $]
	Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/
	Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/
	Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/post/
	Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/message/
	Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/email/
	Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/contact/
	[$ endif $]
Of course, it's still true that a spambot could request only pages linked directly from google, and never actually follow any links from our pages. This is obviously much harder to detect, but in my experience the current generation of spambots do follow links, and consequently are vulnerable to this approach. If spambots ever became smarter than this, then we would just have to detect something else, like the fact that they never load images and don't pass cookies. As I said before, it's an arms race. Currently, spambots are still pretty dumb, so this method will work. As long as most websites out there are not employing spambot traps, it's not really worth the trouble for the spambot authors to make them smarter.

Speaking of google, some people have mentioned that a spambot could in theory avoid the trap altogether by simply loading the pages that google has in its cache. This is an interesting point, but also easily solved. When serving pages to google (or, actually, any non-Mozilla user agent), simply remove all email addresses (even obfuscated ones) from the pages you serve. You could replace them with some text, or even a link to your real page. Of course, this requires a dynamic content filter such as mod_perl, Embperl or PHP, but it is a solution that works. The idea is that it really doesn't matter what pages the spambot gets from google, because there are no email addresses there at all, not even obfuscated ones that could potentially be decoded by a clever bot. Thus the problem becomes google's - it's now their web server resources being consumed, not yours. Problem solved. Alternatively, if you'd prefer that Google (and other sites) not cache your pages at all, then you can add the following tag to the HEAD section of your pages:

	<META NAME="ROBOTS" CONTENT="NOARCHIVE">
The Google.com help for webmasters section includes this and other suggestions.
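Going back to the email-stripping idea, here's a rough sketch of the sort of filter I mean. It's illustrative only - the address regex is deliberately crude - and it leaves pages untouched for anything claiming to be a browser:

# Sketch: hide email addresses from anything that does not claim to be a
# browser, so that Google's cache (and any bot reading it) never sees them.
# The address-matching regex is a crude illustration, not a full parser.
use strict;
use warnings;

sub strip_emails_for_bots
{
    my ($html) = @_;

    # Browsers conventionally start their User-Agent with "Mozilla".
    return $html if ($ENV{HTTP_USER_AGENT} || '') =~ /^mozilla/i;

    # Everything else (googlebot etc.) gets the addresses replaced.
    $html =~ s/[\w.+-]+\@[\w.-]+\.\w+/[email address removed]/g;
    return $html;
}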

The spambot trap has been working well on my system for a few months now, and is very stable. Here's the latest from my badhosts_log (see the previous section on badhosts_loop for interpreting the log):

Sat Jun 29 10:05:36 2002:
	Flushing blocks chain:
	Generating blocks list:
		Adding 68.13.151.24	(15)	2002-06-07 20:45:45 to 2002-06-30 14:53:45	YUYURRSYAA
		Adding 24.100.224.110	(14)	2002-06-18 09:44:59 to 2002-06-29 18:48:59	YYCFAWZ
		Adding 24.101.97.21	(14)	2002-06-18 11:06:57 to 2002-06-29 20:10:57	RJJVAS
		Adding 68.4.200.220	(15)	2002-06-20 12:51:45 to 2002-07-13 06:59:45	HIBGMMPBNK
		Adding 12.226.164.219	(15)	2002-06-20 18:43:55 to 2002-07-13 12:51:55	LLJZSPJPBZKG
		Adding 66.176.44.203	(14)	2002-06-24 02:54:16 to 2002-07-05 11:58:16	VVDHDXDHUHR
		Adding 66.185.84.202	(15)	2002-06-24 11:27:31 to 2002-07-17 05:35:31	ZBBGUCDP
		Adding 24.101.39.246	(14)	2002-06-25 13:53:37 to 2002-07-06 22:57:37	WACLEDZYGTU
		Adding 24.120.185.130	(12)	2002-06-26 15:22:52 to 2002-06-29 11:38:52	DBrowse 1.4b
		Adding 208.6.163.83	(12)	2002-06-26 23:29:08 to 2002-06-29 19:45:08	DBrowse 1.4b
		Adding 68.5.169.46	(14)	2002-06-28 17:55:39 to 2002-07-10 02:59:39	ODXNZHX
		Adding 216.78.174.6	(11)	2002-06-28 21:44:14 to 2002-06-30 07:52:14	XUQSXFOVQABF
		Adding 211.101.236.91	(12)	2002-06-28 22:42:20 to 2002-07-01 18:58:20	Mozilla/3.0 (compatible; Indy Library)
		Adding 24.101.56.15	(15)	2002-06-29 00:05:57 to 2002-07-21 18:13:57	VWXTOGR
As you can see, the random-letter-generating spambots seem to be dominating the field now. There are others which appear all the time, two persistent examples of which are "JBH Agent 2.0" and "MFC Foundation Class Library 4.0". You need to keep an eye on the badhosts database to see what is falling into the trap all the time. If it's an easily identified user agent then you can just add that to the bad_agents.txt file, and block it altogether. Once you know something is bad, there's no reason to let it on your site at all - just give it 403, and block it immediately.
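The check itself can be as simple as the following mod_perl (1.x) style sketch. This is only an illustration, not the actual BlockAgent.pm - the file path and its one-substring-per-line format are assumptions:

package BadAgentCheckSketch;
# Sketch: refuse known-bad User-Agents with a 403 Forbidden.
# Not the real BlockAgent.pm; file path and format are assumed.
use strict;
use warnings;
use Apache::Constants qw(OK FORBIDDEN);

my @bad_agents;

sub handler
{
    my $r = shift;

    unless (@bad_agents)
    {
        open my $fh, '<', '/etc/bad_agents.txt' or return OK;
        chomp(@bad_agents = grep { /\S/ } <$fh>);
        close $fh;
    }

    my $agent = $r->header_in('User-Agent') || '';
    foreach my $bad (@bad_agents)
    {
        return FORBIDDEN if index($agent, $bad) >= 0;
    }
    return OK;
}

1;

Hooked in as a PerlAccessHandler, returning FORBIDDEN produces the 403 before the request ever reaches your content handlers.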

I hope all this is useful, if only to serve as an example of how you can block spambots from your website. I'm always open to new ideas, and feedback from people who have implemented the spambot trap on their own systems. Let me know if there's anything I can do to help clarify the methods I have documented here.

        
        

Update 4: October 15th 2005: Dealing with Googlebot 2.1, which pretends to be a browser

                

Dealing with Googlebot 2.1, which pretends to be a browser

Google has started experimenting with a new version of their crawler that pretends to be a browser. It uses something like the following User-Agent string:
	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
This presented a problem for my dynamic robots.txt, because I could no longer count on being able to distinguish between bots and browsers (even bona fide ones) by the User-Agent - remember that the convention has been for browsers to start their User-Agent string with "Mozilla/x.x (Compatible; ...". In the past, that has been a reliable way to distinguish browsers from bots (assuming they are playing by the rules). But now Google obviously wants their bot to be perceived as just another browser. So, the new Googlebot started falling into the spambot trap, because when it requested robots.txt, it was seen as a browser, and so it didn't get the spambot trap directories. So, we needed some way around this - how to allow "good" bots such as googlebot, while still fooling the spambots?

The solution turns out to be that we simply have to be a bit smarter about where the bot is coming from. Pretty much the only way to tell if a bot is really from Google is if it comes from one of Google's subnets. After a little hunting around I managed to get a short list of the IP address ranges that Googlebot is likely to be coming from. Then, it was a question of recognising these ranges, and acting accordingly. So, here is robots.txt.new (using, as before, Embperl to embed Perl code in the file):

[-
   use Net::IP;
   sub is_client_legitimate_bot
   {
	my @ok_ip_ranges = ('64.233.160.0 - 64.233.191.255',  # google
			    '64.68.80.0 - 64.68.87.255',      # google
			    '66.249.64.0 - 66.249.79.255',    # google
			    '216.239.32.0 - 216.239.63.255',  # google
			   );
	my $ok_range_match = 0;
	my $ip = new Net::IP ($ENV{REMOTE_ADDR});
	foreach my $r (@ok_ip_ranges)
	{
		my $range = new Net::IP ($r);
		my $match = $ip->overlaps($range);
		if ($match == $IP_A_IN_B_OVERLAP || $match == $IP_IDENTICAL)
		{
			$ok_range_match = 1;
		}
	}

	return (($ENV{HTTP_USER_AGENT} !~ /^mozilla/i) || $ok_range_match);
   }
-]
User-agent: Googlebot
Disallow: /some/directory
Disallow: /some/other/directory
Disallow: /yet/another/directory
[$ if is_client_legitimate_bot() $]
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/post/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/message/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/email/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/contact/

[$ endif $]

User-agent: *
Disallow: /

What this does is add a new subroutine called is_client_legitimate_bot(). This returns true if either the User-Agent string does not start with 'Mozilla' (in which case it's not pretending to be a browser, and so can be given the spambot trap exclusion directories), or it is coming from within a set of IP address ranges (which I have hard-coded here for simplicity). So if it's the new Googlebot, then it will still be told to stay away from the spambot trap.

You could add other IP ranges as needed, if other search crawlers start using the Mozilla string to make themselves appear to be browsers. Also, if you have more than one website, you would probably want to move the subroutine into a common module.

This seems to be working so far; at least, the new Googlebot is not falling into the trap any more, while the regular spambots are still being caught and shut out.

Finally, this update also includes an addition to the badhosts_loop script: We now toggle between two overlapping blocks chains, in order to avoid gaps during chain updates when the spambots could connect to the webserver. See that section for more details. I also took the opportunity to update the scripts from ipchains to the newer iptables. Enjoy!

        
        

Update 5: October 21st 2006 - The Attack of the Botnet

                
I noticed on Tuesday 17th that crazyguyonabike was catching a large number of spambots falling into the trap. Checking the server logs, it quickly became apparent that this was some kind of botnet. Here are some log snapshots showing what I'm talking about. It looks like a botnet because a) the offending hosts change very quickly, b) the IP addresses mostly resolve to dialup, DSL or cable addresses (i.e. ordinary computers) and c) all the User-Agent signatures are identical (thus suggesting that this is one program, spread over many different computers). So it's probably some kind of virus or worm that spreads through vulnerable Windows computers (at a guess), and then operates without the owner's knowledge.

The bots all come in directly to the guestbooks on the site, suggesting use of a search engine, and they have a predilection for visiting the 'permalink' links. I noticed that they also try to post on the guestbooks, but are stymied by the preview step. Making the forms POST rather than GET makes no difference here, but they don't seem to have (yet) figured out how to go through a preview process. Out of curiosity I decided to capture whatever it was they were posting, and sure enough it turned out to be spam, linking to a website in several different ways (all of which are, I assume, different ways of marking up links on different blogging platforms). Here's an example (I have inserted spaces to prevent the links from working):

http: // dsrrbaafefrfd. host. com
<a href="http: //dsrdbaafefrfd. host. com">desk3</a>
[url=http: //dsrsbaafefrfd. host. com]desk4[/url]
[link=http: //dsrabaafefrfd. host. com]desk6[/link]
The links are different each time; I think they are generating random characters for the subdomain. When you go to anything.host.com, you can see it is some kind of link farm.

When you do a whois on host.com, you get this:

Visit AboutUs.org for more information about host.com
<a href="http: //www .aboutus.org/host.com">AboutUs: host.com</a>

Registration Service Provided By: Web Development LLC
Contact: admin@development.com

Domain name: host.com

Registrant Contact:
   Web Development LLC
   Administration Domain (admin@development.com)
   +1.8662635742
   Fax:
   P.O. Box 570002
   Whitestone, NY 11357
   US

Administrative Contact:
   Web Development LLC
   Administration Domain (admin@development.com)
   +1.8662635742
   Fax:
   P.O. Box 570002
   Whitestone, NY 11357
   US

Technical Contact:
   Web Development LLC
   Administration Domain (admin@development.com)
   +1.8662635742
   Fax:
   P.O. Box 570002
   Whitestone, NY 11357
   US

Status: Locked

Name Servers:
   DMNS1.YAHOO.COM
   DMNS2.YAHOO.COM
   DMNS3.YAHOO.COM

Creation date: 22 Aug 1994 00:00:00
Expiration date: 21 Aug 2008 00:00:00

I tried writing to the email address (mikeb@hpnet.com) that is listed at www.aboutus.org/host.com, but I have no great hope of this doing any good.

Finally, I have added a live badhosts snapshot which gives the current blocks taken directly from the database.

        
        

Update 6: November 30th 2006 - Defending against botnets

                
I have found that the botnet of spambots that has been hitting my site can easily clog up the firewall block list if I'm not careful. I am not sure how many blocks iptables can handle before it starts to affect CPU load, but 200 seems like a scary number of blocks. So I had to try to modify the trap to recognize the botnet and forbid it, rather than block it at the firewall.

Unfortunately I don't think it's a good idea for me to publish what I did here, because it only tips off the spambot authors as to what they have to do in order to circumvent my measures. Suffice to say, it is possible to spot patterns, and accordingly to forbid the bots, without adding everything to the firewall. This way you can stop them from falling in the trap and clogging up your drain en masse, while still stopping them from surfing your site.

The battle continues. The big news lately has been about botnets being used to send spam via email. There has been little or no mention of botnets being used for spambot activity (i.e. trawling websites). This latest spambot botnet seems to be interested in posting spam on guestbooks and forums, rather than looking for email addresses. This presumably has to do with trying to post links to their websites, to boost their Google Pagerank.

It seems to be an inevitable fact of life that for any good, useful, open system, there will appear people who try to exploit it for their own selfish gain, without a care for the consequences of their actions. No matter what you create, you must not only think about how it might be used, but also how it might be abused and turned on its head by the assholes of the world. Sad, but true.

        
        

Update 7: August 2009: Spotting new botnets

                
The spambot trap has been working well on my websites since 2001, and I've been very pleased with the results. The number of hosts in the blocklist has never grown beyond about 100 to 140 or so. For a long while now I have just let it do its thing, but recently I started getting some very odd registrations on my community websites (crazyguyonabike and topicwise). The registrations were characterised by the fact that the first and last names were obviously fake - e.g. "Rama Chandra RamaChandra". They looked like bots, so I investigated to see if I could spot any patterns to the requests. Sure enough, something immediately jumped out at me: In many cases, the referer was set to be identical to the request URI. This is not normal behavior for browsers. So I added a test for this, and have been catching quite a few instances of what seems to be a new type of bot. I checked out the activity on the IP addresses that were caught in this manner, and sure enough in every case they didn't seem to be "normal" users. Often there were multiple requests per second, to pages that were completely unrelated. Obviously bots of some kind. In any case, the spambot trap has started blocking them, which is a Good Thing in my book. Frankly it's shocking how many bad agents there are out there... without a tool like this spambot trap, your website really is at the mercy of all the assholes out there. I like having at least some control over who and what gets to trawl my sites and use up my bandwidth and resources.
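For illustration, the test itself can be tiny - something along these lines (a sketch with deliberately simplistic normalization, not my exact code):

# Sketch: flag a request whose Referer is identical to the URL being
# requested, which is not something a normal browser sends.
use strict;
use warnings;

sub referer_equals_request
{
    my ($host, $uri, $referer) = @_;
    return 0 unless defined $referer && length $referer;

    # Reduce the referer to host + path for a crude comparison.
    (my $ref = $referer) =~ s!^https?://!!i;
    return lc($ref) eq lc($host . $uri);
}

# Example: this combination would be flagged as bot-like.
print referer_equals_request('www.example.com', '/guestbook/123',
                             'http://www.example.com/guestbook/123')
    ? "suspicious\n" : "ok\n";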

Links

        
        

Information Resources, Tips and How-to's

                
The Web Robots Pages - http://www.robotstxt.org/wc/robots.html
Nice informational site for learning all about web spiders and how they use robots.txt.

Spambot Beware - http://www.turnstep.com/Spambot/

Information on how to avoid, detect, and harass spambots.

Archive.org and Alexa.com -- Threats To Your Privacy - http://manly.delconet.com/klahn/privacy/index.html

Very interesting site about one guy's battle to stop his websites being copied wholesale by archive.org, and his personal information being published without permission on alexa.com.

Defending Against Email Harvesters, Leechers, and Web Beacons - http://linux.oldcrank.com/tips/antibot/

Loran T. Hughes gives advice on stopping spambots using a "sand trap" and real-time IP blackhole list similar to the one presented here, but using PHP instead of iptables - so you don't need root access to implement it. On the downside, it does require more processing on the Apache end (since PHP has to determine whether a request comes from a spambot). Very good article.

Progressive IP Blocking - http://vamos-wentworth.org/bottrap/bottrap.html

By Rossz Vamos-Wentworth. Presents a method for implementing a spambot trap similar to the one presented here (with some other wrinkles). Well worth a read.

SpamHelp - http://www.spamhelp.org/

"We intend to take a look at all the different angles related to controlling spam. As a home user, what steps can be taken to limit the volume of spam that I receive? As a network administrator, what steps can I take to control the volume of spam hitting my mail servers and entering the network? What software is available to control spam at the desktop? What server-based antispam software is available? What are governments doing to discourage spamming? The answers to all these questions and more will be found here on SpamHelp. We hope the information provided within these pages will prove useful and practical in your efforts to fight spam. Enjoy your stay and please remember to visit us on a regular basis to keep up with all the latest developments." Site also includes a Harvester Killer that generates 100 random email addresses.

        
        

Organizations and Blocklists

                
SpamCon Foundation - http://www.spamcon.org/
"Protecting email for communications and commerce by fighting against spam."

MAPS - http://www.mail-abuse.org/
Mail Abuse Prevention System. A not-for-profit California organization whose mission is to defend the Internet's e-mail system from abuse by spammers. You can configure your email server to use MAPS to block incoming spam.

The Spamhaus Project - http://www.spamhaus.org/

The Spamhaus Block List (SBL) is a free realtime DNS-based database of IP addresses of verified spammers, spam gangs and spam services. Used by Internet Service Providers and corporate networks worldwide, the SBL currently protects an estimated 140 Million mailboxes from persistent spam sources. If you run a mail server then you can configure it to use the SBL to block incoming spam.

Distributed Server Boycott List - http://www.dsbl.org/

The DSBL lists contain the IP addresses of servers which have relayed special test messages to listme@listme.dsbl.org; this can happen if the server is an open relay, an open proxy or has another vulnerability that allows anybody to deliver email to anywhere, through that server. Note that DSBL itself doesn't do any tests; it simply listens for incoming test messages and lists the server that delivers the message to DSBL's mail server. You can configure your mail server to use this list to block incoming spam.

ORDB.org - http://www.ordb.org/

A non-profit organisation which stores the IP addresses of verified open SMTP relays. These relays are, or are likely to be, used as conduits for sending unsolicited bulk email, also known as spam. By accessing this list, system administrators can choose to accept or deny email exchange with servers at these addresses.

EasyNet - http://abuse.easynet.nl/blackholes.html

An ISP that runs a well-regarded blackhole list.

        
        

Server-side tools

                
Stopping Spam and Malware with Open Source - http://www.brettglass.com/spam/paper.html
Excellent, detailed article by Brett Glass, on using Sendmail and other tools to stop spammers. Lots of great ideas and links to resources.

Robotcop.org - http://www.robotcop.org/

A nice Apache module which does something pretty similar to what I am doing here.

How to Defeat Bad Web Robots With Apache - http://www.leekillough.com/robots.html

Very good site by Lee Killough

Wpoison - http://www.monkeys.com/wpoison/

A free server tool that dynamically generates web pages for spambots that contain lots of junk email addresses that will "poison" the spambot's harvest.

Sugarplum - http://www.devin.com/sugarplum/

Similar to Wpoison, aims to feed the spambot useless email addresses that will corrupt the spambot's database.

Using Apache to stop bad robots - http://www.evolt.org/article/Using_Apache_to_stop_bad_robots/18/15126/index.html

By Daniel Cody, a nice article which describes how to use Apache to block spambots. This first version (see part II, next) uses the User-Agent HTTP header to identify bad bots.

Stopping Spambots II - The Admin Strikes Back - http://www.evolt.org/article/Stopping_Spambots_II_The_Admin_Strikes_Back/18/21392/

By Daniel Cody, the sequel to the previous article. This version uses robots.txt and behavior to block spambots, still using Apache rules.

Deception Toolkit - http://www.all.net/dtk/

Not specific to spam, but a co-operative anti-hacking toolkit for servers. Since a lot of spam comes from hacked servers, worth a look.

LaBrea - http://www.hackbusters.net/LaBrea/

A toolkit which creates a tarpit or, as some have called it, a "sticky honeypot". LaBrea takes over unused IP addresses on a network and creates "virtual machines" that answer to connection attempts. LaBrea answers those connection attempts in a way that causes the machine at the other end to get "stuck", sometimes for a very long time.

Teergruben - http://www.iks-jena.de/mitarb/lutz/usenet/teergrube.en.html

German for 'tar pits', these mail server tools consume spammers' resources by working very, very slowly.

SpamCannibal - http://www.spamcannibal.org/

"SpamCannibal blocks spam at the origination server and can be configured to block DoS attacks. SpamCannibal uses a continually updated database containing the IP addresses of spam or DoS servers and blocks their ability to connect using a TCP/IP tarpit, SpamCannibal's TCP/IP tarpit stops spam by telling the spam server to send very small packets. SpamCannibal then causes the spam server to retry sending over and over - ideally bringing the spam server to a virtual halt for a long time or perhaps indefinitely. SpamCannibal blocks spam at the source by preventing the spam server from delivering the messages from its currently running MTA process. This effectively eliminates the network traffic to your site because the spam never leaves the origination server. This same strategy works equally well when SpamCannibal's tarpit daemon is configured to defend against DoS attacks.". Free toolkit.

Project Honey Pot - http://www.projecthoneypot.org/

"Project Honey Pot is the first and only distributed system for identifying spammers and the spambots they use to scrape addresses from your website. Using the Project Honey Pot system you can install addresses that are custom-tagged to the time and IP address of a visitor to your site. If one of these addresses begins receiving email we not only can tell that the messages are spam, but also the exact moment when the address was harvested and the IP address that gathered it. To participate in Project Honey Pot, webmasters need only install the Project Honey Pot software somewhere on their website. We handle the rest - automatically distributing addresses and receiving the mail they generate. As a result, we anticipate installing Project Honey Pot should not increase the traffic or load to your website. We collate, process, and share the data generated by your site with you. We also work with law enforcement authorities to track down and prosecute spammers. Harvesting email addresses from websites is illegal under several anti-spam laws, and the data resulting from Project Honey Pot is critical for finding those breaking the law. Additionally, we will periodically collate the email messages we receive and share the resulting corpus with anti-spam developers and researchers. The data participants in Project Honey Pot will help to build the next generation of anti-spam software. Project Honey Pot was created by Unspam, LLC - a Chicago-based anti-spam company with the singular mission of helping design and enforce effective anti-spam laws."

mod_spambot - http://spambot.sourceforge.net/

"... an Apache plugin which monitors the data being downloaded from a server. When the number of requests for a client exceeds a preset level no more downloads are allowed for a preset time. When this happens the client received a tailored message informing them of what has happend. Many of the features can be tailored to the needs of the webmaster to help to prevent false positives and to customise the definition of a client to be blacklisted."

Bot-trap - A Bad Web-Robot Blocker - http://www.danielwebb.us/software/bot-trap/

Daniel Webb writes: "I have a small PHP/Apache package that will auto-ban all bots that ignore robots.txt (and email you about it if you want), with the option to unban if a person types in the password. I have so far not had a single harvester get the email on my contact page, which is not obfuscated in any way."

Protect your web server from bad robots - http://www.rubyrobot.org/article/protect-your-web-server-from-spambots

Another take on the Spambot Trap concept, written in Ruby.

        
        

Email obfuscation tools

                
JavaScript email encryptor - http://www.jracademy.com/~jtucek/eencrypt.html
Thanks to Joe Tucek for the link.

EScrambler - http://innerpeace.org/escrambler.shtml

A client-side JavaScript tool

Alicorna email obfuscator - http://alicorna.com/obfuscator.html

A free CGI tool for easily obfuscating email addresses for inclusion in a web page. It randomly turns letters in the address into HTML entities (at least, in part of the address; parts are fully translated into HTML entities), and has recently been updated to work across browsers and email programs.

www.gazingus.org - http://www.gazingus.org/

Has a free JavaScript based encoder into Unicode.

Railhead Design - http://www.railheaddesign.com/

Tools like Spamstopper: "An easy and effective way for web designers to keep email addresses hidden from Email Harvesting Spiders. Mac OS X 10.1 and OS 9 savvy."

Fantomaster Email obfuscator - http://fantomaster.com/fantomasSuite/mailShield/famshieldsv-e.cgi

Converts your email address into Unicode.

Avoiding the Spambots: An Email Encoder - http://www.metaprog.com/samples/encoder.htm

Joseph Pelrine shares some methods for encoding email addresses using JavaScript, HTML and URL encoding.

mungeMaster Online - http://www.closetnoc.com/mungemaster/mungemaster.pl

Steve Preston created this free tool to obfuscate any email address (and accompanying text) that you enter on the web form he provides. The Perl script then returns a very obfuscated JavaScript version of the address, which will probably fool most (if not all) spambots. This depends on JavaScript being enabled in the user's browser.

Email Obfuscator: A Tool For Webpages - http://sourceforge.net/projects/obfuscatortool

"A stand-alone JAVA Email obfuscation tool for HTML pages. This tool can be used to process single pages, or whole webpage bundles, to protect email addresses occuring in them from spam bots and other harvesters, by hiding them using various methods." by Sebastian Vermehren.
        
        

Email filters

                
SpamAssassin - http://sourceforge.net/projects/spamassassin/
SpamAssassin is a mail filter to identify spam using text analysis. Using its rule base, it uses a wide range of heuristic tests on mail headers and body text, to identify "spam", or unsolicited commercial email.

TMDA - http://software.libertine.org/tmda/

"An OSI certified software application designed to significantly reduce the amount of SPAM/UCE (junk-mail) you receive. TMDA combines a 'whitelist' (for known/trusted senders), a 'blacklist' (for undesired senders), and a cryptographically enhanced confirmation system (for unknown, but legitimate senders). TMDA strives to be more effectual, yet less time-consuming than traditional filters."

Vipul's Razor - http://razor.sourceforge.net/

An open source spam filtering network. From the website: "Vipul's Razor is a distributed, collaborative, spam detection and filtering network. Razor establishes a distributed and constantly updating catalogue of spam in propagation. This catalogue is used by clients to filter out known spam. On receiving a spam, a Razor Reporting Agent (run by an end-user or a troll box) calculates and submits a 20-character unique identification of the spam (a SHA Digest) to its closest Razor Catalogue Server. The Catalogue Server echos this signature to other trusted servers after storing it in its database. Prior to manual processing or transport-level reception, Razor Filtering Agents (end-users and MTAs) check their incoming mail against a Catalogue Server and filter out or deny transport in case of a signature match. Catalogued spam, once identified and reported by a Reporting Agent, can be blocked out by the rest of the Filtering Agents on the network."

        
        

Anti-Spam email services

                
The Spam Gourmet - http://www.spamgourmet.com/
Free, disposable email addresses which expire after a given number of messages, after which messages are "consumed with relish" by this email service.

Spamex - http://www.spamex.com/

A commercial disposable email address service.

Sneakemail - http://sneakemail.com/

A free service that you can use to generate disposable email addresses.

Mailshell - http://www.mailshell.com/

Paid service offering disposable email addresses and "intelligent" filters.

Spam Motel - http://www.spammotel.com/

Free service.

Spamcop - http://spamcop.net/

A paid service offering spam reporting, filtered email accounts and DNS blacklisting. There is, however, some controversy about this method (blacklisting) of preventing spam that you should perhaps be aware of. They also operate a free blacklist.

MailMoat - http://www.mailmoat.com/

"Virtually endless supply of disposable email addresses (we call them aliases) for you to pass out to anyone you want. All of them will forward to your existing email account, and YOU decide just how much email you want to be able to receive on each. If you're getting overwhelmed by junk mail on an alias, just expire it; MailMoat will reject any future email sent to that address with an "Invalid Email Address" error message - and even if it gets sold and resold a hundred times, the spammers still don't have your REAL email address and their database will be polluted with email addresses that are no good." (thanks to Don Williams for this link)

        
        

Research projects

                
Captcha - http://www.captcha.net/
A project run by Carnegie Mellon University school of Computer Science. According to their site, "CAPTCHA stands for 'Completely Automated Public Turing Test to Tell Computers and Humans Apart'. The P for Public means that the code and the data used by a captcha should be publicly available. Thus a program that can generate and grade tests that distinguish humans from computers, but whose code or data are private, is not a captcha."

        
        

Novel ideas

                
Curbside Recording - http://www.curbside-recording.com/message.html
Has a special page which includes a contract requiring payment for each spam received. Nice idea, though I don't believe he's had any takers yet...


"Stopping Spambots: A Spambot Trap" Copyright © 2002-2014 By Neil Gunton. All rights reserved.