Neil Gunton
Categories: Perl utilities, Spam prevention
Permalink: http://www.neilgunton.com/doc/spambot_trap
Copyright © 2002-2021 By Neil Gunton
Last update: Friday May 14, 2010 10:32 (US/Pacific) (edited Sat 25 Feb 2017 22:11 (US/Pacific))
The Spambot Trap
Introduction
This document describes my experiences with spambots on my websites,
and the techniques I have developed to stop them dead. I assume the
reader has basic familiarity with Linux, Apache, mod_perl, Perl, MySQL and firewall rules using iptables -
each of these topics could fill a book, so I won't talk about
installation or basic configuration. I will, however, provide full
scripts and instructions for using them in this context. If you'd
like some basic pointers on getting set up with these tools, then
you could take a look at my short series of three Linux Network Howto articles.
Updates:
2002-04-12: I've had a lot of feedback since the original Slashdot article - thanks to all those people who have written with some really great ideas for alternative ways to foil the spambots. I've tried to incorporate some of these into the document, and also to give credit to people by name where it's due. Thanks again!
2002-04-26: There's a new update on how the spambots seem to be "evolving" to avoid traps.
2002-05-01: Another update: The author of the new, scary spambot comes forward, and a few new links.
2002-06-29: One possible way to foil spambots that pretend to be browsers.
2002-09-05: I've added a new log snapshots appendix which shows snapshots of the badhosts_loop logs at various moments in time.
2005-10-15: New version of the badhosts_loop script, fixing a problem where spambots could connect to the website during the brief periods when the iptables chains are being regenerated. Also moved from the older ipchains to the newer iptables. Finally, added Update 4 on how to deal with Google's new bot, which pretends to be a browser.
The Problem: Spambots Ate My Website
I have a website, crazyguyonabike.com, which has bicycle tour journals, message boards and guestbooks. I started noticing around the end of 2001 that the site was getting hit a lot by spambots. You can spot this sort of activity by looking for very rapid surfing, strange request patterns, and non-browser User-Agents.

After looking at the server logs, I realized a couple of things. Firstly, the spambots came from many different IP addresses, so this precluded the simple option of adding the source IP to my firewall blocks list. Secondly, there seemed to be a common behavior between the bots - even if this was the first visit from a particular IP address (or even a particular network, so no chance of just being a different proxy) they would come straight into the middle of my website, at a specific page rather than the root. This means that the spambots obviously had some kind of database of pages, which had presumably been built up from previous visits (before I'd noticed the activity), and this database was being shared between a large number of different hosts, each of which was apparently running the same software. Of course, this "database" could also simply be search engine results - apparently some spambots actually use search engines such as Google and Yahoo! to look for promising pages.

Another distinctive behavior was that the spambots followed only those links containing keywords that would seem promising if you're looking for email addresses: "guestbook", "journal", "message", "post" and so on. On each of the pages in my site there were many other links in the navbars, but only links with these keywords were being followed. Also, robots.txt was never even being read, let alone followed. Moreover, the bot would come in, scan pages rapidly for maybe a few seconds, and then stop for a while. So it was obviously making at least some attempt to circumvent blocks based on frequency/quantity of requests. This was very annoying.
For one thing, these things were picking off email addresses from my website (at that point, I was letting people who posted on my message boards decide for themselves whether they wanted their email addresses to be visible or not). But quite apart from that, it was taking up resources, and was just plain rude. I hate spam. I resent my webserver having to play host to people whose obvious goal is to cynically exploit the co-operative protocols of the internet to their own selfish, antisocial gain. So, I decided to do something about it. The first thing I did was to look at the User-Agent fields which were being used by the bots. There were a variety, including variations on the following (note: these particular user agents have long since disappeared - check the log snapshots for recent logs of the user agents that current spambots are using):
I searched the internet for references to these strings, but all I found was a slew of website statistics analysis logs. This meant that these particular spambots obviously got around. It was also discouraging, because there was no mention anywhere of what these things actually were. I was surprised that there seemed to be no discussion whatsoever of something that seemed to be pandemic. Then I found a couple of other websites with guestbooks that had actually been defiled by these spambots: (if you follow these links and you don't see a lot of empty messages left by the above user agents, then that means the webmaster of the site has finally found a way to stop it, so good for them...)
I reckon the spambots didn't really intend to leave empty messages. They just tend to want to follow links with the keyword 'post'. So if the guestbook posting form has no preview or confirmation page, then the spambot would leave a message simply by following this link! My guestbooks and message boards have a preview page, which is probably why I hadn't had any of this. Anyway, I started thinking about what kind of program this thing was. First of all, it comes from all kinds of different IP addresses. I couldn't quite believe that this many different IP addresses were all intentionally using the same software, of which I could find absolutely no mention anywhere on the Web. This made me think it might be some kind of virus/trojan/worm or whatever that silently installed itself on people's computers, and then used the CPU and bandwidth to surf the Web without the owner being aware of it. I thought that if this was the case, then it must be sending the results somewhere - and if we could find out where, then we could go about shutting the operation down. The only website which I have found associated with any of the IP addresses which have hit me is http://cheap-shells.com/. Looking at the site, this seems to be a company that supplies, uh, cheap shells. Four separate IP addresses hosted by this outfit seemed to have been sources of spambot visits at different times. (Update: Since mentioning this in the original version of the article, the people who run cheap-shells.com have assured me that this was an errant customer (who's since been dropped), running some kind of freely available spider, rather than being a virus. That's big progress! So, it seems that this thing is not a virus or trojan after all... no word yet however on the real identity of the spider program...) Regardless of what the thing is, I have had no luck at all in getting any help from the sysadmins at ISP's I have contacted. 
A typical exchange was the one with a guy at Cox internet, which was where a persistent offending IP address originated. He just couldn't be bothered, and eventually told me that spidering was not against the law, or their terms of service. I asked whether actions which were blatantly geared toward the generation of spam were against their terms of use, but he never replied to that. I had no more luck anywhere else: nobody had heard of this thing. I even sent an email to CERT, but got no response. If you have any ideas, I'd love to hear them. So anyway, I turned instead to thinking about how I could erase these pests from my life as much as possible. This document is about my quest to stop spambots (not just this one, but ALL spambots) from abusing my website. Hopefully it will be useful to you.
Overview of the Spambot Trap
There are three main parts to the technique which I outline here: blocking known bad robots outright based on their User-Agent header, luring unknown spambots into a hidden trap directory via links that real users never see, and blocking trapped hosts at the firewall for a period that doubles with each repeat offense.
There are various components to the Spambot Trap, including the badhosts_loop Perl script, the BlockAgent.pm module, iptables config, MySQL database, httpd.conf, robots.txt, and your HTML files. These are all covered in the sections below.
Banishing 'mailto:'
The first and most urgent thing you need to do is to get email
addresses off your website altogether. This means, unfortunately,
banishing the venerable mailto: link. It's a real shame that perfectly
good mechanisms should be removed because of abuse, but that's just
the way the world is these days. You need to be defensive, and assume
that the spammers will try to take advantage of your resources as much
as possible.
It's an arms race
The important thing that you need to realize is that no matter what blocks we put in place, this game is an arms race. Eventually the spambot writers will develop smarter bots which circumvent our techniques. Therefore you want to have a failsafe, which will prevent email addresses from getting into the hands of the spambot even if all else fails. The only real way to do that is to completely remove all email addresses from your website.
Contact forms
You should replace the mailto: links with links to a special form where people can type their name, email address and message. A CGI can then deliver the email, and your email address never has to be disclosed. There are a number of different mailer scripts out there - just be careful to check for vulnerabilities which could allow malicious users to use the form to send email to third parties (i.e. spam, ironically enough) using your server. The formmail script is popular, but it apparently still has vulnerabilities (thanks to Christopher Fisk for pointing this out, and Nicholas Clark for pointing me to a good replacement for formmail). Another alternative is Soupermail (thanks to Emery Jeffreys). The Embperl package has a simple MailFormTo command to send an email from a form.
Since I have seen guestbooks out there which have been extensively defiled by spambots, I would add that you should have a preview screen on your contact forms. This will ensure that an email doesn't get fired off simply by a spambot following the 'post' or 'contact' link (which it will likely try to do).
Alternatives to totally banishing mailto:
There are alternatives to completely removing email addresses, but they all depend on the stupidity of the spambot, and so could be compromised by a new generation of pest. These include:
As you can see, there are many ways you can make email addresses harder for spambots to recognise. It all depends on your own expertise and preferences. Still, in my opinion the only totally safe way to ensure spambots can't harvest email addresses is to totally remove them from your website! Can't get around that one, no matter how smart they get... Next, we look at setting up the tools required to make our spambot trap work.
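As one illustration of the obfuscation approach (a generic sketch, not a script from this article - the address shown is hypothetical), email addresses can be written as decimal HTML character entities. Browsers render the entities normally, but a harvester doing a naive regex match for name@host sees nothing that looks like an address. Assuming a POSIX shell with od and awk available:

```shell
# Encode each byte of an email address as a decimal HTML entity.
# '&#117;&#115;...' renders as 'us...' in a browser.
addr='user@example.com'   # hypothetical address, for illustration only
out=$(printf '%s' "$addr" | od -An -tu1 | tr -s ' ' '\n' | awk 'NF{printf "&#%d;", $1}')
printf '<a href="mailto:%s">%s</a>\n' "$out" "$out"
```

Remember the point made above, though: this only raises the bar. A bot that decodes entities (trivial to add) defeats it, which is why removal is the only failsafe.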
MySQL
To implement our spambot trap, we first need to set up a MySQL database, where we store records of the hosts which are to be blocked. This doesn't have to be MySQL, but I use it because it's extremely fast, and very appropriate for this kind of application. You need to create a new database, called 'badhosts'. You then create a table, again called 'badhosts', with the following structure:
You could use the dump provided above to load directly into your database:

shell> mysqladmin create badhosts
shell> mysql badhosts < badhosts.dump

That's about it! The fields which are marked as 'indexed' are the only ones which need indexes, because they are searched on to see if a particular IP address has been previously blocked, and also to see which blocks should be removed because they've expired. If you have access privileges set on your MySQL databases, then you need to allow the Apache user (usually 'nobody') access. The other script that will require access is badhosts_loop, which runs as root. Next, we look at the script that populates this database.
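The dump itself isn't reproduced here. A plausible minimal schema, inferred from the fields the scripts are described as using (IP address, expiry power, block window, reason and user agent - the exact column names and types in the original dump may differ), would look something like:

```sql
CREATE TABLE badhosts (
    ip         VARCHAR(18)  NOT NULL,              -- dotted quad, optionally with /mask
    power      INT          NOT NULL DEFAULT 0,    -- expiry = 2^power minutes
    start_time DATETIME     NOT NULL,
    end_time   DATETIME     NOT NULL,              -- searched for expired blocks
    reason     VARCHAR(16)  NOT NULL,              -- 'agent', 'trap' or 'manual'
    agent      VARCHAR(255),
    INDEX (ip),                                    -- "has this IP been blocked before?"
    INDEX (end_time)                               -- "which blocks have expired?"
);
```

The two indexes correspond to the two lookups described above: checking whether an incoming IP was previously blocked, and pruning expired blocks.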
BlockAgent.pm
Note: This module currently only works with Apache 1.3.x, not the newer Apache 2.x. This is because the Apache team changed the module API for Version 2, and I haven't had the time to figure out exactly what needs to be done to make it work with both versions (if that's even possible). If anyone has any clues on this or wants to take a stab at it, then let me know... meanwhile, apologies to those of you running Apache 2.0.

The BlockAgent.pm Apache/mod_perl module is taken from the excellent book "Writing Apache Modules with Perl and C" by Lincoln Stein & Doug MacEachern (O'Reilly). This script basically acts as an Apache authentication module which checks the HTTP User-Agent header against a list of known bad agents. If there's a match, then a 403 'Forbidden' code is returned. The script compiles and caches a list of subroutines for doing the matches, and automatically detects when the 'bad_agents.txt' file has changed. I have found that it has no noticeable impact on the performance of the webserver. This script is useful in the case where you know for certain that a given User-Agent is bad; there's no point in letting it go anywhere on your site, so it's a good first line of defense. We'll cover how to add this module to your website a little later, along with the rest of the configuration settings in the section on httpd.conf.

Of course, one of the first arguments you'll see with regard to this method of blocking spambots is that it's easy to circumvent, by simply passing in a User-Agent string which is identical to the major browsers out there. This is perfectly true, but don't ask me why the spambot writers haven't done this - maybe it's a question of pride or ego, they want to see their baby out there on record in Web server logs. I honestly don't know. The main point is that at present, the User-Agent header CAN be used very effectively to block most bad agents.
But, I have added more features so that we can also block agents which look ok, but behave badly by going somewhere they shouldn't - the Spambot Trap. More on that soon. You'll notice that the bad_agents.txt file which I have supplied here is very comprehensive. A good strategy here is probably to save the full version somewhere (perhaps as bad_agents.txt.all), and just keep the ones you actually encounter in the bad_agents.txt file. Then you keep the list shorter, and more relevant to what actually hits you. For example, my bad_agents.txt file currently has the following lines in it, because these are the spambots that I see most frequently (current as of 2003-06-22 - I'll update it occasionally):

^[a-zA-Z0-9]+$
^Baiduspider
^Franklin Locator
^IUFW Web
^Mac Finder
^Missigua Locate
^Missigua Locator
^Missouri College Browse
^Program Shareware
^Ram Finder
^Under the Rainbow
^WebFilter
^WEP Search
^Xenu Link Sleuth
^Zeus

First of all, I know that some people will probably complain that some of these agents are not technically 'spambots'. I have blocked them based on the behavior which I have actually observed on my sites. Some agents will often ignore robots.txt, and moreover will demand many, many pages extremely quickly. This is not a well-behaved robot, so it's not welcome on my site. Some so-called legitimate agents have so frequently fallen into the trap that I decided to treat them as hostile (or, at the very least, incompetent). You can make your own decisions based on the behavior which you observe. For me, I tend to only add agents to my blacklist when I have seen a lot of bad activity over a period of more than a week or so. You'll also notice from this that BlockAgent.pm is very flexible, being able to take full advantage of the excellent regular expression capabilities of Perl. This means you can capture a lot of different agents with just one line.
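Before adding a new pattern to bad_agents.txt, it helps to check what it will and won't match. The patterns above happen to be valid extended regular expressions too, so grep can stand in for the module's matching logic as a quick testing convenience (BlockAgent.pm itself does the matching in Perl inside Apache - this is only a sketch of the same logic):

```shell
# Build a small pattern file in the bad_agents.txt format
# (one regex per line; grep -E understands these particular ones).
cat > /tmp/bad_agents.txt <<'EOF'
^[a-zA-Z0-9]+$
^Baiduspider
^Zeus
EOF

check_agent() {
    # Exit status 0 means "this User-Agent would be blocked"
    printf '%s\n' "$1" | grep -qEf /tmp/bad_agents.txt
}

check_agent 'FHASFJDDJKHG' && echo 'FHASFJDDJKHG: blocked'
check_agent 'Mozilla/5.0 (X11; Linux x86_64)' || echo 'Mozilla: allowed'
```

Note that the first pattern matches any agent string consisting entirely of letters and digits, which is what catches the random-capital-letters bot discussed below.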
For example, the very first line catches all the variations of the agent which passes in random strings of capital letters, e.g. FHASFJDDJKHG or UYTWHJVJ. The spambot obviously thinks it's being pretty smart by looking different each time, but by using an easily identifiable pattern, it shoots itself in the foot. Hah. The original version of the BlockAgent.pm script is well explained in the O'Reilly book, but I've added an extra hook that checks to see whether the client is accessing any of the spambot trap directories. If it is, then we add an entry to the MySQL database (you could use another relational database if you want, as long as it's accessible from Perl DBI). The first time an IP address is blocked, an expiry of one minute is set. If the same host subsequently comes in and falls into the trap again, then the expiry time is doubled. And so on. This way, the block gets longer and longer, in proportion to how persistently the spambot revisits our website. Also, the initial block expires very quickly, so that if the spambot is coming from a large network such as AOL (which works through multiple IP address proxies), you won't be blocking everyone on that network. The final benefit of quick expiry is that you probably won't build up a very large list of blocked IP addresses in your iptables list, thus saving resources. The expiry time is a hint for the minimum amount of time the block should exist, rather than being an exact measure. This is because the badhosts_loop script (see below) has a cycle time of 10 minutes, which is the amount of time the script waits before regenerating the blocks (if no other hosts get blocked). This means that a block which has an expiry of one minute could potentially be in place for up to ten minutes. This is actually a benefit, since ten minutes is a pretty good period of time to block a host in the initial case. 
If the host keeps re-offending, then its expiry time will gradually increase, the number of minutes being doubled each time - 2, 4, 8, 16, 32 and so on. Once an IP address is blocked, the spambot can't even connect to our web server, since the iptables rule silently drops its packets. This means that no acknowledgement is given to any packets coming in from the badhost, and as far as they know, our server has just gone away. Hopefully, after this happens for long enough, our server will be taken off the spambot's "visit" list. Another nice little side-effect of this is that the spambot will probably have to wait for a while before giving up each connection attempt. Anything that makes them waste more time is ok by me!

BlockAgent.pm notifies the badhosts_loop script that something has happened by touching a file called /tmp/badhost.new. The badhosts_loop script checks this file every few seconds and if it has changed then it knows that a new record's been added to the database, and it needs to re-generate the blocks list. The BlockAgent.pm script is our alarm system. It's what tells us that something happened. In order to act on this information, we need to be able to add rules to the iptables firewall. We'll cover this next.
iptables
The iptables module is a very nice way of providing a good level of basic network security to your server. It's a very easy way to configure who can and cannot have access to your machine, at the most basic level - network packets, ports and protocols. The example iptables config file given here is complete, but you should customize it to your own needs in terms of what services you need to be visible to the world. Remember: If you are not really using a service, then turn it off, and block that port. The best security policy is one where you say "Block everything by default" and then only explicitly allow those services that you know you need. This is much better than allowing everything by default and then attempting to figure out everything that you should block - you'll always miss something. Incidentally, be very careful experimenting with iptables, especially if you are working on a remotely hosted server which you don't have physical access to. It's quite easy to block yourself completely, in which case you'll find it very difficult to login to fix the problem! The bit of this script which is most relevant to the spambot trap is that we create two chains, called 'blocks0' and 'blocks1'. These are our own custom chains, which we can then add rules to. The badhosts_loop script will flush these chains and build them back up whenever a spambot falls in your trap. Once the spambot's IP address is on the blocks list, that host cannot connect to your server at all (at least, via HTTP - other protocols such as ssh are left open for safety, in case you manage to get yourself blocked). The iptables.conf file is an executable script which should be run at bootup along with all the other services in your /etc/init.d/ or /etc/rc.d/ directory. You'll need to consult with your Linux distribution documentation to see how to set this up (they all seem to do it a little differently - for example, Debian uses update-rc.d, RedHat uses chkconfig, and so on). 
You only need to run iptables.conf once after every boot, to set up the blocks chains. You could also just add the following line to your /etc/rc.local file (or equivalent), so that the script is automatically run on reboot:

/path/to/iptables.conf

The reason why we need two blocks chains (blocks0 and blocks1) is covered in the next section on badhosts_loop - this is the script that actually adds the firewall rules.
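The chain setup in iptables.conf boils down to a few commands. This is a sketch only - the real script also carries your ordinary service rules and default policy; only the chain names (blocks0, blocks1) are taken from this article:

```shell
# Create the two custom chains used round-robin by badhosts_loop,
# and hook them into INPUT ahead of the normal accept rules.
iptables -N blocks0
iptables -N blocks1
iptables -I INPUT -j blocks0
iptables -I INPUT -j blocks1

# badhosts_loop later populates the active chain with rules like:
#   iptables -A blocks0 -s 10.0.0.1 -p tcp --dport 80 -j DROP
# Restricting the rule to port 80 keeps ssh reachable even if you
# manage to get your own address blocked.
```

These commands require root, which is why badhosts_loop has to run as root too.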
badhosts_loop
You run this script in the background, as root. It has to be run as root, because only root has the ability to add rules to the firewall. The script spends most of its time sleeping. It wakes up every five seconds or so and does a quick check on /tmp/badhost.new. If this file has been changed since the last time it looked, then it goes and re-generates the firewall blocks list with all the current (non-expired) blocks. If nothing else happens, then the script will automatically cycle every ten minutes (by default - you could adjust this for your situation, by changing the value of the $seconds_in_cycle variable), to ensure that blocks really do expire even if there is no new activity. The first version of this script used a single blocks chain, and it was discovered that spambots could connect to the webserver during the brief period when the blocks chain had been flushed and then was being rebuilt. The new version uses two chains (blocks0 and blocks1) round-robin style. For example, if it used blocks0 the previous time the rules were updated, then this time it will use blocks1, and so on. First we add the currently active block rules to the current chain, and then flush the previously used chain. This ensures that there is no period during which there are no blocks rules. The new method means that there is always at least one chain that has active block rules, and during the rebuild process, these two chains will briefly overlap before the old one is flushed. This ensures seamless blocking, with no gaps for the spambots to sneak through. You need to add badhosts_loop to your startup scripts, so that it is started every boot. As with the iptables.conf script, you will need to consult with your Linux distribution documentation to see how to set this up (they all seem to do it a little differently - for example, Debian uses update-rc.d, RedHat uses chkconfig, and so on). 
You could also just add the following line to your /etc/rc.local file (or equivalent), so that the script is automatically started up on reboot:

/path/to/badhosts_loop --loop &

This will start the script looping in the background. The script automatically checks to see if it is already running, by attempting to lock /var/lock/badhosts_loop.lock. If the file is already locked then the script will exit with an error message. If you want to just run the script once, without looping, then just omit the '--loop' option. This can be useful for testing. Also, if you want to just prod the script to make it update the iptables blocks, you just need to touch the alert file, /tmp/badhost.new. Logging is done to /var/log/badhosts_loop.log by default. Every time the script generates the blocks list, it writes a list of all the blocks to the log. This is a good place to monitor if you're interested in what hosts are being blocked. You can see examples of the log output in the Spambot Trap Log Snapshots article. The log shows the IP address which is being added, then, in parentheses, the power of 2 which is being used to calculate the expiration time in minutes. For example, (3) means an expiration of 2^3 = 8 minutes. The power is increased by one every time the same IP is blocked. Next we have the start and end dates/times for this block, the general reason for the block (agent, trap or manual - see below) and finally the name of the User-Agent which committed the crime. This can be useful for quickly seeing whether you need to add a new one to the bad_agents.txt file.
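The expiry arithmetic in the log is easy to check by hand: the number in parentheses is the power, and the block length in minutes is two raised to that power. In shell:

```shell
# A log entry of (3) means the host has re-offended:
# expiry = 2^3 = 8 minutes. (1 << power computes 2^power.)
for power in 0 1 2 3 4 5; do
    echo "power=$power expiry=$(( 1 << power ))m"
done
```

The power=0 case is the one-minute initial block described above; each repeat offence bumps the power by one, doubling the minimum block time.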
The 'reason' for the block has three possibilities: Either the bot was
recognised as a bad agent immediately through its User-Agent HTTP
header ('agent'), or it fell into the trap directory ('trap') or else
it was added manually ('manual'). The manual option is there so that
you can use the block_ip script to add bad
hosts yourself explicitly, from the command line. For example, if you
notice a lot of blocks occurring from the same subnet, then you may
want to block that subnet completely for efficiency, rather than
having each individual IP address needlessly clogging up your
filters. The syntax is:

block_ip '207.150.173.0/24' '20' 'Abusive subnet'

The script will automatically touch the /tmp/badhost.new file, so that the block will be enabled immediately (or within seconds). The badhosts_loop script is a pretty stable program that should just sit there and chug quietly, not taking up much in the way of resources. Checking for a file being changed every five seconds is not a big deal in Unix, so you shouldn't even notice it. Now you have to create the trap itself - the spambot_trap directory.
spambot_trap/ Directory
You can create this directory anywhere on your server. We will create an alias in httpd.conf to access it. I put mine in /www/spambot_trap/. The point is, this doesn't have to be a real directory under your webserver directory root. If you use the Alias directive, then multiple websites can access the same spambot_trap directory, potentially through different aliases. You can use the sample tarball as a starting point; it has subdirectories and links which the spambots I have seen find irresistible. You should create your own image file for the unblock_email.gif file, to have a valid email address of your own.

The spambot_trap and spambot_trap/guestbook/ directories are not used directly to spring the trap. This is because I wanted to have a warning level, a lead-in, where real users would be able to realize they are getting into dangerous waters and could then back out. You're going to be placing hard-to-click links on your web pages which lead into the real trap, and there's always a chance that a real user will accidentally click on one of these. So, some of the links will point into the warning level. I have made a GIF image which contains a warning text. Why an image? Mainly because spambots can't understand images, and I didn't want to give big clues like "WARNING!!! DO NOT ENTER" in plain text. So, the user sees the warning, the spambots don't. If the spambot proceeds into any of the subdirectories (email, contact, post, message), then the trap is sprung and the host is blocked.

You also need to try to stop good spiders (e.g. Google) from falling into the spambot trap and being blocked. To do this, we utilize the robots.txt file.
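A minimal httpd.conf fragment for exposing the shared trap directory under the decoy name might look like the following. The '/squirrel' alias and the /www/spambot_trap/ path are just the examples used in this article; the exact directives in the original configuration may differ (Apache 1.3 access-control syntax shown, matching the Apache version this setup targets):

```apache
# Map the decoy URL onto the shared trap directory.
# Each virtual host can carry its own Alias to the same directory,
# potentially under a different decoy name.
Alias /squirrel/ /www/spambot_trap/
<Directory /www/spambot_trap/>
    Order allow,deny
    Allow from all
</Directory>
```

Because the trap lives outside the document root, renaming the decoy later means changing only this Alias, not moving files around.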
robots.txt
This should allow good robots (such as Google) to surf your site without falling into the spambot trap. Most bad spambots don't even check the robots.txt file, so this is mainly for protection of the good bots. You'll see that we list a bunch of directories under '/squirrel'. This could be anything; you'll set an alias later in httpd.conf. In fact, you may even want this to be dynamically generated (see later, under Embperl), so that you can quickly change the name of the spambot trap directory if the spambots adapt and start avoiding it. At present, a static setup should work just fine, however. Next, we need to look at the bait - links within your HTML files which lead the spambot into the trap.
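A robots.txt along these lines keeps well-behaved crawlers out of the trap. The directory names below mirror the keyword-bearing subdirectories described in this article; adjust them to your own layout (strictly, disallowing /squirrel/ alone covers everything beneath it, but listing the subdirectories explicitly matches the setup described here):

```
User-agent: *
Disallow: /squirrel/
Disallow: /squirrel/guestbook/
Disallow: /squirrel/guestbook/email/
Disallow: /squirrel/guestbook/contact/
Disallow: /squirrel/guestbook/post/
Disallow: /squirrel/guestbook/message/
```

Any agent that then requests a URL under /squirrel/ has, by definition, ignored robots.txt - which is exactly the behavior the trap punishes.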
Your HTML Files
Here's an example of HTML with links into the spambot trap:
<HTML>
<BODY BGCOLOR="beige">
<A HREF="/squirrel/guestbook/message/"></A>
<A HREF="/squirrel/guestbook/post/"><IMG SRC="/images/guestbook.gif"
WIDTH=1 HEIGHT=1 BORDER=0 ALT="Warning text"></A>
Body of the page here
<TABLE WIDTH=100%>
<TR>
<TD ALIGN=RIGHT>
<A HREF="/squirrel/guestbook/">
<SMALL><FONT COLOR="beige">guestbook</FONT></SMALL>
</A>
</TD>
</TR>
</TABLE>
</BODY>
</HTML>
Spambots tend to be stupid. You'd think they would check for empty
links (which don't show up in a real browser), but they don't seem
to. Sure, they may get smarter, but meantime you might as well pick
the low hanging fruit. So, the very first thing in the body of your
HTML should be an empty link which goes straight into the trap proper
- not the warning level, but the actual trap itself. This is because
there is no way for someone using a real browser to click on this
link, and good spiders will ignore it anyway because it's in the
robots.txt file.
You could also use fake HTML tags in between the link anchors, something like this:

<A HREF="/squirrel/guestbook/message/"><faketag></faketag></A>

(Thanks to Paul Williams for that one.) Incidentally, Joseph Wasson points out that users can still accidentally find these "hidden" links if they use the TAB key to cycle through links. One more thing to consider. Still, this will happen relatively infrequently, and those unfortunates who do fall in the trap by mistake should at least have an email address to fall back on - the one displayed on the "You've been blocked" page. Come to think of it, you might want to offer that email address as more than just an image, so that text-based browser users can get themselves unblocked too!

We also use a one pixel transparent GIF (a favorite web bug technique) to anchor a link to the trap, just in case the spambot is smart enough to avoid empty links. If we put this as the very first thing in the body, then it'll be pretty hard for a real user to click on, since it's only one pixel in size. But a spambot will quite happily go there!

Finally, there is an example of a non-graphic, text based link. This will be placed on the right side of the screen by the table, and the text will appear in the same color as the background (in this example, beige). The link does not go straight into the trap, but into the warning level, because with this one there is a bigger chance that real people could click on it accidentally. The link may be invisible, but it's still there, and someone could find it. So, they get to see a nice warning, and they should back off from there. But the spambot won't. By the way, we have the link going to /squirrel/guestbook/ rather than just /squirrel/ because some of the spambots seem to specifically follow links with certain keywords, e.g. 'guestbook', 'message', 'post', etc.

One caveat: These single-pixel images and "invisible" links will show up on browsers for the blind, and other text-to-speech browsers. Moreover, they won't be able to read the warning image! So, you might be more comfortable just using the empty link option (not sure if braille browsers follow those too...). Something to think about. You could also make the warning text plain text rather than an image; in reality, I doubt the spambots parse any meaning from text. Another idea: Try putting some warning on the front page of your site, to the effect that the spambot trap is there, perhaps with a link to a page where they can find out more. Finally, you might put ALT text in the IMG tag so that people with text browsers can at least get a clue. Try to use non-obvious text, not stuff like "SPAMTRAP WARNING DO NOT CLICK", which is the sort of thing a spambot might be programmed to recognise... perhaps a haiku:
shall see much time pass
before he is forgiven

Walter Loscutoff suggests putting the hidden links in a DIV, which is set to be invisible. Then the trap code could look perfectly normal to the bot, but be invisible to a normal user. For example (the following code goes between the BODY tags):
<DIV ID="SpamTrap1DIV" STYLE="position:absolute; left:0; top:0; width:50; height:50;
clip:rect(0,50,50,0); z-index:1; visibility:hidden;">
<A HREF="/squirrel/guestbook/email/">Click here for emails</A>
</DIV>
And a second option is to simply make the DIV visible, but place it off page ...
<DIV ID="SpamTrap1DIV" STYLE="position:absolute; left:-100; top:-100; width:50; height:50;
clip:rect(0,50,50,0); z-index:1;">
<A HREF="/squirrel/guestbook/email/">Click here for emails</A>
</DIV>
Both of these options would be a headache for a spambot coder, because
many sites use DIVs that might start invisible or off-page - thus the
spambot has no way of knowing what might be a trap and what's regular
HTML.
You can sprinkle these hidden links all around your HTML files. I put them in every single one, since I use Embperl templates which make that sort of thing very easy. Finally, it's possible for normal browsers to fall into the trap via keyboard shortcuts - in Internet Explorer, the Tab key cycles through links on the page. If the trap is one of the first links (as it should be) then hitting Tab followed by Enter might get the unwary visitor blocked by accident. One way to prevent this is to add an onClick javascript handler to the link, which tells the browser not to follow the link by returning 'false'. Unfortunately IE does not honor this convention, so a little extra is needed to make it happen for most Windows users:
<A HREF="/your/trap/dir/" onClick="event.returnValue = false; return false;"></A>
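Putting the variants together, the hidden trap links described above might look something like this in a page. This is a sketch: the /squirrel/ paths, the pixel.gif image path, and the beige color value are just this article's running examples, and the ALT text is deliberately non-obvious per the advice above.

```html
<!-- 1. Empty anchor (with the onClick guard against accidental Tab+Enter) -->
<A HREF="/squirrel/guestbook/message/"
   onClick="event.returnValue = false; return false;"></A>

<!-- 2. Anchor wrapped around a fake tag, so the link isn't empty -->
<A HREF="/squirrel/guestbook/message/"><faketag></faketag></A>

<!-- 3. One-pixel transparent GIF, placed as the very first thing in the
     BODY - too small for a real user to click on -->
<A HREF="/squirrel/guestbook/message/"><IMG SRC="/images/pixel.gif"
   WIDTH="1" HEIGHT="1" BORDER="0" ALT="autumn leaves drift by"></A>

<!-- 4. Text link in the same color as the page background (beige here),
     pointing at the warning level rather than straight into the trap -->
<FONT COLOR="#F5F5DC"><A HREF="/squirrel/guestbook/">guestbook</A></FONT>
```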
|
Embperl |
Embperl is a very nice templating solution for embedding Perl in your HTML pages, making it all very dynamic. I use it for all my web development. It also has features which make it easy to construct your websites in a modular, object-oriented manner (I wrote a tutorial for EmbperlObject). The point of this is to make it easier to change the spambot trap directory without having to edit a whole bunch of files. We pass an environment variable to Perl from httpd.conf (see below), which says what the trap directory is called. We then use this in Embperl to substitute into the HTML and robots.txt files at request time. Thus if we wanted to change the name of the trap from 'squirrel' to 'badger', then we only need to change httpd.conf, restart apache, and we're done. All the links in the HTML are dynamic, as is robots.txt (see the samples above). Now, we bring it all together in the Apache configuration file. |
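As a concrete illustration of those dynamic links, a trap link inside an Embperl template might look something like this (a sketch; SPAMBOT_TRAP_DIR is the environment variable passed in from httpd.conf):

```html
[# A hidden trap link built from the SPAMBOT_TRAP_DIR environment
   variable, so renaming the trap only requires an httpd.conf change
   and an Apache restart #]
<A HREF="/[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/message/"></A>
```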
httpd.conf |
You need to have mod_perl installed before you can use BlockAgent.pm. You should take a look at the sample given above, and integrate these directives into your own virtual hosts. The most important lines are:

Alias /squirrel /www/spambot_trap
PerlSetEnv SPAMBOT_TRAP_DIR squirrel

You should set the 'squirrel' name to whatever you'd like for your website; you'll then access the trap using a URL something like http://www.example.com/squirrel/guestbook/message/. This will spring the trap. You also need to set up the BlockAgent.pm access handler:

<Location />
    PerlAccessHandler Apache::BlockAgent
    PerlSetVar BlockAgentFile /www/conf/bad_agents.txt
</Location>

This ensures that all accesses to your website will go through BlockAgent.pm first. You should choose your own location for the bad_agents.txt file. Finally, you might want to install Embperl so that you can embed Perl into your HTML code (always executed on the server side, never seen on the client side):

# Set EmbPerl handler for main directory
<Directory "/www/vhosts/www.example.com/htdocs/">
    # Handle HTML files with Embperl
    <FilesMatch ".*\.html$">
        SetHandler perl-script
        PerlHandler HTML::Embperl
        Options ExecCGI
    </FilesMatch>
    # Handle robots.txt with Embperl
    <FilesMatch "^robots.txt$">
        SetHandler perl-script
        PerlHandler HTML::Embperl
        Options ExecCGI
    </FilesMatch>
</Directory>

That about does it. You should now have the setup which will allow you to block spambots. You'll probably be interested in monitoring what happens... |
Monitoring |
This simple script just tails the badhosts_loop log. You'll have fun (I do) seeing what comes onto your site and promptly falls into the trap, and then SPLAT. No more spambot. Heh heh heh. |
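The original script isn't reproduced here, but a minimal equivalent sketch looks like this (the file name and log path are assumptions; point it at wherever your badhosts_loop writes its log):

```shell
#!/bin/sh
# Minimal monitoring sketch for the badhosts_loop log: list the distinct
# hosts currently blocked, based on the log's "Adding <ip> ..." lines.
blocked_hosts() {
    grep 'Adding' "$1" | awk '{print $2}' | sort -u
}

LOG="${1:-/var/log/badhosts_loop.log}"
[ -r "$LOG" ] && blocked_hosts "$LOG"

# For live watching, simply follow the log instead:
#   tail -f /var/log/badhosts_loop.log
```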
Conclusions |
This setup works pretty well for me at the moment. I've no doubt there
are flaws in my design, but it seems stable and is "good enough" for
the time being. If you can see any improvements then I'd love to hear about them. To finish up, here's a summary
of the strengths and potential weaknesses of the Spambot Trap system.
|
Strengths |
|
Weaknesses |
|
Possible future enhancements |
|
Alternative Ideas |
If you find the concept of blocking IP addresses distasteful (and I
would understand that), then there are other ideas which have been put
forward by a number of people. These include just feeding the spambot
lots and lots of fake garbage email addresses (e.g. Wpoison). This "poisons"
the spambot's harvest. Or, slow it down: Feed the spambot very slow-loading
pages. Or even try to crash them with badly formatted pages
(I love that one - exploit any possible buffer overflow bugs in the
spambot by feeding it an email address thousands of characters in
length)...
These are interesting ideas which are more pro-active, but they also involve more in the way of resources on your server. Every slow-loading connection means one Apache process being tied up for the duration of the request. But, it's a worthy goal - I think if people try out a variety of methods, then the biodiversity of it all will also be harmful to the spambots. The more things they have to guard against, the better, in my opinion... I may try to incorporate these concepts in future versions of the spambot trap. A choice of methods can only be a good thing... |
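To illustrate the "poisoning" idea, here is a tiny Wpoison-style sketch. To be clear, this is an illustration of the concept, not Wpoison itself; every name in it is made up.

```shell
#!/bin/sh
# Emit fake mailto links made of random letters, to "poison" a
# spambot's harvested address list.
fake_link() {
    # Build one mailto link from random lowercase letters
    user=$(head -c 512 /dev/urandom | tr -dc 'a-z' | head -c 8)
    host=$(head -c 512 /dev/urandom | tr -dc 'a-z' | head -c 8)
    printf '<A HREF="mailto:%s@%s.example.com">%s</A>\n' "$user" "$host" "$user"
}

# A page fragment of five poison links
for n in 1 2 3 4 5; do fake_link; done
```

On a real site you would serve an endless chain of such pages from a dynamic handler inside the trap directory, so that only spambots ever see them.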
Appendices |
Update 1: April 26th 2002: Evolution in action - the Spambots Strike Back |
Changing expiry times to minutes rather than days

Since the original publication of the article I decided to change the units of the block expirations from 'days' to 'minutes'. Having thought about the problem of accidental and 'one-off' offenders, it seemed like a good idea, and seems to work just fine. Now, the badhosts_loop cycles every 10 minutes (rather than every day). Initial blocks expire in 1 minute, but may be in force for up to 10 minutes (until badhosts_loop gets around to regenerating the blocks). This seems to be a good compromise between measuring expirations in minutes, and not having TOO small a block expiry time - being blocked for 10 minutes is no big deal, and is enough to make most spambots give up for a while. And if they come back, the blocks just double each time.
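The doubling behavior is easy to state concretely: each repeat offense earns a block twice as long as the previous one. A sketch (variable names are illustrative, not the real script's):

```shell
#!/bin/sh
# Sketch of the escalating block times: 1 minute initially,
# doubling on each repeat offense.
block_schedule() {
    m=1
    for offense in 1 2 3 4 5; do
        echo "offense $offense: block expires in $m minute(s)"
        m=$((m * 2))
    done
}
block_schedule
```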
Evolution in action - the Spambots Strike Back

I thought all was well until last night, when I noticed some suspicious activity on my crazyguyonabike.com server logs. There seemed to be a lot of activity in the guestbook, which was the behavior I had seen before from the DSurf et al spambots. But, this thing wasn't falling into the trap. I quickly checked the User-Agent, and found it was "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)", which is quite a common browser. On looking at the logs more closely, I quickly realized that the spambots seemed to have gotten much, much smarter. Here are the relevant access, agent and referer log entries. I could be wrong about this, in which case I apologize in advance to the person who owns this IP address (apparently it resolves to tmp002048081386.STUDENT.CWRU.Edu - but naahhh, a student would NEVER be a source of this kinda thing, right???) (There's a new twist to this story - see the next update)... but based on what I've seen so far, I'm thinking the spambot is doing the following:
In a nutshell, this particular spambot seems to have "evolved" to the point it is able to come into your website, one page at a time, working from search engine results. It looks just like a standard browser. It doesn't have to follow any links from your pages. So, how on earth do you block a spambot that looks exactly like a person? (It should be noted that not all spambots are doing this - I have just observed the one instance. But you can be sure that if it works, then it will spread. However, the Spambot Trap is still very useful for stopping badly behaved robots and other, more stupid spambots - of which there are many out there.) The long term answer, I think, is that you just can't easily block a spambot that behaves exactly like a real browser. You could do all kinds of tricks, like requiring JavaScript, or cookies, or analysing behavior such as the non-loading of images or lack of a referer field. But in the end, all these will be smoothed over as the spambot writers add features to the toolkit. We just have to accept that the things will eventually look exactly like people browsing our site. I for one am not too keen on requiring JavaScript or cookies on my site, since I personally turn JavaScript off for security reasons, and also to stop those annoying popup ads. And a lot of people have privacy concerns over cookies. And anyway, all the spambot authors have to do is incorporate cookies (can't be all that hard). The JavaScript engine would be more tricky, but again - I'm not too keen on requiring JavaScript as a fundamental foundation of the internet. It's introducing just another level of complexity on something that was beautifully simple - HTML and HTTP. In the short term, I believe I've found a way to block the new spambot. I'll experiment to see if the technique is effective - but would it be good to publish it here? Maybe, maybe not. 
The argument for total openness says that everyone should know about a technique that successfully blocks a certain spambot. But then the spambot authors also get to hear about it, and promptly plug the hole. So - better to keep our little weapons to ourselves, and keep the spambot writers in the dark, or better to be totally open, and then have the block neutralised quickly? That's a tough one... So, why give the spambot publicity at all? Because I don't believe that we are best served by silence on these issues, in the wider context. The open source community has proved again and again that openness is the best policy when dealing with bugs, security issues and other threats. I am trying to bring this issue to the attention of a larger audience in the hope that some smart person out there will be able to figure out the next step in the evolutionary process. Of course, the final answer is to remove email addresses from your website altogether, or else obfuscate them using one of the techniques mentioned earlier in this document. That's fine. But I am still incensed that I am required to play host to these things, especially given how repulsive I find spam in general. The spambots continue to use up my server resources, and that makes me mad... I am thinking about ways to automatically analyse the behavior of the spambots so that I can block them based on their actions. Perhaps keep a database of IP addresses of hosts requesting documents. Using a fast, simple database like MySQL that shouldn't be a problem - anyway, just about every page in my website is already dynamically generated, so it's no big deal. If we keep track of the activity, we could note that a) no images have been loaded, b) the User-Agent says it's a browser rather than a spider, and c) no referer fields - the current behavior of the spambot would betray it. 
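The pattern just described - many page requests, no image requests, a browser User-Agent, no referer - can already be spotted with a quick scan of a standard Apache access log. Here is a sketch, assuming the common "combined" log format; the 20-page threshold and the log path are arbitrary choices for illustration:

```shell
#!/bin/sh
# Flag IPs that fetched many pages but never loaded an image and never
# sent a Referer - the suspicious combination described above.
# Assumes Apache "combined" format: $1 = IP, $7 = path, $11 = "referer".
suspicious_ips() {
    awk '$7 !~ /\.(gif|jpe?g|png)/ { pages[$1]++ }
         $7 ~  /\.(gif|jpe?g|png)/ { imgs[$1]++ }
         $11 != "\"-\""            { ref[$1]++ }
         END { for (ip in pages)
                   if (pages[ip] > 20 && !(ip in imgs) && !(ip in ref))
                       print ip }' "$1"
}

suspicious_ips "${1:-/var/log/apache/access_log}"
```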
But, as I mentioned earlier, it would be relatively easy for the spambot to start loading images (and just discarding them), and passing in referer headers. I also thought about using the behavior to detect (after a few requests) some suspicious activity, and then redirecting the browser concerned to a page which requires a human to answer some kind of question in order to validate themselves as a non-spambot. For example, if a particular IP address loads a lot of pages, but no images, and claims to be Mozilla compatible, AND provides no referer fields, then that might be good criteria for a quick checkup. It's just a vague idea at this point. It could be some kind of basic multiple choice form, just enough to require a real person. Obviously there should be a database of questions, which needs to be updated regularly, to stop the spambots being "taught" the correct answers. After submitting the form, the user is then returned to the page where they were before, and can continue. I don't know how feasible this would be.

It even crossed my mind to have some kind of third party, non-profit website, something like "personvalidator.org", which could be used to validate people from multiple websites using this method. The questions and answers could then be centralized on a site which is specifically designed to root out non-humans. The website could have some kind of very simple CGI API for passing in the original web page, so that the user can be redirected back again after validation. This may be a silly idea, but it's worth at least thinking about...

An alternative to the multiple choice questions is to have an image which only a human can read, and they then have to type the text or number into a form before continuing. This has already been implemented on some sites, including Go Daddy Software's DNS lookup page (here's an example). Another famous example is My Yahoo!, where you have to read an image while registering. 
They have a special workaround for blind people, who are instructed to call Yahoo!'s customer care department and it all happens over the phone. I wonder if the spambots will eventually develop the ability to converse with people over a telephone? Perhaps they will then be talking with a Customer Care bot on the other end. Shudder... An interesting research example is Captcha, a project run by the Carnegie Mellon University School of Computer Science. According to their site, "CAPTCHA stands for 'Completely Automated Public Turing Test to Tell Computers and Humans Apart'. The P for Public means that the code and the data used by a captcha should be publicly available. Thus a program that can generate and grade tests that distinguish humans from computers, but whose code or data are private, is not a captcha." - Very interesting research.

In Summary

Of course, I realize that I could just sit back and relax and let these things try their darndest to get email addresses, because they won't succeed. After all, it's not exactly going to kill my server. But that's not the point. Let me be clear: I REALLY HATE SPAM. I detest the way that the open, co-operative, sharing nature of the internet is being co-opted by these people. They are forcing everyone to be more closed, more fearful of doing anything on the Web. I resent being forced to let these things surf my website freely.

So. What will the next step in the evolutionary process be? Will we be forced to just live with these things, these software versions of mosquitoes, as being part of the natural ecology of the internet? Or can there be a technical solution to the spambot pest? Ideas most welcome... |
Update 2: May 1st 2002: The author of the new, scary spambot comes forward |
The author of the new, scary spambot comes forward

Shortly after the second slashdot article, I got an email from someone who claimed to be the author of the new, improved spambot. Checking the IP address of the sender against my server logs, I was able to confirm that this was indeed the same as the source of the spambot. He explained in some detail that this was just an experimental academic project of his, and it wasn't being done in any kind of commercial context. This is not related to DSurf, PBrowse, QYTRWYTR et al. He was quite embarrassed to have the thing exposed in this way, and he apologised for the inconvenience... I am inclined to believe him, for what it's worth. It's nice to have some openness for a change in this sordid arena... to tell you the truth, I am both gratified and mortified. Gratified that he should have come forward so quickly to assure me that this was an innocent experiment in Web spidering, and mortified that I was perhaps giving the spambot writers a nice little template on how to write their next generation of spambots. D'OH!

I know, some of you will reply, very cynically, that the guy is just trying to cover his ass in the wake of being exposed - and who knows, you may be right. But, I am inclined to believe my intuition, and given the tone of this guy's email and the sheer amount of detail/context he has provided in a very short space of time, I really think he's telling the truth. It's an interesting project, after all, spidering. He has written me quite detailed emails about his project, and I for one am reasonably satisfied that he is for real. However, regardless of the source of this particular spambot, it doesn't change the basic message - the spambots can (and will) evolve beyond their current rather basic state. The question is, what do we do about it? 
In an interesting way (to continue the nature analogy), we could look at this particular spambot as being like a harmless version of a virus which we use to vaccinate people against the real thing. It may have done us a service, by demonstrating what can be done, without actually doing it for real. But still, what to do about the real thing? One solution which has been suggested by a few people is to just block all spiders (including google) from the parts of the website that include stuff like guestbooks and message boards. To me this is an overkill solution that essentially means taking apart the World Wide Web as we know it, fragmenting and segmenting it so that it is no longer a comprehensively connected network of nodes. The vast majority of people these days surf via the search engines, so this is just too drastic for me. In any case, my community site is all about bicycle touring, and the guestbooks and message boards contain lots of tips and suggestions which could be interesting to other people. It's part of the whole reason for having community websites. So the idea of hiding this stuff seems counter productive. However, this approach could work for someone who isn't all that concerned about being found on google - so it's worth at least considering. Another idea involves generating dynamic URL's within the site, so that the structure is constantly changing - in other words (assuming the website is totally dynamically generated) all the links from page to page have some component that "decays" and is invalid after a while. So, you allow google to surf the site, but all the links it gathers are effectively useless. When people find stuff via google they are redirected to the front page of your site, where they navigate manually to find what it was they were after. Again, to me this seems too extreme. I want people to find stuff on my site using Google! I don't want to throw the baby out with the bathwater... 
So, there are lots of good ideas out there, all of which are worth considering, even if you eventually decide that they are not for you. Chances are, people will come up with all kinds of clever tricks to counter these beasts. And so we come to one of the classic conundrums - do you stay quiet about these things, develop your own little defenses, keep your head down and hope that it doesn't get any worse? Or do you expose the beasts to the cold light of day and examination by many thousands of eyes (and over 20,000 people viewed the article the first time around), thus ensuring that other webmasters are at least aware of the kind of things that are out there, prowling their sites? It's very true that openness allows the spambot writers to hone their tools. But I think that the "open source" model of co-operation in the webmaster community calls for full disclosure and open discussion of these practices. To hope that these tools will somehow not proliferate is wishful thinking - and futile, in my opinion. Better to just assume that these techniques will become common knowledge among those writing spambots. So why not have it become common knowledge among the victims? This has at least crystalised a thesis that has haunted me for some time now: That our current ability to foil the spambots depends mostly on foibles and flaws in these programs that are actually very simple for the spambot developers to correct, if they put their minds to it. I think it's just that, up to now, they really haven't had to try all that hard. But, as websites get more "prickly" and develop defenses against "hostile" spidering, it's also inevitable that if it remains profitable to scrape web pages looking for email addresses, then the spambots will "evolve" eventually to look just like standard browsers coming onto your site. 
It then becomes even more urgent that we respond in the only way we can: Remove all email addresses and other personal information that can be machine-read from our websites. Use contact forms, image files, JavaScript, whatever it takes - but just ensure that these spambots cannot freely harvest our personal information. In the meantime, the Spambot Trap (and other tools like it - see the links below) can help to stem the rising tide of website abuse! Good luck... |
Update 3: June 29th 2002: A possible way to stop spambots that pretend to be browsers |
One possible way to stop spambots that pretend to be browsers

The "doomsday scenario" was that eventually spambots would evolve so that they look just like ordinary browsers, whereupon it would become very difficult to distinguish them from real people. However, it turns out that it is in fact possible to corner these beasts too. Here's how it works: We take advantage of the fact that the spambot is pretending to be a browser, by using the User-Agent header.

I assume you are using the spambot trap as described above. If so, then you have hidden links in your HTML which go to the trap directory. Ordinary users should not follow these links, because they are effectively hidden from normal browsers, and good robots will avoid the trap because it is included in robots.txt. What we do to fool the "stealth" spambots (that pretend to be browsers) is to take advantage of the fact that all browsers (except for Lynx, which hardly anyone uses for actual browsing anymore) have a User-Agent string which begins with "Mozilla". Internet Explorer, Netscape and Opera all follow this convention. None of these browsers will ever request robots.txt in the normal course of their operation; of course, a user could explicitly request the file, but we can safely assume that normal browsers don't need to ask for it. Therefore, we use the User-Agent string to determine what is provided in robots.txt. To non-browsers (e.g. googlebot) we give the full file, including the warnings to avoid the spambot trap. To User-Agents which start with "Mozilla", we dynamically remove those warnings. This won't make any difference to everyday users, but a spambot which is attempting to masquerade as a real person now has no way of avoiding the trap - if it follows any links on the page at all then it will soon end up there. To accomplish this, I use Embperl to handle robots.txt, so that I can embed conditional code. Here's my new robots.txt:
User-agent: *
Disallow: /somedir/
Disallow: /some_other_dir/
[$ if $ENV{HTTP_USER_AGENT} !~ /^mozilla/i $]
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/post/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/message/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/email/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/contact/
[$ endif $]
Of course, it's still true that a spambot could request only pages
linked directly from google, and never actually follow any links from
our pages. This is obviously much harder to detect, but in my
experience the current generation of spambots do follow links,
and consequently are vulnerable to this approach. If spambots ever
became smarter than this, then we would just have to detect something
else, like the fact that they never load images and don't pass
cookies. As I said before, it's an arms race. Currently, spambots are
still pretty dumb, so this method will work. As long as most websites
out there are not employing spambot traps, it's not really worth the
trouble for the spambot authors to make them smarter.
Speaking of google, some people have mentioned that a spambot could in theory avoid the trap altogether by simply loading the pages that google has in its cache. This is an interesting point, but also easily solved. When serving pages to google (or, actually, any non-Mozilla user agent), simply remove all email addresses (even obfuscated ones) from the pages you serve. You could replace them with some text, or even a link to your real page. Of course, this requires a dynamic content filter such as mod_perl, Embperl or PHP, but it is a solution that works. The idea is that it really doesn't matter what pages the spambot gets from google, because there are no email addresses there at all, not even obfuscated ones that could potentially be decoded by a clever bot. Thus the problem becomes google's - it's now their web server resources being consumed, not yours. Problem solved. Alternatively, if you'd prefer that Google (and other sites) not cache your pages at all, then you can add the following tag to the HEAD section of your pages:

<META NAME="ROBOTS" CONTENT="NOARCHIVE">

The Google.com help for webmasters section includes this and other suggestions. The spambot trap has been working well on my system for a few months now, and is very stable. 
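The email-stripping filter itself can be very simple. As a sketch (shown here as a shell filter for clarity; in my setup the equivalent would live inside the Embperl/mod_perl layer), anything resembling an email address is replaced before the page goes out to a non-Mozilla agent:

```shell
#!/bin/sh
# Replace anything that looks like an email address with a placeholder.
# The pattern is deliberately loose: better to strip too much than to
# leak a real address to a harvester.
strip_emails() {
    sed 's/[A-Za-z0-9._%+-]\{1,\}@[A-Za-z0-9.-]\{1,\}\.[A-Za-z]\{2,\}/[email removed]/g'
}

# Example: filter a page on its way out
#   strip_emails < page.html
```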
Here's the latest from my badhosts_log (see the previous section on badhosts_loop for interpreting the log):

Sat Jun 29 10:05:36 2002: Flushing blocks chain:
Generating blocks list:
Adding 68.13.151.24 (15) 2002-06-07 20:45:45 to 2002-06-30 14:53:45 YUYURRSYAA
Adding 24.100.224.110 (14) 2002-06-18 09:44:59 to 2002-06-29 18:48:59 YYCFAWZ
Adding 24.101.97.21 (14) 2002-06-18 11:06:57 to 2002-06-29 20:10:57 RJJVAS
Adding 68.4.200.220 (15) 2002-06-20 12:51:45 to 2002-07-13 06:59:45 HIBGMMPBNK
Adding 12.226.164.219 (15) 2002-06-20 18:43:55 to 2002-07-13 12:51:55 LLJZSPJPBZKG
Adding 66.176.44.203 (14) 2002-06-24 02:54:16 to 2002-07-05 11:58:16 VVDHDXDHUHR
Adding 66.185.84.202 (15) 2002-06-24 11:27:31 to 2002-07-17 05:35:31 ZBBGUCDP
Adding 24.101.39.246 (14) 2002-06-25 13:53:37 to 2002-07-06 22:57:37 WACLEDZYGTU
Adding 24.120.185.130 (12) 2002-06-26 15:22:52 to 2002-06-29 11:38:52 DBrowse 1.4b
Adding 208.6.163.83 (12) 2002-06-26 23:29:08 to 2002-06-29 19:45:08 DBrowse 1.4b
Adding 68.5.169.46 (14) 2002-06-28 17:55:39 to 2002-07-10 02:59:39 ODXNZHX
Adding 216.78.174.6 (11) 2002-06-28 21:44:14 to 2002-06-30 07:52:14 XUQSXFOVQABF
Adding 211.101.236.91 (12) 2002-06-28 22:42:20 to 2002-07-01 18:58:20 Mozilla/3.0 (compatible; Indy Library)
Adding 24.101.56.15 (15) 2002-06-29 00:05:57 to 2002-07-21 18:13:57 VWXTOGR

As you can see, the random-letter-generating spambots seem to be dominating the field now. There are others which appear all the time, two persistent examples of which are "JBH Agent 2.0" and "MFC Foundation Class Library 4.0". You need to keep an eye on the badhosts database to see what is falling into the trap all the time. If it's an easily identified user agent then you can just add that to the bad_agents.txt file, and block it altogether. Once you know something is bad, there's no reason to let it on your site at all - just give it 403, and block it immediately. 
I hope all this is useful, if only to serve as an example of how you can block spambots from your website. I'm always open to new ideas, and feedback from people who have implemented the spambot trap on their own systems. Let me know if there's anything I can do to help clarify the methods I have documented here. |
Update 4: October 15th 2005: Dealing with Googlebot 2.1, which pretends to be a browser |
Dealing with Googlebot 2.1, which pretends to be a browser

Google has started experimenting with a new version of their crawler that pretends to be a browser. It uses something like the following User-Agent string:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This presented a problem for my dynamic robots.txt, because I could no longer count on being able to distinguish between bots and browsers (even bona fide ones) by the User-Agent - remember that the convention has been for browsers to start their User-Agent string with "Mozilla/x.x (Compatible; ...". In the past, that has been a reliable way to distinguish browsers from bots (assuming they are playing by the rules). But now Google obviously wants their bot to be perceived as just another browser. So, the new Googlebot started falling into the spambot trap, because when it requested robots.txt, it was seen as a browser, and so it didn't get the spambot trap directories. So, we needed some way around this - how to allow "good" bots such as googlebot, while still fooling the spambots? The solution turns out to be that we simply have to be a bit smarter about where the bot is coming from. Pretty much the only way to tell if a bot is really from Google is if it comes from one of Google's subnets. After a little hunting around I managed to get a short list of the IP address ranges that Googlebot is likely to be coming from. Then, it was a question of recognising these ranges, and acting accordingly. So, here is robots.txt.new (using, as before, Embperl to embed Perl code in the file):
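The original robots.txt.new is not reproduced here, but the idea can be sketched as follows. Note the address prefix shown is illustrative only; check Google's currently published crawler ranges rather than trusting this example.

```
User-agent: *
Disallow: /somedir/
Disallow: /some_other_dir/
[- $is_browser = $ENV{HTTP_USER_AGENT} =~ /^mozilla/i;
   $is_google  = $ENV{REMOTE_ADDR}     =~ /^66\.249\./;  # illustrative range
-]
[$ if !$is_browser or $is_google $]
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/
Disallow: /[+ $ENV{SPAMBOT_TRAP_DIR} +]/guestbook/
[$ endif $]
```

In other words: the trap warnings are served to anything that doesn't claim to be a browser, and also to "browsers" that are really coming from Google's subnets, so the genuine Googlebot stays out of the trap while masquerading spambots still fall in.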
|
Update 5: October 21st 2006 - The Attack of the Botnet |
I noticed on Tuesday 17th that crazyguyonabike was catching a large number of spambots falling into the trap. Checking the server logs, it quickly became apparent that this was some kind of botnet. Here are some log snapshots showing what I'm talking about. It looks like a botnet because a) the offending hosts change very quickly, b) the ip addresses mostly resolve to dialup, DSL or cable addresses (i.e. ordinary computers) and c) all the User-Agent signatures are identical (thus suggesting that this is one program, spread over many different computers). So it's probably some kind of virus or worm that spreads through vulnerable Windows computers (at a guess), and then operates without the owner's knowledge.
The bots all come in directly to the guestbooks on the site, suggesting use of a search engine, and they have a predilection for visiting the 'permalink' links. I noticed that they also try to post on the guestbooks, but are stymied by the preview step. Making the forms POST rather than GET makes no difference here, but they don't seem to have (yet) figured out how to go through a preview process. Out of curiosity I decided to capture whatever it was they were posting, and sure enough it turned out to be spam, linking to a website in several different ways (all of which are, I assume, different ways of marking up links on different blogging platforms). Here's an example (I have inserted spaces to prevent the links from working):

    http: // dsrrbaafefrfd. host. com desk3
    [url=http: //dsrsbaafefrfd. host. com]desk4[/url]
    [link=http: //dsrabaafefrfd. host. com]desk6[/link]

The links are different each time; I think they are generating random characters for the subdomain. When you go to anything.host.com, you can see it is some kind of link farm. When you do a whois on host.com, you get this:

    Visit AboutUs.org for more information about host.com
    AboutUs: host.com
    Registration Service Provided By: Web Development LLC
    Contact: admin@development.com

    Domain name: host.com

    Registrant Contact:
       Web Development LLC
       Administration Domain (admin@development.com)
       +1.8662635742
       Fax:
       P.O. Box 570002
       Whitestone, NY 11357
       US

    Administrative Contact:
       Web Development LLC
       Administration Domain (admin@development.com)
       +1.8662635742
       Fax:
       P.O. Box 570002
       Whitestone, NY 11357
       US

    Technical Contact:
       Web Development LLC
       Administration Domain (admin@development.com)
       +1.8662635742
       Fax:
       P.O. Box 570002
       Whitestone, NY 11357
       US

    Status: Locked

    Name Servers:
       DMNS1.YAHOO.COM
       DMNS2.YAHOO.COM
       DMNS3.YAHOO.COM

    Creation date: 22 Aug 1994 00:00:00
    Expiration date: 21 Aug 2008 00:00:00

I tried writing to the email address (mikeb@hpnet.com) that is listed at www.aboutus.org/host.com, but I have no great hope of this doing any good. Finally, I have added a live badhosts snapshot which gives the current blocks taken directly from the database. |
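The preview step that stymies these bots can be approximated in a few lines. This is only a sketch of the general idea, not the guestbook's actual code (the names and the secret are mine): a post is accepted only when it arrives with a token that was issued along with the preview page, so a bot that submits the form once, blindly, never produces a valid token.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# A server-side secret; assumed for this sketch.
my $secret = 'change-me-per-site';

# Issue a token when rendering the preview page for a given message body.
sub preview_token {
    my ($text) = @_;
    return md5_hex($secret . $text);
}

# Accept a post only if it carries the token from the preview step.
# A first, blind submission has no token, so it gets a preview page
# instead of being posted.
sub accept_post {
    my ($text, $token) = @_;
    return 0 unless defined $token;
    return $token eq preview_token($text) ? 1 : 0;
}
```

In practice the token would be emitted as a hidden form field on the preview page; the bot would have to parse that page and resubmit to get its spam through, which this particular botnet evidently does not do.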
Update 6: November 30th 2006 - Defending against botnets |
I have found that the botnet of spambots that has been hitting my site can easily clog up the firewall block list if I'm not careful. I am not sure how many blocks ip_tables can handle before it starts to affect CPU load, but 200 seems like a scary number of blocks. So I had to try to modify the trap to recognize the botnet and forbid it, but not block it.
Unfortunately I don't think it's a good idea for me to publish what I did here, because it would only tip off the spambot authors as to what they have to do in order to circumvent my measures. Suffice it to say, it is possible to spot patterns, and accordingly to forbid the bots, without adding everything to the firewall. This way you can stop them from falling into the trap and clogging up your drain en masse, while still stopping them from surfing your site. The battle continues.

The big news lately has been about botnets being used to send spam via email. There has been little or no mention of botnets being used for spambot activity (i.e. trawling websites). This latest spambot botnet seems to be interested in posting spam on guestbooks and forums, rather than looking for email addresses. This presumably has to do with trying to post links to their websites, to boost their Google PageRank.

It seems to be an inevitable fact of life that for any good, useful, open system, there will appear people who try to exploit it for their own selfish gain, without a care for the consequences of their actions. No matter what you create, you must think not only about how it might be used, but also how it might be abused and turned on its head by the assholes of the world. Sad, but true. |
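For readers unfamiliar with the mechanism being described (forbidding without firewalling), here is a generic illustration. The pattern shown is deliberately NOT one of the actual signatures used on these sites; it just shows how Apache alone can return 403 Forbidden to a matching request, so the bot is stopped without ever entering the iptables block list:

```apache
# Generic illustration only: match a known bot signature with SetEnvIf,
# then Deny on the resulting environment variable. The bot gets a 403
# from Apache; the firewall chains stay short.
SetEnvIf User-Agent "^SomeKnownBotSignature" suspected_spambot
<Directory "/var/www/htdocs">
    Order Allow,Deny
    Allow from all
    Deny from env=suspected_spambot
</Directory>
```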
Update 7: August 2009: Spotting new botnets |
The spambot trap has been working well on my websites since 2001, and I've been very pleased with the results. The number of hosts in the blocklist has never grown beyond about 100 to 140 or so. For a long while now I have just let it do its thing, but recently I started getting some very odd registrations on my community websites (crazyguyonabike and topicwise). The registrations were characterised by the fact that the first and last names were obviously fake - e.g. "Rama Chandra RamaChandra". They looked like bots, so I investigated to see if I could spot any patterns to the requests.

Sure enough, something immediately jumped out at me: in many cases, the referer was set to be identical to the request URI. This is not normal behavior for browsers. So I added a test for this, and have been catching quite a few instances of what seems to be a new type of bot. I checked out the activity on the IP addresses that were caught in this manner, and sure enough in every case they didn't seem to be "normal" users. Often there were multiple requests per second, to pages that were completely unrelated. Obviously bots of some kind.

In any case, the spambot trap has started blocking them, which is a Good Thing in my book. Frankly it's shocking how many bad agents there are out there... without a tool like this spambot trap, your website really is at the mercy of all the assholes out there. I like having at least some control over who and what gets to trawl my sites and use up my bandwidth and resources.
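The referer-equals-URI test described above can be sketched along these lines. This is a minimal illustration, not the trap's actual code; the helper name is mine, and I'm assuming the comparison is done on the path portion of the Referer header:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Flag a request whose Referer points at the very page being requested.
# A real browser sends the *previous* page as the referer, so
# referer == request URI is a strong hint the client is a bot
# filling in headers mechanically.
sub is_suspect_referer {
    my ($uri, $referer) = @_;
    return 0 unless defined $referer && $referer ne '';
    # Reduce the Referer to its path component for comparison.
    (my $ref_path = $referer) =~ s{^https?://[^/]+}{};
    return $ref_path eq $uri ? 1 : 0;
}
```

In a mod_perl handler this might be fed from $r->uri and the incoming Referer header, with any match handed off to the trap's forbid/block logic.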
|
Links |
Information Resources, Tips and How-to's |
The Web Robots Pages - http://www.robotstxt.org/wc/robots.html
Spambot Beware - http://www.turnstep.com/Spambot/
Archive.org and Alexa.com -- Threats To Your Privacy - http://manly.delconet.com/klahn/privacy/index.html
Defending Against Email Harvesters, Leechers, and Web Beacons - http://linux.oldcrank.com/tips/antibot/
Progressive IP Blocking - http://vamos-wentworth.org/bottrap/bottrap.html
SpamHelp - http://www.spamhelp.org/ |
Organizations and Blocklists |
SpamCon Foundation - http://www.spamcon.org/
MAPS - http://www.mail-abuse.org/
The Spamhaus Project - http://www.spamhaus.org/
Distributed Server Boycott List - http://www.dsbl.org/
ORDB.org - http://www.ordb.org/
EasyNet - http://abuse.easynet.nl/blackholes.html
|
Server-side tools |
Stopping Spam and Malware with Open Source - http://www.brettglass.com/spam/paper.html
Robotcop.org - http://www.robotcop.org/
How to Defeat Bad Web Robots With Apache - http://www.leekillough.com/robots.html
Wpoison - http://www.monkeys.com/wpoison/
Sugarplum - http://www.devin.com/sugarplum/
Using Apache to stop bad robots - http://www.evolt.org/article/Using_Apache_to_stop_bad_robots/18/15126/index.html
Stopping Spambots II - The Admin Strikes Back - http://www.evolt.org/article/Stopping_Spambots_II_The_Admin_Strikes_Back/18/21392/
Deception Toolkit - http://www.all.net/dtk/
LaBrea - http://www.hackbusters.net/LaBrea/
Teergruben - http://www.iks-jena.de/mitarb/lutz/usenet/teergrube.en.html
SpamCannibal - http://www.spamcannibal.org/
Project Honey Pot - http://www.projecthoneypot.org/
mod_spambot - http://spambot.sourceforge.net/
Bot-trap - A Bad Web-Robot Blocker - http://www.danielwebb.us/software/bot-trap/
Protect your web server from bad robots - http://www.rubyrobot.org/article/protect-your-web-server-from-spambots
|
Email obfuscation tools |
JavaScript email encryptor - http://www.jracademy.com/~jtucek/eencrypt.html
EScrambler - http://innerpeace.org/escrambler.shtml
Alicorna email obfuscator - http://alicorna.com/obfuscator.html
www.gazingus.org - http://www.gazingus.org/
Railhead Design - http://www.railheaddesign.com/
Fantomaster Email obfuscator - http://fantomaster.com/fantomasSuite/mailShield/famshieldsv-e.cgi
Avoiding the Spambots: An Email Encoder - http://www.metaprog.com/samples/encoder.htm
mungeMaster Online - http://www.closetnoc.com/mungemaster/mungemaster.pl
Email Obfuscator: A Tool For Webpages - http://sourceforge.net/projects/obfuscatortool |
Email filters |
SpamAssassin - http://sourceforge.net/projects/spamassassin/
TMDA - http://software.libertine.org/tmda/
Vipul's Razor - http://razor.sourceforge.net/
|
Anti-Spam email services |
The Spam Gourmet - http://www.spamgourmet.com/
Spamex - http://www.spamex.com/
Sneakemail - http://sneakemail.com/
Mailshell - http://www.mailshell.com/
Spam Motel - http://www.spammotel.com/
Spamcop - http://spamcop.net/
MailMoat - http://www.mailmoat.com/
|
Research projects |
Captcha - http://www.captcha.net/
|
Novel ideas |
Curbside Recording - http://www.curbside-recording.com/message.html
|
"Stopping Spambots: A Spambot Trap" Copyright © 2002-2021 By
Neil Gunton.
All rights reserved.
|