Following up on our recent Robots.txt Builder Tool announcement, I want to talk a bit about how to deal with robots that do not follow the Robots Exclusion standard. I’m sure at least some of us are familiar with the tale of Brett Tabke and his open warfare on robots hammering Webmaster World. I’m not going to go into it, but he largely solved his problem with ruthless use of Honeypots/Spider Traps.
The basic premise is this: our attack has two distinct sections. First we identify the bad robots, then we ban them.
To do this, we’ll be creating hidden links around our site and denying access to their destination with a /robots.txt directive. We will then store the IPs of the bad robots for later use.
As usual for my posts on David Naylor, we’ll be assuming a Linux, Apache, MySQL and PHP (LAMP) setup. However, the technique is really quite simple and easily adaptable to your stack of choice.
Okay, so we need a link on your site which is visible to spiders but not to human visitors. Matt gives a great tutorial on how to do it on his blog. This is technically cloaking, but Google says it’s okay, so we’re going to plough right ahead.
What we’re going to do is create a link that isn’t visible to humans, but one that a robot would pick up easily. The anchor text should be invisible, but should someone read the source or use a weird browser, it should warn the visitor not to click it. After all, if they do they’ll get banned.
I’m not going to give you precise instructions on this, because we want to avoid bot writers using heuristics to avoid honeypots. However, here are some tips:
Remember, your link needs some content inside it, otherwise most HTML parsers will skip over it.
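Purely as an illustration of the general shape (and deliberately something you should vary on your own site, for the reasons above), a hidden spider-trap link might look a bit like this, using the /badboy.php example URL from the next section:
<?php
// Illustrative only: one possible shape for a hidden spider-trap link.
// The destination (/badboy.php) matches the robots.txt Disallow rule below,
// so well-behaved bots will never follow it. Vary the markup on your own site.
echo '<div style="display:none">'
   . '<a href="/badboy.php">Do not follow this link or you will be banned from the site!</a>'
   . '</div>';
?>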
This bit is really easy. You need to create a robots.txt file inside the root of your website (that is, the top-level directory) which disallows access to the URL you chose. For example, if I decided my link should point to /badboy.php, my robots.txt file would look like:
User-agent: *
Disallow: /badboy.php
You can even use our Robots.txt Builder Tool to help you with this.
Any well-behaved bots should never access /badboy.php from now on. Make sure you upload your robots.txt file before you implement the next section.
I’m going to refer to our link (e.g. /badboy.php) as the spider trap. The rest of this tutorial will refer to /badboy.php, but please do not use this yourself.
Okay so now you want to make your spider trap. Create the page /badboy.php and open it up in your favourite code editor.
Our PHP for this is really simple: we’re just storing some environment variables in a database. I’m going to assume you can go through the rigmarole of connecting to a database and managing XSS attacks properly yourself. We should probably log a bit more than just the IPs of the bots. I also want to store their User-agent and the datetime that they visited:
<?php
require_once("DB.php");
$db = DB::connect("mysql://user:pass@localhost/database");
if (PEAR::isError($db)) die("Could not connect to database");
// if you don't know what PEAR::DB is I suggest you find out!
// Note: in PEAR DB, "?" is a quoted placeholder while "!" inserts the value
// verbatim, so now() is passed through as a function call rather than a string.
$db->query("insert into badrobots set ip=?, useragent=?, datetime=!",
    Array($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'], "now()"));
echo "You're nicked, son.";
?>
Don’t forget to add an index on that ip column in your table.
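For reference, here’s a minimal sketch of the sort of badrobots table the insert above assumes; the exact column names and types are my guess, not a spec from the post (and see the footnote below about storing the IP as an unsigned int rather than a varchar):
<?php
// Sketch only: one way to create a badrobots table matching the insert above.
// Column types are assumptions; the footnote below suggests INT UNSIGNED for ip.
require_once("DB.php");
$db = DB::connect("mysql://user:pass@localhost/database");
if (PEAR::isError($db)) die("Could not connect to database");
$db->query("
    create table if not exists badrobots (
        id        int unsigned not null auto_increment primary key,
        ip        varchar(15)  not null,
        useragent varchar(255) not null,
        datetime  datetime     not null,
        index idx_ip (ip)
    )
");
?>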
Now the bad bots will visit this page and get their IP logged. Hurrah!
So now we want to actually ban our bad bots. This isn’t actually as simple as it sounds. Basically, we have three options:
1. Block them at the application level, in our PHP.
2. Block them at the web server level, with Apache rules (e.g. .htaccess Deny directives).
3. Block them at the firewall.
I’m going to discuss option #1 in this tutorial. It’s not the best option, but it is easily the simplest. You see, with option #1, our server is still accepting the request and firing up a PHP interpreter before the connection is rejected. We’ve also had to connect to a DB and do a read on it. However, the other two options won’t interface with the DB, so they require manually adding the rules or compiling them periodically. Worse, option #3 could end up with you completely unable to access your own server if it goes tits up. However, it is the only option that will protect your server from a monumental hammering.
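As a rough illustration of what ‘compiling them periodically’ might look like for option #2 (this is my sketch, not something from the original post), a cron-driven script could rewrite a set of Apache deny rules from the badrobots table:
<?php
// Sketch only: "compile" Apache 2.2-style deny rules (option #2) from the
// badrobots table. Run this from cron, not from the web. Note it overwrites
// the target file, so in practice you'd merge it with your existing rules.
require_once("DB.php");
$db = DB::connect("mysql://user:pass@localhost/database");
if (PEAR::isError($db)) die("Could not connect to database");

$ips = $db->getCol("select distinct ip from badrobots");
if (PEAR::isError($ips)) die("Could not read badrobots table");

$rules = "Order Allow,Deny\nAllow from all\n";
foreach ($ips as $ip) {
    // Only emit things that look like IPv4 addresses.
    if (preg_match('/^\d{1,3}(\.\d{1,3}){3}$/', $ip)) {
        $rules .= "Deny from $ip\n";
    }
}
// /var/www/html is an assumption; point this at your own document root.
file_put_contents("/var/www/html/.htaccess", $rules);
?>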
Anyway, banning the bots with #1 is dead easy. All you need to do is make sure the following bit of PHP code is executed at the start of every page on your site, as soon after you connect to your database as possible. My DB syntax might be different to yours, but as an experienced website operator I’m sure you can translate, right?
<?php
// connect to DB, etc
if ($db->getOne("select count(1) from badrobots where ip=?", Array($_SERVER['REMOTE_ADDR'])))
    die('You have been banned from this site for poor robot behaviour. If you think this is in error please contact the server administrator.');
?>
And there you have it! You might also want to log the banned robots’ subsequent accesses, but… I dunno, up to you.
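If you do want to log those hits, here’s a minimal sketch; the banned_hits table is hypothetical, not something from the post:
<?php
// Sketch: record each request made by an already-banned IP before rejecting it.
// The banned_hits table (same shape as badrobots, plus a url column) is hypothetical.
require_once("DB.php");
$db = DB::connect("mysql://user:pass@localhost/database");
if (PEAR::isError($db)) die("Could not connect to database");

if ($db->getOne("select count(1) from badrobots where ip=?",
        Array($_SERVER['REMOTE_ADDR']))) {
    $db->query("insert into banned_hits set ip=?, url=?, useragent=?, datetime=!",
        Array($_SERVER['REMOTE_ADDR'], $_SERVER['REQUEST_URI'],
              $_SERVER['HTTP_USER_AGENT'], "now()"));
    die('You have been banned from this site for poor robot behaviour. '
      . 'If you think this is in error please contact the server administrator.');
}
?>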
And that brings us to the end of our tutorial. I hope you enjoyed it! All comments, suggestions and errata to the usual place.
Congratulations to Richard Hearne for being the first to suggest a better way to store an IP in a MySQL database. However, he neglected to mention that ip2long returns the IP as a signed int and needs to be converted with sprintf. Johannes suggested my favourite method of using MySQL’s built-in INET_ATON and INET_NTOA functions.
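For completeness, a quick sketch of the two approaches mentioned in that footnote; both assume you’ve changed the ip column to an unsigned int:
<?php
// Sketch of both IP-storage approaches; assumes ip is now INT UNSIGNED.
require_once("DB.php");
$db = DB::connect("mysql://user:pass@localhost/database");
if (PEAR::isError($db)) die("Could not connect to database");

// 1) Convert in PHP: ip2long() returns a signed int, so use sprintf('%u', ...)
//    to get the unsigned representation before storing it.
$ip = sprintf('%u', ip2long($_SERVER['REMOTE_ADDR']));
$db->query("insert into badrobots set ip=?, useragent=?, datetime=!",
    Array($ip, $_SERVER['HTTP_USER_AGENT'], "now()"));

// 2) Or let MySQL do the conversion with INET_ATON() (and INET_NTOA() to read
//    it back). The ban check's where clause changes to ip=INET_ATON(?) to match.
$db->query("insert into badrobots set ip=INET_ATON(?), useragent=?, datetime=!",
    Array($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'], "now()"));
?>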
Hmmm…. free link…..
ip2long($_SERVER['REMOTE_ADDR']) might be a better way to store the IP. Or you could use MySQL INET_ATON().
Maybe there should be a blacklist of known bad robots? Something like the email blacklist programs? Something that checks a global website, where the bad robots get voted in, and members can download updates automatically.
Use the MySQL function INET_ATON to convert the IP to an unsigned int and store that: 4 bytes compared to 15. Other DBMSs like Postgres have special features for storing IPs/subnets.
Well, I presume that in the footnote you’re getting at the point that you should store the IP as a 32-bit unsigned int, taking up 4 bytes instead of a varchar(15) wasting 11 bytes per row.
For those who prefer to trap using just robots.txt, .htaccess and php (i.e. without needing a db), here’s an older-school method: http://www.webmasterworld.com/forum88/3104.htm
Adding a little email alert that lets you know when you’ve snared something is something I’ve always found handy.
And, to expand on NevDull’s explanation, you could use PHP’s ip2long() and long2ip() functions to convert the IP address to and from its integer form.
Not only is a varchar field less space efficient than an int, it will also harm the performance of any selects on that column (as in your checking query).
IP can be a double and stored/retrieved using inet_aton.
Select count(whatever) from baddies_tbl where ip=inet_aton('$REMOTE_ADDR')
(Damn cut and paste!)
So, how do you create the link and not get banned by Google?
Matt Cutts will do Ctrl-A, see the link and ban 🙂
There is another option for dealing with the bad bots: you can use Apache’s mod_security against them.
E.g. SecFilterSelective REMOTE_ADDR “^x.x.x.x” “log,redirect:http://x.x.x.x”
That way, every time they try to crawl your site, they’ll be redirected back on themselves, most likely getting a boatload of 404s.
Hey, if I visit
http://fusion.google.com/add?feedurl=http://www.davidnaylor.co.uk/badrobots.html
won’t Google get blocked?
Same with Google Translate etc.
And you can’t maintain a whitelist, because bots can request the files through Google.
Anatoly, surely Google doesn’t ban you for having invisible links to your own site?
Might be nice to issue a 401 or similar, too, when the getOne call comes back positive.
They often stop coming back when you do that enough, and that’s always a pleasant side effect.
And now I have ‘bad bots, bad bots, what you gonna do?’ stuck in my head. Thank *you* Mr Naylor! 😉
Kishor: I don’t think so – your visit will use your IP and Google will (surely) check the robots before they try and visit.
Ah, no
Look at
http://groups.google.co.in/robots.txt
(Disallow: /news?output=xhtml&)
And try translate on
http://groups.google.co.in/news?output=xhtml&
Like this:
http://translate.google.com/translate?u=http%3A%2F%2Fgroups.google.co.in%2Fnews%3Foutput%3Dxhtml%26&langpair=es%7Cen&hl=en&ie=UTF-8&oe=UTF-8&prev=%2Flanguage_tools
And google fetches that for you!
Kishor you’re right, I can confirm that behaviour.
I wonder if that’s completely Kosher. Should Google’s translate obey the rules for robots?
Okay mate now where is the load balancing info? So the spiders can be told when best to visit the site…
I will promote the crap out of it but give me some more widgets.
I think when the translator is translating, it’s no longer a bot. It does it at the user’s request.
But it can be argued that attackers will use Translate to crawl pages, and then it becomes a bot.
Now the right way to avoid it would be to do bot detection on Google’s side and block such abuse. (IMO)
We use a compiled code component that runs our session id cloaking. Basically we use this to 404 anything that isn’t 1) a user or 2) a known good bot. We use the RSS feed from http://www.user-agents.org/index.shtml to update our database every day. Obviously non-compliant bots aren’t going to record the user agent, so if we don’t know you (or if we do but we know you’re bad) then GET LOST DUDE!
Oooh, and one final caveat! Make sure you put the robots.txt entry in place a good while before you put the link up.
I don’t know what the delay needs to be, but I would go for 24 hours, as I would hate to trap a robot I want, like Googlebot.
Why? So that good bots aren’t happily crawling your pages between robots.txt fetches and go and grab the honeypot link. 🙁
richB: I wonder why a bad bot would tell you that it’s bad.
You can go further and feed randomised nonsense to bad bots. Instead of 404-ing them, lead them down a path of entirely spurious pages with links that lead to recursive pages. It chews up their CPU time, prevents the bot from crawling somewhere else productively, and consumes minuscule server resources. You can also stuff these pages with fake form submissions and CAPTCHAs, to trap the blog/email form spam bots.
We’ve done some experiments with generating randomised link trails for the bots to follow… You can certainly get some bots to follow the trail a long way, presumably until some recursion limit or stack limit is reached. Hmm, now there’s a nasty idea… could one get a bad bot to execute a stack attack, thereby creating a goodbot on a badbot net? 🙂
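A very rough sketch of the kind of randomised link trail described in the two comments above (entirely my own illustration; the /trap/ URL scheme and word list are made up, and a rewrite rule is assumed to route /trap/* back to this script):
<?php
// Sketch of a "tarpit" page that feeds already-banned bots an endless trail
// of randomised nonsense. The /trap/ URL scheme and word list are made up;
// a rewrite rule is assumed to route /trap/* requests back to this script.
$words = array("lorem", "ipsum", "widget", "banana", "quantum", "sprocket");

// A paragraph of nonsense for the bot to chew on.
$para = array();
for ($i = 0; $i < 50; $i++) {
    $para[] = $words[mt_rand(0, count($words) - 1)];
}
echo "<p>" . implode(" ", $para) . "</p>\n";

// Links that lead straight back into more randomly-named trap pages.
for ($i = 0; $i < 5; $i++) {
    $slug = $words[mt_rand(0, count($words) - 1)] . "-" . mt_rand(1000, 9999);
    echo "<a href=\"/trap/$slug\">$slug</a><br>\n";
}
?>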
Neat. See my (simple) version that incorporates an .htaccess file for comparison.
It’s at: http://seven-3-five.blogspot.com/2006/09/simple-php-based-bad-bot-trap_04.html
cheers
Thanks for the tips. From this day on I ban bad robots 😀
Has anyone made a WordPress plugin that does this with a 1 click install?