Being Evil

Someone was using SiteSucker on a demo site and was behaving badly, this site also hosts our redmine install on the same DB and started to slow down my work. To block we had a few options, iptables the IP, block the IP w/in apache (htaccess) or do the evil thing, just block based on their user-agent. I chose the latter.

RewriteEngine on
...
RewriteCond %{HTTP_USER_AGENT} ^SiteSucker2.3.1
RewriteRule ^(.*) /badbot.html

This is evil since the persons crawler will get out badbot.html page. The person will then use firefox/ie/whatever to browse to the page and look to see why it’s not working, but since it’s sending a differend user-agent, they will be allowed to browse. For anyone who knows how to configure a crawler it isn’t an issue changing the supplied agent, but then again that person will likely be able to control their crawler and not kill my web server. Here’s a snapshot of the logs showing the crawler pulling badbot (size 174) followed by a browse attempt from safari.

82.232.60.250 - - [09/Mar/2010:19:52:41 -0500] "GET /cgi-bin/isadg/viewitem.pl?item=14379 HTTP/1.1" 200 174 "http://narademo.umiacs.umd.edu/cgi-bin/isadg/viewseries.pl?seriesid=1054" "SiteSucker2.3.1 CFNetwork/438.14 Darwin/9.8.0 (i386) (MacBook1%2C1)"
82.232.60.250 - - [09/Mar/2010:19:52:42 -0500] "GET /cgi-bin/isadg/viewitem.pl?item=14377 HTTP/1.1" 200 174 "http://narademo.umiacs.umd.edu/cgi-bin/isadg/viewseries.pl?seriesid=1054" "SiteSucker2.3.1 CFNetwork/438.14 Darwin/9.8.0 (i386) (MacBook1%2C1)"
82.232.60.250 - - [10/Mar/2010:11:03:13 -0500] "GET /cgi-bin/isadg/viewrg.pl?rgid=44 HTTP/1.1" 200 2286 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; fr-fr) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10"