Eric Pierce

Reputation: 187

Prevent spider from racking up Tomcat sessions

I've got a fairly new website (~3 weeks old) running on Tomcat w/so far pretty low numbers of visitors.

For the last week I've noticed 1,000+ active sessions, and checking Tomcat's localhost_access* logs shows that the overwhelming majority come from IPs in the 119.63.196.* range, which all appear to belong to Baidu Japan.

Here's a small example from the logs of them hitting the front page.

119.63.196.107 - - [24/Aug/2011:07:02:46 +0000] "GET /;jsessionid=94085F76780ACFD96C8109A29446288D HTTP/1.1" 200 10311
119.63.196.44 - - [24/Aug/2011:07:03:21 +0000] "GET /;jsessionid=943133C77BB1756CF11592115BA81725 HTTP/1.1" 200 10333
119.63.196.39 - - [24/Aug/2011:07:03:56 +0000] "GET /;jsessionid=9B4384BDECF540C8628467F7AB4AB463 HTTP/1.1" 200 10311
119.63.196.19 - - [24/Aug/2011:07:04:31 +0000] "GET /;jsessionid=A0B555C3A18377D993B97D4491DD1012 HTTP/1.1" 200 10311
119.63.196.45 - - [24/Aug/2011:07:05:10 +0000] "GET /;jsessionid=A3782FA61558BF11C4D5AC4F3DD1EC86 HTTP/1.1" 200 10311
119.63.196.23 - - [24/Aug/2011:07:05:53 +0000] "GET /;jsessionid=A3AF84EF13F21492EB47FAB001A1C2E5 HTTP/1.1" 200 10311
119.63.196.120 - - [24/Aug/2011:07:06:31 +0000] "GET /;jsessionid=A7C490CEC2C7F2969772AC4050C6D761 HTTP/1.1" 200 10311
119.63.196.108 - - [24/Aug/2011:07:07:07 +0000] "GET /;jsessionid=A7F769D354CB37E99843292D650D6367 HTTP/1.1" 200 10311

No one individual IP is clobbering the site, but the collective requests from this IP range are racking up active sessions. And they seem to do it in a somewhat coordinated fashion: one page at a time gets targeted and receives ~30 hits from ~30 different IPs in the 119.63.196.* range over a 20 minute period. Then it moves on to another page... and this goes on pretty much all day.

I do have inactive session timeout set pretty high (720 minutes), and maybe I need to bring that number down a lot. Maybe Baidu Japan is doing frequent checks because it thinks the page has changed due to a change in the link (i.e., the jsessionid is always different)?

Thanks for reading. I welcome any/all suggestions!

Eric

Upvotes: 3

Views: 1630

Answers (2)

Codo

Reputation: 78835

Tomcat 7 can prevent the creation of thousands of sessions if you configure the CrawlerSessionManagerValve. It is briefly described in the Tomcat valve documentation.
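
The valve decides which requests are crawlers by matching the whole User-Agent header against the regular expression in its crawlerUserAgents attribute. The sketch below is only a way to test a pattern before putting it into server.xml. The default pattern shown is the one I recall from the Tomcat 7 docs (verify it for your version), the "|.*Baiduspider.*" extension is a hypothetical addition, and the Baidu User-Agent string is an assumption, so check your own logs for the real value.

```java
import java.util.regex.Pattern;

class CrawlerAgentCheck {
    // Default crawlerUserAgents value as recalled from the Tomcat 7 docs
    // (assumption: verify against your Tomcat version).
    static final String DEFAULT_AGENTS =
            ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*";

    // Hypothetical extension that also covers Baidu's crawler.
    static final String EXTENDED_AGENTS = DEFAULT_AGENTS + "|.*Baiduspider.*";

    // Assumed User-Agent header; the real one may differ.
    static final String BAIDU_UA = "Baiduspider+(+http://www.baidu.jp/spider/)";

    // The entire header must match, hence the leading and trailing ".*".
    static boolean isCrawler(String regex, String userAgent) {
        return Pattern.compile(regex).matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        System.out.println("default:  " + isCrawler(DEFAULT_AGENTS, BAIDU_UA));
        System.out.println("extended: " + isCrawler(EXTENDED_AGENTS, BAIDU_UA));
    }
}
```

With these assumed values the default pattern does not match the Baidu header, while the extended one does, so the valve would probably need a custom crawlerUserAgents value to help in this case.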

In addition, you might want to prevent Tomcat from putting the session ID into the URL, because it would otherwise show up in search engines. Again starting with Tomcat 7, you can configure this in web.xml:

<session-config>
   <tracking-mode>COOKIE</tracking-mode>
</session-config>

Upvotes: 5

BalusC

Reputation: 1108722

Spiders indeed usually do not maintain a session with the website. That's normal. You should ask yourself whether your website really needs to create a session on a plain GET request. Sessions are usually used to store the logged-in user, their preferences such as locale, etcetera. But spiders do not log in and do not submit any forms. Why would you create the session then?

There are basically 2 ways to solve this "problem":

  1. Fix your website so that it doesn't create a session until there is a need for one. Create it only once a user logs in or creates/updates a sessionwide preference/variable. How exactly to do it depends on the APIs/frameworks used by your website.

  2. Block (specific) spiders by robots.txt.

Note that session creation and the session itself are not particularly expensive. An empty session object should not allocate more than 1KB. I do find your session timeout too high, however. Even the default of 30 minutes is fairly long. As a completely different alternative, you could set it to 5 minutes or so and introduce a JS/Ajax "heartbeat" which sends a poll request with the session cookie every timeout-1 minutes whenever the user is active on the document (click, keypress, etc). This would keep the session alive on the server. You can find an example in this answer.
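
To put rough numbers on the timeout point: if every crawler request creates a new session, the steady-state session count is about the arrival rate multiplied by the timeout. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the crawl rate stays at the level of the log excerpt in the question (8 requests between 07:02:46 and 07:07:07, so 7 gaps over 261 seconds) and uses the 1KB per empty session figure as an upper bound. The class and method names are made up for illustration.

```java
class SessionEstimate {
    // Rough upper bound for an empty session, taken from the text above.
    static final long KB_PER_SESSION = 1;

    // Steady-state session count: session lifetime divided by the gap
    // between newly created sessions (arrival rate times timeout).
    static long activeSessions(double secondsBetweenNewSessions, int timeoutMinutes) {
        return Math.round(timeoutMinutes * 60.0 / secondsBetweenNewSessions);
    }

    public static void main(String[] args) {
        // Assumption: the crawl rate from the log excerpt holds all day.
        double gapSeconds = 261.0 / 7;
        for (int timeout : new int[] {720, 30, 5}) {
            long sessions = activeSessions(gapSeconds, timeout);
            System.out.println(timeout + " min timeout: ~" + sessions
                    + " sessions, ~" + sessions * KB_PER_SESSION + " KB");
        }
    }
}
```

Under these assumptions a 720 minute timeout gives roughly 1,159 live sessions, which lines up with the 1,000+ reported in the question, while 30 minutes gives about 48 and 5 minutes about 8. Even the largest figure is only around 1 MB of empty sessions.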

Upvotes: 1
