Reputation: 4598
Pretty sure this question counts as blasphemy to most Web 2.0 proponents, but I do think there are times when you might not want pieces of your site easily ripped off into someone else's arbitrary web aggregator. At least enough that they'd have to be arsed to do it by hand if they really wanted it.
My idea was to make a script that positioned text nodes by absolute coordinates in the order they'd appear normally within their respective paragraphs, but then stored those text nodes in a random, jumbled up order in the DOM. Of course, getting a system like that to work properly (proper text wrap, alignment, styling, etc.) seems almost akin to writing my own document renderer from scratch.
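Here's a minimal sketch of that idea (the fragments, coordinates, and the container element are all made up for illustration): the spans render in reading order because of their absolute coordinates, but a scraper walking the DOM in document order gets them shuffled.

    // Fragments in reading order, each with the x-offset it should render at.
    var fragments = [
      { text: 'The quick ', x: 0 },
      { text: 'brown fox ', x: 70 },
      { text: 'jumps over', x: 145 }
    ];

    // Fisher-Yates shuffle so DOM order no longer matches reading order.
    for (var i = fragments.length - 1; i > 0; i--) {
      var j = Math.floor(Math.random() * (i + 1));
      var tmp = fragments[i];
      fragments[i] = fragments[j];
      fragments[j] = tmp;
    }

    var container = document.getElementById('scrambled'); // assumed element
    container.style.position = 'relative';
    for (var k = 0; k < fragments.length; k++) {
      var span = document.createElement('span');
      span.appendChild(document.createTextNode(fragments[k].text));
      span.style.position = 'absolute';
      span.style.left = fragments[k].x + 'px';
      span.style.top = '0';
      container.appendChild(span);
    }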
I was also thinking of combining that with a CAPTCHA-like thing to muss up the text in subtle ways so as to hinder screen scrapers that could simply look at snapshots and discern letters or whatnot. But that's probably overthinking it.
Hmm. Has anyone yet devised any good methods for doing something like this?
Upvotes: 3
Views: 998
Reputation: 806
Few of these techniques will stop a determined scraper. Alexa-style garbage-HTML/CSS masking is easy to get around (just parse the CSS), and so is AJAX/JavaScript DOM insertion, although form authenticity tokens make the latter harder.
I've found providing an official API to be the best deterrent :)
Barring that, rendering text into an image is a good way to stop the casual scraper (though a determined one can still OCR it).
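For what it's worth, here's a rough server-side sketch, assuming Node with the node-canvas package ("canvas" on npm); the protected string never appears in the HTML, which just embeds an img tag pointing at this output.

    var createCanvas = require('canvas').createCanvas;

    // Render a string to a PNG buffer; serve it with Content-Type: image/png.
    function textToPng(text) {
      var canvas = createCanvas(300, 40);
      var ctx = canvas.getContext('2d');
      ctx.fillStyle = '#fff';
      ctx.fillRect(0, 0, 300, 40);   // white background
      ctx.fillStyle = '#000';
      ctx.font = '16px sans-serif';
      ctx.fillText(text, 10, 25);    // draw the protected text
      return canvas.toBuffer('image/png');
    }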
YouTube also uses javascript obfuscation that makes AJAX reverse engineering more difficult
Upvotes: 0
Reputation: 4728
To understand this, it's best to attempt to scrape a few sites yourself. I have scraped some pretty challenging ones, including banking sites, and I've seen many attempts at making scraping difficult (encryption, cookies, etc.). At the end of the day, the best defense is unpredictable markup. Scrapers rely most heavily on being able to find "patterns" in the markup; the moment the pattern changes, the scraping logic fails. Scrapers are notoriously brittle and break easily.
My suggestion: randomly inject non-visible markup into your pages, particularly around content that is likely to be interesting, as sketched below. Do anything you can think of to make your markup look different to a scraper each time the page is served.
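Something along these lines on the server (the function and the class-name scheme are just illustrative): each response wraps the interesting content in a random number of dummy elements with random class names, so a selector or regex written against one response breaks on the next.

    // Wrap a piece of HTML in 1-3 dummy spans with throwaway class names
    // and append an invisible decoy node.
    function obfuscate(html) {
      function randomClass() {
        return 'c' + Math.random().toString(36).slice(2, 8);
      }
      var out = html;
      var wrappers = 1 + Math.floor(Math.random() * 3);
      for (var i = 0; i < wrappers; i++) {
        out = '<span class="' + randomClass() + '">' + out + '</span>';
      }
      return out + '<i class="' + randomClass() + '" style="display:none"></i>';
    }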
Upvotes: 1
Reputation:
Alexa.com does some wacky stuff to prevent scraping. Go here and look at the traffic rank number: http://www.alexa.com/data/details/traffic_details/teenormous.com
Upvotes: 0
Reputation: 5698
I've seen a TV guide that decrypts its listings with JavaScript on the client side. It wouldn't stop a determined scraper, but it would stop most casual scripting.
All the textual TV entries look like ps10825('4VUknMERbnt0OAP3klgpmjs....abd26'), where ps10825 is simply a function that calls their decrypt routine with a key of ps10825. Obviously, the key is generated anew each time.
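A loose reconstruction of that pattern (the names, and the toy XOR standing in for whatever cipher they actually use, are purely illustrative):

    // The server generates a fresh function name per page and uses it as
    // the decryption key; the page then emits calls like ps10825('...').
    function makeDecryptor(key) {
      return function (ciphertext) {
        var plain = '';
        for (var i = 0; i < ciphertext.length; i++) {
          // Toy XOR cipher, just to show the shape of the technique.
          plain += String.fromCharCode(
            ciphertext.charCodeAt(i) ^ key.charCodeAt(i % key.length)
          );
        }
        document.write(plain);
      };
    }
    var ps10825 = makeDecryptor('ps10825'); // regenerated for every page load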
In this case, I think it's quite adequate to stop 99% of people from using Greasemonkey or even wget scripts to download their TV guide without seeing all of the adverts.
Upvotes: 3
Reputation: 7435
Please don't use absolute positioning to reassemble a scrambled page. It won't work for mobile devices, screen readers for the visually impaired, or search engines.
Please don't add captcha. It will just drive people away before they ever see your site.
Any solution you come up with will be anti-web. The Internet is about sharing, and you have to take the bad with the good.
If you must do something, you might want to just use Flash. I haven't seen link farmers grabbing Flash content, yet. But for all the reasons stated in the first paragraph, Flash is anti-web.
Upvotes: 4
Reputation: 86805
Consider that anything a scraper can't read, search engines can't read either. That being said, you could inject content into your document via JavaScript after the page has loaded.
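One sketch of that (the element id and the split-up string are made up for illustration): the protected text lives only inside the script, assembled at load time, rather than in the HTML the server sends.

    window.addEventListener('DOMContentLoaded', function () {
      // Split the value so it also dodges naive regex-based harvesters.
      var parts = ['555', '-', '0100'];
      var el = document.getElementById('phone'); // assumed element
      el.appendChild(document.createTextNode(parts.join('')));
    });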
Upvotes: 6
Reputation: 53310
Your ideas would probably break screen readers as well, so you should check accessibility requirements/legislation before messing with the ordering.
Upvotes: 3
Reputation: 27670
Just load all your HTML via AJAX calls; to most screen scrapers, which only read the initial response, the content will never "appear" to be there at all.
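A minimal sketch, assuming a /page-body endpoint (made up here) that returns the real markup; the initial document is just an empty shell.

    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/page-body');
    xhr.onload = function () {
      if (xhr.status === 200) {
        // Fill the empty shell with the real content after the fact.
        document.getElementById('shell').innerHTML = xhr.responseText;
      }
    };
    xhr.send();

Of course, anything that executes JavaScript (a headless browser, say) still sees the full page.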
Upvotes: -1