helloworld1990

Reputation: 135

How can I pretend to be in a certain country during web scraping?

I want to scrape a website, but it should look like I am browsing from a specific country (let's say the USA for this example), so that my results are valid for that region.

I am working in Python (Scrapy), and for scraping I am using rotating user agents (see: https://pypi.org/project/scrapy-fake-useragent-fix/).

The rotating user agents work fine for my scraping, but can I combine them with something on the request side to pretend that I am in a specific country?

If there are any possibilities (in Scrapy/Python), please let me know. Much appreciated!

Example of how I use the user agents in my script:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
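For reference, a minimal sketch of a spider that works with this setup and logs the User-Agent actually sent, to confirm the rotation is applied (the target URL is just a placeholder):

import scrapy

class UACheckSpider(scrapy.Spider):
    name = "ua_check"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # The User-Agent header that was actually sent for this request.
        ua = response.request.headers.get("User-Agent")
        self.logger.info("Used User-Agent: %s", ua)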

Upvotes: 0

Views: 1511

Answers (3)

Ram Sankhavaram

Reputation: 1228

Hello @helloworld1990,

Based on your requirement: if you want to make each request from a different IP, i.e. use IP rotation (useful when the site detects and blocks you after a certain number of requests), then go for a proxy provider. There are many such providers; you just need to google them.

If that is not the case, then for short-term use you can try Tor exit IPs, but Tor IPs are well known and generally blocked. Otherwise, you can buy a few static IPs from a proxy provider and make the requests through those.

if (uniqueIpForEachRequestFromDifferentGeoLocations) {
    // go for proxy providers - IP rotation
} else {
    if (shortTermUse) {
        // go for Tor nodes
    } else {
        // go for static IPs
    }
}
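Here is a minimal sketch of the "proxy providers / static IPs" branch in Scrapy, assuming you already have a list of US-based proxy endpoints from a provider (the addresses below are placeholders, not real proxies). Scrapy's built-in HttpProxyMiddleware picks up the proxy set in request.meta:

import random
import scrapy

# Placeholder endpoints from a hypothetical proxy provider.
US_PROXIES = [
    "http://user:pass@us-proxy-1.example.com:8000",
    "http://user:pass@us-proxy-2.example.com:8000",
]

class GeoSpider(scrapy.Spider):
    name = "geo_spider"
    start_urls = ["https://example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # HttpProxyMiddleware reads request.meta['proxy'], so each
            # request exits through one of the US proxies.
            yield scrapy.Request(
                url,
                meta={"proxy": random.choice(US_PROXIES)},
                callback=self.parse,
            )

    def parse(self, response):
        self.logger.info("Fetched %s via a rotated US proxy", response.url)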

Cheers! Hope this helps.

Upvotes: 0

Raphael

Reputation: 1801

To pretend to be in a certain country, you need an IP address from that country. Unfortunately, this is nothing you can configure just through Scrapy settings. But you could use a proxy service like Crawlera:

https://support.scrapinghub.com/support/solutions/articles/22000188398-restricting-crawlera-ips-to-a-specific-region

Note: unfortunately this service is not free, and the cheapest plan is about 25 EUR. There are many cheaper services available. The reason Crawlera is expensive is that it offers ban detection and only serves good IPs for your chosen domain. I've found it worth the cost on Amazon and Google, though for less demanding domains a cheaper service with unlimited usage would be more suitable.
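A minimal sketch of what the Scrapy side could look like in settings.py, assuming the scrapy-crawlera plugin is installed (pip install scrapy-crawlera) and that the region restriction itself is configured on the Crawlera account side as described in the linked article; the API key is a placeholder:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    # Route all requests through Crawlera's proxy pool.
    'scrapy_crawlera.CrawleraMiddleware': 610,
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-crawlera-api-key>'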

Upvotes: 1

user2902481

Reputation:

You can do this using Selenium (I don't know about Scrapy). First, tell the bot to go to this site: Proxy Site

Then enter your target site in its search box and scrape the result.
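A rough sketch of that idea; the web proxy URL and the element locator are hypothetical placeholders, so adapt them to whichever proxy site you actually use:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
try:
    # Open the web proxy page (placeholder URL).
    driver.get("https://www.example-web-proxy.com/")

    # Type the target site into the proxy's URL box and submit
    # (the "url" name attribute is an assumption about that page).
    box = driver.find_element(By.NAME, "url")
    box.send_keys("https://example.com/")
    box.send_keys(Keys.ENTER)

    # The proxied page can now be scraped from driver.page_source.
    html = driver.page_source
finally:
    driver.quit()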

Upvotes: 0
