Reputation: 535
The following code works on my computer to scrape data from an Instagram account. When I try to use it on a VPS, I'm redirected to the Instagram login page, so the script doesn't work.
Why doesn't Instagram react the same way on a computer and on a server?
It's the same with wget: on my computer I get the profile page, on the server I'm redirected to the login page.
import re

import requests


class InstagramScraper:
    """
    Scraper of Instagram profile info.
    """

    def __init__(self, session: requests.Session, instagram_account_name: str):
        self.session = session
        self._account_name = self.clean_account_name(instagram_account_name)
        self.load_data()

    def load_data(self):
        #print(self._account_name)
        response = self.session.get("https://www.instagram.com/{account_name}/".format(account_name=self._account_name))
        #print(response)
        #print(response.text)
        # Each value is pulled out of the JSON embedded in the profile page.
        # If the page is the login page instead, re.search() returns None
        # and .group(1) raises AttributeError.
        publications_regex = r'"edge_owner_to_timeline_media":{"count":(\d+),'
        self._publications = int(re.search(publications_regex, response.text).group(1))
        followers_regex = r'"edge_followed_by":{"count":(\d+)'
        self._followers = int(re.search(followers_regex, response.text).group(1))
        # title_regex = r'"@type":".*","name":"(.*)",'
        title_regex = r'"full_name":"(.*)",'
        self._title = re.search(title_regex, response.text).group(1)
        # The greedy match can run past the name; keep only the part before the next quote.
        self._title = self._title.split('"')[0]
        following_regex = r'"edge_follow":{"count":(\d+)}'
        self._following = int(re.search(following_regex, response.text).group(1))

    def clean_account_name(self, value) -> str:
        """
        Return the account name without the url address.
        """
        found = re.search("https://www.instagram.com/(.*)/", value)
        if found:
            return found.group(1)
        return value

    @property
    def publications(self) -> int:
        """
        Number of publications by this account.
        """
        return self._publications

    @property
    def followers(self) -> int:
        """
        Number of followers of this account.
        """
        return self._followers

    @property
    def title(self) -> str:
        """
        Name of the Instagram profile.
        """
        return self._title

    @property
    def account(self) -> str:
        """
        Account name used on Instagram.
        """
        return self._account_name

    @property
    def following(self) -> int:
        """
        Number of accounts this profile is following.
        """
        return self._following

    def __str__(self) -> str:
        return str({
            'Account': self.account,
            'Followers': self.followers,
            'Publications': self.publications,
            'Following': self.following,
            'Title': self.title,
        })


if __name__ == "__main__":
    with requests.session() as session:
        scraper = InstagramScraper(session, "https://www.instagram.com/ksc_lokeren/")
        print(scraper)
Upvotes: 3
Views: 5653
Reputation: 801
You see the login prompt because Instagram is blocking you: it detects that you are not browsing the website manually.
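To confirm from the script that the response really is the login page rather than the profile, a check along these lines may help. It is only a sketch: the helper names is_login_redirect and ensure_profile_page are made up for illustration, and it assumes the redirect target is the /accounts/login/ path on instagram.com, which could change.

```python
from urllib.parse import urlparse

# Assumption: Instagram sends blocked clients to this path.
LOGIN_PATH = "/accounts/login"


def is_login_redirect(final_url: str) -> bool:
    """Return True if the final URL of a request points at the login page."""
    return urlparse(final_url).path.startswith(LOGIN_PATH)


def ensure_profile_page(response) -> None:
    """Raise a clear error instead of letting the regex parsing fail later.

    `response` is expected to be a requests.Response; its `url` attribute
    holds the final URL after any redirects were followed.
    """
    if is_login_redirect(response.url):
        raise RuntimeError("Redirected to the login page: " + response.url)
```

Calling ensure_profile_page(response) right after session.get(...) in load_data would turn the confusing AttributeError from the regex code into an explicit message.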
If you want to extract Instagram profile info, you have to rely on a scraping API, since Instagram will block you very quickly.
Here is a good tutorial on scraping user profile data and posts, including pagination, using a scraping API: https://scrapingfish.com/blog/scraping-instagram
Upvotes: 1
Reputation: 212
It might be because you are logged in with your own credentials on your computer. furas mentioned a blacklist, but if you've never run it on this server before, I doubt that's the cause.
What I was able to do to avoid that is use a headless browser, which simulates a normal browser and lets you navigate websites. You simulate a login with your credentials, then retrieve the csrftoken and sessionid from the cookies and close the browser.
I did mine in JavaScript so I can't really show it to you, but the logic is this:
1. Create your headless browser.
2. Set the 'accept-language' header of your request to 'en-US'.
3. Navigate to https://www.instagram.com/accounts/login/ and wait until idle.
4. Emulate the sign-in with your credentials. Look for:
   'input[name="password"]' for the password
   'input[name="username"]' for the username
   'button[type="submit"]' for the login button
5. Wait until idle.
6. Get all cookies and retrieve the csrftoken and sessionid.
7. Close the headless browser.
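For readers working in Python, the steps above could be sketched roughly as follows. This is not the code from this answer (that was JavaScript): it assumes Playwright for Python as the headless browser, the function names pick_cookies and login_and_get_cookies are made up for illustration, and the selectors are the ones listed above, which Instagram may have changed since.

```python
from typing import Dict, Iterable, List


def pick_cookies(cookies: List[dict], names: Iterable[str]) -> Dict[str, str]:
    """Return {name: value} for the requested cookie names that are present."""
    wanted = set(names)
    return {c["name"]: c["value"] for c in cookies if c["name"] in wanted}


def login_and_get_cookies(username: str, password: str) -> Dict[str, str]:
    """Log in with a headless browser and return the csrftoken and sessionid."""
    # Imported here so pick_cookies stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Create the headless browser with the 'accept-language' header set.
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            extra_http_headers={"Accept-Language": "en-US"}
        )
        page = context.new_page()

        # Open the login page and wait until the network is idle.
        page.goto(
            "https://www.instagram.com/accounts/login/",
            wait_until="networkidle",
        )

        # Emulate the sign-in using the selectors from the steps above.
        page.fill('input[name="username"]', username)
        page.fill('input[name="password"]', password)
        page.click('button[type="submit"]')
        page.wait_for_load_state("networkidle")

        # Read all cookies, then close the browser.
        cookies = context.cookies()
        browser.close()

    return pick_cookies(cookies, ("csrftoken", "sessionid"))
```

If the login did not succeed (for example because of a challenge page), the returned dict will simply lack sessionid, so the caller should check for both keys before using it.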
Then, when making any request to https://www.instagram.com/{account_name}/, don't forget to set the csrftoken and sessionid cookies in your request header. The cookies expire after a while, at which point you'll need to repeat the login.
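With the requests session from the question, reusing those two cookies might look something like the sketch below. The helper names apply_login_cookies and fetch_profile are hypothetical, and the ".instagram.com" cookie domain is an assumption. The cookies are placed in the session's cookie jar, so requests sends them in the Cookie header on every request, including after redirects.

```python
def apply_login_cookies(session, csrftoken: str, sessionid: str,
                        domain: str = ".instagram.com") -> None:
    """Attach the login cookies to a requests.Session.

    `session` is expected to be a requests.Session; `domain` is an
    assumed cookie domain, adjust it if the browser reported another one.
    """
    session.headers["Accept-Language"] = "en-US"
    session.cookies.set("csrftoken", csrftoken, domain=domain)
    session.cookies.set("sessionid", sessionid, domain=domain)


def fetch_profile(session, account_name: str):
    """Request the profile page with whatever cookies the session holds."""
    url = "https://www.instagram.com/{account_name}/".format(
        account_name=account_name
    )
    return session.get(url)
```

A typical use would be to create session = requests.Session(), call apply_login_cookies(session, csrftoken, sessionid) with the values taken from the browser, and then pass that session to InstagramScraper. When the profile request starts redirecting to the login page again, the cookies have most likely expired.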
Upvotes: 1