Reputation: 3885
I'm trying to make an Instagram scraper with BeautifulSoup. I just want to get the name of the profile (I'm using Jennifer Lopez's profile). This is the code I have:
import requests
from bs4 import BeautifulSoup
instagram_url = "https://www.instagram.com"
username = "jlo"
profile = instagram_url + "/" + username
response = requests.get(profile)
print(response.text)
if response.ok:
    html = response.text
    bs_html = BeautifulSoup(html)
    name = bs_html('#react-root > section > main > div > header > section > div.-vDIg > h1')
    print(name) #this should be Jennifer Lopez
The code works up to print(response.text), and then something goes wrong in the if statement. This is the warning that I get:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml").
And I do not get the name.
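For reference, a minimal sketch of the same request with the parser passed explicitly and .select() used for the CSS selector (calling the soup object directly treats the string as a tag name, not a selector) would look roughly like this; it silences the warning, but the selector can still come back empty because the page is rendered by JavaScript:
import requests
from bs4 import BeautifulSoup

# Sketch only: explicit parser plus .select() for the CSS selector
response = requests.get("https://www.instagram.com/jlo")
if response.ok:
    bs_html = BeautifulSoup(response.text, "html.parser")
    matches = bs_html.select("#react-root > section > main > div > header > section > div.-vDIg > h1")
    print(matches)  # likely [] because the h1 is built client-side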
Do you know what the problem is? I have also tried another approach: downloading the page and then using .find on the saved file. That works great (for every profile), but when I try to do the same thing with the live URL, it does not work. Is there a way to do this without using Selenium?
from urllib.request import urlopen
from bs4 import BeautifulSoup
#this works
with open('Jennifer.html', encoding='utf-8') as html:
    bs = BeautifulSoup(html, 'lxml')
    name = bs.find('h1', class_='rhpdm')
    name = str(name).split(">")[1].split("<")[0]
    print(name)
#this does not work
html = urlopen('https://www.instagram.com/jlo/')
bs = BeautifulSoup(html, 'lxml')
name = bs.find('h1', class_='rhpdm')
print(name)
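My guess is that the live page simply does not contain the rendered h1 with class rhpdm, because Instagram builds that markup with JavaScript after the page loads, and urlopen/requests only see the initial HTML. A quick sanity check along these lines (my own sketch, not one of the attempts above) shows whether the class is present in the raw response at all:
import requests

# Sketch only: check whether the rendered h1 class exists in the raw (pre-JavaScript) HTML
raw_html = requests.get('https://www.instagram.com/jlo/').text
print('rhpdm' in raw_html)               # likely False: the markup is built client-side
print('window._sharedData' in raw_html)  # Instagram typically embeds the profile data as JSON here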
Upvotes: 1
Views: 598
Reputation: 33384
Script using Selenium (Chrome driver)
You can download a compatible ChromeDriver from this link. Check your Chrome browser version and download the matching ChromeDriver version from that link.
from bs4 import BeautifulSoup
from selenium import webdriver
instagram_url = "https://www.instagram.com"
username = "jlo"
profile = instagram_url + "/" + username
# Run Chrome headless so no browser window opens
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(r'D:\chromedriver.exe', chrome_options=chrome_options)
driver.get(profile)
html = driver.page_source
driver.close()
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('.rhpdm').text)
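If the ChromeDriver version matches your installed Chrome, this should print the profile name (Jennifer Lopez for the jlo profile). On some Windows setups, older headless Chrome builds also need chrome_options.add_argument('--disable-gpu').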
Upvotes: 2
Reputation: 1373
Here you go! You can do it like this.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
binary = r'C:\Program Files\Mozilla Firefox\firefox.exe' #this should be the same if you're using Windows
options = Options()
options.set_headless(headless=True)
options.binary = binary
cap = DesiredCapabilities().FIREFOX
cap["marionette"] = True #optional
driver = webdriver.Firefox(firefox_options=options, capabilities=cap, executable_path=r'Your Path') #put your geckodriver path here
#The code above stays the same for most scraping jobs.
#Below is the part you change for a particular page.
instagram_url = "https://www.instagram.com"
username = "jlo"
profile = instagram_url + "/" + username
driver.get(profile)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for x in soup.find_all('h1', {'class': 'rhpdm'}):
    print(x.text.strip())
driver.quit()
Instructions for downloading geckodriver are here.
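Alternatively, if geckodriver is on your PATH, you can leave out the executable_path argument and Selenium will find it for you.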
Upvotes: 0