taga

Reputation: 3885

Scrape Instagram names with BeautifulSoup in Python

I'm trying to make an Instagram scraper with BeautifulSoup. I just want to get the name of the profile (I'm using Jennifer Lopez's profile). This is the code that I have:

import requests
from bs4 import BeautifulSoup


instagram_url = "https://www.instagram.com"
username = "jlo"

profile = instagram_url + "/" + username

response = requests.get(profile)
print(response.text)

if response.ok:
    html = response.text
    bs_html = BeautifulSoup(html)
    name = bs_html('#react-root > section > main > div > header > section > div.-vDIg > h1')
    print(name) #this should be Jennifer Lopez

The code works up to print(response.text), but the code inside the if statement does not give the name.

This is the warning that I get:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml").

And I do not get the name.

Do you know what the problem is? I have also tried the approach below: downloading the page first and then using .find on the saved file, which works great (it works for every profile), but when I try the same thing with the live link, it does not work.

Is there a way to do this without using Selenium?

from urllib.request import urlopen
from bs4 import BeautifulSoup

#this works

with open('Jennifer.html', encoding='utf-8') as html:
    bs = BeautifulSoup(html, 'lxml')

name = bs.find('h1', class_='rhpdm')
name = str(name).split(">")[1].split("<")[0]
print(name)


#this does not work

html = urlopen('https://www.instagram.com/jlo/')
bs = BeautifulSoup(html, 'lxml')

name = bs.find('h1', class_='rhpdm')
print(name)
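The name is missing from the live HTML because Instagram renders the profile header with JavaScript, so the h1 never appears in the raw response; the static HTML only carries the data as embedded JSON. A minimal sketch of pulling the name out without Selenium, assuming the historical window._sharedData layout (Instagram has changed this structure over time, and the live page may now require login or extra request headers):

```python
import json
import re

# Sample of the static HTML Instagram historically served (assumption: the
# exact JSON layout has changed over time and may now require login).
sample_html = '''
<html><body>
<script type="text/javascript">window._sharedData = {"entry_data":
{"ProfilePage": [{"graphql": {"user": {"full_name": "Jennifer Lopez"}}}]}};</script>
</body></html>
'''

# Grab the JSON object assigned to window._sharedData.
match = re.search(r'window\._sharedData\s*=\s*({.*?});', sample_html, re.DOTALL)
data = json.loads(match.group(1))

# Walk the historical path to the profile's display name.
name = data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["full_name"]
print(name)  # Jennifer Lopez
```

On the live page you would pass response.text from requests in place of sample_html.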

Upvotes: 1

Views: 598

Answers (2)

KunduK

Reputation: 33384

A script using the Selenium Chrome driver. Check your Chrome browser version and download the compatible ChromeDriver version from this link.

from bs4 import BeautifulSoup
from selenium import webdriver

instagram_url = "https://www.instagram.com"
username = "jlo"
profile = instagram_url + "/" + username
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(r'D:\chromedriver.exe', chrome_options=chrome_options)
driver.get(profile)
html = driver.page_source
driver.close()
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('.rhpdm').text)

Upvotes: 2

Kartikeya Sharma

Reputation: 1373

Here you go! You can do it like this.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


binary = r'C:\Program Files\Mozilla Firefox\firefox.exe'  # this should be the same if you are using Windows
options = Options()
options.set_headless(headless=True)
options.binary = binary
cap = DesiredCapabilities().FIREFOX
cap["marionette"] = True #optional
driver = webdriver.Firefox(firefox_options=options, capabilities=cap, executable_path=r'Your Path') #put your geckodriver path here

#Above code should be the same for most of the time when you scrape.
#Below is the place where you will be making changes

instagram_url = "https://www.instagram.com"
username = "jlo"
profile = instagram_url + "/" + username

driver.get(profile)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for x in soup.findAll('h1', {'class': 'rhpdm'}):
    print(x.text.strip())
driver.quit()

Instructions for downloading geckodriver are here.

Upvotes: 0
