Onlyfood

Reputation: 137

lxml returned me a list but it's empty

I was trying to make a list of the top 1000 Instagrammers' accounts from this website: https://hypeauditor.com/top-instagram/. The list that lxml returns is empty, for both lxml.html and lxml.etree.

I tried deleting tbody, deleting text(), and using a higher-level XPath, but they all failed. What is worth noticing is that with the higher-level XPath it did return something, but it was nothing but '\n'.

I first tried lxml.etree:

import requests
from lxml import etree

market_url = 'https://hypeauditor.com/top-instagram/'
r_market = requests.get(market_url)
s_market = etree.HTML(r_market.text)  # etree.HTML needs the markup string, not the Response object
file_market = s_market.xpath('//*[@id="bloggers-top-table"]/tr[1]/td[3]/a/text()')

Then I also tried lxml.html:

from lxml import html

tree = html.fromstring(r_market.content)
result = tree.xpath('//*[@id="bloggers-top-table"]/tr/td/h4/text()')

Furthermore, I tried this XPath:

s_market.xpath('//*[@id="bloggers-top-table"]/tbody/text()')

It did not give me any error, but after all these attempts it still gives me either an empty list or a list full of '\n'.

I am not really experienced in web scraping, so it is possible that I have just made a silly mistake somewhere, but since I cannot start my machine learning model without the data, I am really struggling. Please help.

Upvotes: 2

Views: 822

Answers (3)

Thomas Hayes

Reputation: 102

An easier way to do this would be to use pandas. It can read simple HTML tables like this one with no problem. Try the following code to scrape the whole table:

import pandas as pd

# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html('https://hypeauditor.com/top-instagram/')
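
Since pd.read_html gives back a list of DataFrames, you index into it to get the table itself. A minimal usage sketch, assuming the ranking is the first table pandas finds on the page:

import pandas as pd

tables = pd.read_html('https://hypeauditor.com/top-instagram/')
top_table = tables[0]    # assumption: the ranking is the first table parsed
print(top_table.head())  # inspect the parsed columns before relying on them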

Upvotes: 2

QHarr

Reputation: 84465

Here is a more lightweight way of getting just that column using nth-of-type. You should find this faster.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://hypeauditor.com/top-instagram/')
soup = bs(r.content, 'lxml')
# grab the fourth cell of every row; the [1:] slice drops the header row
accounts = [item.text.strip().split('\n') for item in soup.select('#bloggers-top-table td:nth-of-type(4)')][1:]
print(accounts)
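
As an aside, nth-of-type is 1-based, so td:nth-of-type(4) picks the fourth cell in each row. If the served markup wraps the data rows in a tbody (an assumption about this page's HTML), you could scope the selector to it and skip the slicing; a sketch:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://hypeauditor.com/top-instagram/')
soup = bs(r.content, 'lxml')
# assumption: tbody scoping excludes the header row, so no [1:] slice is needed
accounts = [item.text.strip().split('\n') for item in soup.select('#bloggers-top-table tbody td:nth-of-type(4)')]
print(accounts)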

Upvotes: 2

Yaakov Bressler

Reputation: 12018

You will definitely want to get acquainted with the package BeautifulSoup, which lets you navigate a web page's content in Python.

Using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://hypeauditor.com/top-instagram/'
r = requests.get(url)
html = r.text

soup = BeautifulSoup(html, 'html.parser')

top_bloggers = soup.find('table', id="bloggers-top-table")
table_body = top_bloggers.find('tbody')
rows = table_body.find_all('tr')

# For all data:
# Will retrieve a list of lists, good for inputting to pandas

data=[]

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values


# For just handles:
# Will retrieve a list of handles, only

handles=[]

for row in rows:
    cols = row.find_all('td')
    values = cols[3].text.strip().split('\n')
    handles.append(values[-1])

The for loop I use for rows is sourced from this answer
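
Since the list-of-lists output is convenient for pandas, here is a minimal sketch of turning it into a DataFrame (the data variable comes from the loop above; pandas pads any ragged rows with NaN):

import pandas as pd

# build a DataFrame from the scraped rows for the machine learning step
df = pd.DataFrame(data)
print(df.head())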

Upvotes: 3
