Reputation: 19
I am currently scraping HTML from Baka-Updates (mangaupdates.com). However, the same div class names are reused many times on the page.
My goal is CSV or JSON output, so I would like to use the text in each [sCat] div as the column name and the text in the matching [sContent] div as the stored value. Is there a way to scrape this kind of website?
Thanks,
Sample https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558')
tree = html.fromstring(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
@Jasper Nichol M Fabella I tried, but I could not find anything.
Upvotes: 1
Views: 397
Reputation: 723
What are you using to scrape? If you are using BeautifulSoup, you can search the page for all elements with a given class using the find_all method and iterate through the results. You can use the special "class_" keyword argument for that.
Something like:
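import bs4
import requests
page = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = bs4.BeautifulSoup(page.content, 'html.parser')
# find_all returns every <div> whose class attribute includes 'sCat'
cats = soup.find_all('div', class_='sCat')
# do rest of your logic work here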
Edit: I was typing on my mobile on a cached page before you made the edits, so I didn't see the changes. I see now that you are using the raw lxml library to parse. Yes, that's faster, but I am not too familiar with it, as I've only used raw lxml for one project. I think you can chain two search methods to distill the result to something equivalent.
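As a rough sketch of what chaining two searches could look like in lxml: the first XPath finds every sCat div, and a second, relative XPath run from each of those elements picks the nearest following sContent sibling. The SAMPLE markup below is a hypothetical stand-in and assumes the two divs are siblings; the real page layout may differ.

```python
from lxml import html

# Hypothetical stand-in markup; the real page layout may differ.
SAMPLE = """
<div>
  <div class="sCat"><b>Type</b></div>
  <div class="sContent">Manga</div>
  <div class="sCat"><b>Year</b></div>
  <div class="sContent">2011</div>
</div>
"""

tree = html.fromstring(SAMPLE)
record = {}
for cat in tree.xpath('//div[@class="sCat"]'):
    # Second search, relative to this sCat: nearest following sContent sibling.
    siblings = cat.xpath('following-sibling::div[@class="sContent"][1]')
    if siblings:
        record[cat.text_content().strip()] = siblings[0].text_content().strip()

print(record)
```

Pairing by sibling rather than by list position should keep the categories and values aligned even if some sCat has no sContent next to it.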
Upvotes: 0
Reputation: 1150
Here is an example with the requests and lxml libraries:
from lxml import html
import requests
r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
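Once the two lists exist, turning them into JSON or a one-row CSV needs only the standard library. This is a minimal sketch, and the list values here are hypothetical samples standing in for whatever the scrape actually returns.

```python
import csv
import io
import json

# Hypothetical sample values standing in for the scraped lists.
sCat = ['Type', 'Year', 'Status in Country of Origin']
sContent = ['Manga', '2011', '5 Volumes (Complete)']

# Pair each category name with the value at the same position.
record = dict(zip(sCat, sContent))

# JSON: one object keyed by category name.
json_text = json.dumps(record, ensure_ascii=False, indent=2)

# CSV: category names as the header row, values as a single data row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=sCat)
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```

To write files instead of strings, the same writer can be pointed at a file opened with newline='' in place of the StringIO buffer.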
Upvotes: 0
Reputation: 760
I tried editing your code and got the output I describe below. Maybe it will help.
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558')
tree = html.fromstring(page.content)
# Select the elements themselves (not just text()) so nested text is kept
sCat = tree.xpath('//div[@class="sCat"]')
# Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')
print('sCat: ', len(sCat))
print('sContent: ', len(sContent))
json_dict = {}
# Pair each category with the content div at the same position
for cat, content in zip(sCat, sContent):
    sCat_text = ''.join(cat.itertext()).strip()
    sContent_text = ''.join(content.itertext()).strip()
    json_dict[sCat_text] = sContent_text
print(json_dict)
Running this prints the dictionary of category names mapped to their content.
Hope it helps.
Upvotes: 1
Reputation: 460
You can use XPath expressions and build an absolute path to the exact element you want to scrape.
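A small sketch of that idea with lxml, assuming a hypothetical, much simplified page: getpath() on the element tree reports the absolute XPath of an element you have already located, and that path can then be reused directly. Absolute paths tend to break when the site changes its layout, so treat this as illustrative only.

```python
from lxml import html

# Hypothetical, simplified page; the real site has far more markup.
SAMPLE = """<html><body><div>
  <div class="sCat">Type</div>
  <div class="sContent">Manga</div>
</div></body></html>"""

doc = html.fromstring(SAMPLE)

# Locate one element by class, then ask lxml for its absolute path.
first_content = doc.xpath('//div[@class="sContent"]')[0]
path = doc.getroottree().getpath(first_content)

# Reuse the absolute path to fetch the same element directly.
value = doc.xpath(path)[0].text_content().strip()

print(path)
print(value)
```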
Upvotes: 0