Keiro Kamioka

Reputation: 19

HTML scraping a website with duplicated div class names

I am currently working on scraping HTML from Baka-Updates. However, the same div class names are repeated throughout the page.

My goal is CSV or JSON output, so I would like to use the text in [sCat] as the column names and the text in [sContent] as the stored values. Is there a way to scrape this kind of website?

Thanks,

Sample https://www.mangaupdates.com/series.html?id=75363


from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)

#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')

print('sCat: ', sCat)
print('sContent: ', sContent)

@Jasper Nichol M Fabella I tried, but I could not find anything.


Upvotes: 1

Views: 397

Answers (4)

LFMekz

Reputation: 723

What are you using to scrape? If you are using BeautifulSoup, you can search the page for all matching elements with the find_all method and a class filter, then iterate through the results. Because "class" is a reserved word in Python, use the special "class_" keyword argument.

Something like

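import bs4
import requests

# Fetch the page and parse it with Python's built-in HTML parser
page = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = bs4.BeautifulSoup(page.content, 'html.parser')
categories = soup.find_all('div', class_='sCat')
# do the rest of your logic here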
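
To illustrate the "rest of your logic" step, here is a minimal sketch that pairs each sCat div with the sContent div that follows it. It assumes the two divs are siblings in the markup, which may not hold on the real page. SAMPLE_HTML and pair_categories are hypothetical names, and the sample values are made up so the example runs without a network request.

```python
import bs4

# Made-up sample markup; the real page may be structured differently.
SAMPLE_HTML = """
<div class="sCat"><b>Type</b></div>
<div class="sContent">Manga</div>
<div class="sCat"><b>Year</b></div>
<div class="sContent">2012</div>
"""


def pair_categories(markup):
    """Map each sCat heading to the text of the next sContent sibling."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    result = {}
    for cat in soup.find_all("div", class_="sCat"):
        # Assumption: the matching content div is a later sibling of the heading.
        content = cat.find_next_sibling("div", class_="sContent")
        if content is not None:
            result[cat.get_text(strip=True)] = content.get_text(strip=True)
    return result


print(pair_categories(SAMPLE_HTML))
```

Looking up the sibling per heading, rather than relying on list positions, avoids mismatched pairs if a heading has no content block.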

Edit: I was typing on my mobile from a cached page before you made your edits, so I didn't see the changes. I see now that you are using the raw lxml library to parse. Yes, that's faster, but I am not too familiar with it, as I've only used raw lxml for one project. I think you can chain two search methods to get something equivalent.

Upvotes: 0

Harish Vutukuri

Reputation: 1150

Here is an example with the requests and lxml libraries. Using text_content() instead of /text() also picks up text inside nested tags:

from lxml import html
import requests

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)

sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
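
As a possible next step, the two lists can be zipped into a single record with sCat as the keys. This is a minimal sketch: pair_columns is a hypothetical helper, and the list values below are made-up stand-ins for what the XPath queries return.

```python
def pair_columns(names, values):
    """Combine column names and values into one dict, position by position."""
    if len(names) != len(values):
        # A length mismatch means the headings and contents are out of step.
        raise ValueError(
            f"expected equal lengths, got {len(names)} names and {len(values)} values"
        )
    return dict(zip(names, values))


# Hypothetical sample values standing in for the scraped lists.
sCat = ["Description", "Type", "Year"]
sContent = ["A short summary.", "Manga", "2012"]

record = pair_columns(sCat, sContent)
print(record)
```

The explicit length check is there because zip silently truncates to the shorter list, which would hide a scraping problem.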

Upvotes: 0

I edited your code and got the output below. Maybe it will help.


from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)

#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')

print('sCat: ', len(sCat))
print('sContent: ', len(sContent))
json_dict = {}

# Pair each category heading with the content block at the same position
for cat, content in zip(sCat, sContent):
    sCat_text = ''.join(cat.itertext())
    sContent_text = ''.join(content.itertext())
    json_dict[sCat_text] = sContent_text
print(json_dict)


The output is a dictionary that maps each sCat heading to its sContent text.

Hope it helps.
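
Since the stated goal is CSV or JSON, a dictionary like this could be serialized with the standard library. The sketch below is one way to do it. to_json and to_csv are hypothetical helpers, and the json_dict values are made up for illustration.

```python
import csv
import io
import json


def to_json(record):
    """Serialize one scraped record as pretty-printed JSON text."""
    return json.dumps(record, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize a list of records as CSV text, one row per record."""
    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical record standing in for the scraped json_dict.
json_dict = {"Type": "Manga", "Year": "2012"}

print(to_json(json_dict))
print(to_csv([json_dict]))
```

To scrape several series, append one dictionary per page to a list and pass the whole list to to_csv. Keys missing from a record are written as empty cells.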

Upvotes: 1

You can use XPath expressions and build an absolute path to each element you want to scrape.
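
As a related illustration, here is a sketch that uses a relative XPath axis (following-sibling) instead of a fully absolute path, since absolute paths tend to break when the page layout changes. This is a swapped-in variant, not the exact approach described above. It assumes each sCat div is followed by its sContent div as a sibling. SAMPLE_HTML and scrape_pairs are hypothetical names with made-up data.

```python
from lxml import html

# Made-up sample markup; the real page may be structured differently.
SAMPLE_HTML = """
<div id="main">
  <div class="sCat"><b>Type</b></div>
  <div class="sContent">Manga</div>
  <div class="sCat"><b>Year</b></div>
  <div class="sContent">2012</div>
</div>
"""


def scrape_pairs(markup):
    """Map each sCat heading to the first sContent sibling that follows it."""
    tree = html.fromstring(markup)
    pairs = {}
    for cat in tree.xpath('//div[@class="sCat"]'):
        # [1] selects the nearest following sibling that matches.
        siblings = cat.xpath('following-sibling::div[@class="sContent"][1]')
        if siblings:
            key = cat.text_content().strip()
            pairs[key] = siblings[0].text_content().strip()
    return pairs


print(scrape_pairs(SAMPLE_HTML))
```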

Upvotes: 0
