Reputation: 57
I need to parse information from an HTML page with Python (BeautifulSoup or Scrapy) and then write it to a CSV file. The relevant information is the file names and the number of times each item in my account has been viewed, here.
Relevant HTML for the view counts:
<div class="hidden-tiles views C C1">
<nobr class="hidden-xs">num </nobr>
<nobr class="hidden-sm hidden-md hidden-lg">num</nobr>
</div>
Relevant HTML for file names:
<div class="ttl">
{filename}
</div>
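What I was able to do so far:
import requests
from bs4 import BeautifulSoup

# Fetch page 2 of the account listing, sorted by newest upload first
page = requests.get("https://archive.org/details/%40kareem76?&sort=-publicdate&page=2")
page
page.content
# Parse the raw HTML, then look for the divs that hold the view counts
nbr = BeautifulSoup(page.content, 'html.parser')
nbr.find_all('div', class_='hidden-tiles views C C1')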
Upvotes: 0
Views: 214
Reputation: 2469
Here is another possible solution, using simplified_scrapy.
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
url = 'https://archive.org/details/@kareem76?&sort=-publicdate&page=2'
html = req.get(url)
doc = SimplifiedDoc(html)
blocks = doc.selects('div.results>div.item-ia').notContains(['mobile-header','hidden-tiles','collection-ia'],attr='class')
for block in blocks:
    nums = block.selects('div.hidden-tiles views C C1>nobr>text()')
    title = block.select('div.ttl>text()')
    print(title, nums[0], nums[1])
Result:
ننتصر او ننتصر من اجل الربيع العربي المنصف المرزوقي 1,056 1.1K
الرحلة مذكرات آدمي المنصف المرزوقي ط.مزيدة و منقحة 874 874
الثورة التونسية المجيدة، بنية ثورة وصيرورتها من خلال يومياتها عزمي بشارة الطبعة الثانية 469 469
The Case For Impeachment Allan J. Lichtman 65 65
CONTRAT ASSURANCE CREDIT MACRON ALLIANZ 137 137
...
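Each item in the result shows two values because the HTML in the question holds two nobr elements per views block: the first (class hidden-xs) appears to carry the full count, such as 1,056, and the second an abbreviated form, such as 1.1K. If the CSV should contain plain integers, the first value can be normalized before writing. A minimal sketch, assuming counts use commas as thousands separators; parse_count is a hypothetical helper name, not part of any library:

```python
def parse_count(text):
    """Convert a full count such as '1,056' to an int."""
    # Drop surrounding whitespace and thousands separators before converting
    return int(text.strip().replace(",", ""))


# Sample values copied from the first column of counts in the result above
full_counts = ["1,056", "874 ", "469", "65", "137"]
numbers = [parse_count(c) for c in full_counts]
print(numbers)
```

The abbreviated form (1.1K) is deliberately not handled here, since it loses precision compared with the full count.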
Upvotes: 3
Reputation: 8923
This code should do the job:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download the account page and parse it
html = requests.get("https://archive.org/details/@kareem76").text
soup = BeautifulSoup(html, 'html.parser')

# Collect the titles and the first <nobr> (the full view count) of each views block
titles = [i.text.strip() for i in soup.find_all('div', class_='ttl')]
views = [i.find('nobr').text.strip() for i in soup.find_all('div', class_='hidden-tiles views C C1')]

# Write both columns to a CSV file, without the DataFrame index
df = pd.DataFrame({'titles': titles, 'views': views})
df.to_csv("titles-views.csv", mode='w', index=False, header=True)
and you get (just an excerpt):
Upvotes: 2
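Collecting titles and views in two separate lists assumes both lists end up the same length; pandas raises a ValueError when they do not. A possible alternative is to pair the two fields inside each item block and write the rows with the standard csv module. This is only a sketch under assumptions: the item-ia class is taken from the selector in the simplified_scrapy answer, SAMPLE_HTML is a hypothetical fragment modeled on the HTML in the question, extract_rows is a made-up helper name, and none of it has been checked against the live page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample modeled on the fragments shown in the question
SAMPLE_HTML = """
<div class="results">
  <div class="item-ia">
    <div class="ttl">First file</div>
    <div class="hidden-tiles views C C1">
      <nobr class="hidden-xs">1,056 </nobr>
      <nobr class="hidden-sm hidden-md hidden-lg">1.1K</nobr>
    </div>
  </div>
  <div class="item-ia">
    <div class="ttl">Second file</div>
    <div class="hidden-tiles views C C1">
      <nobr class="hidden-xs">874 </nobr>
      <nobr class="hidden-sm hidden-md hidden-lg">874</nobr>
    </div>
  </div>
</div>
"""


def extract_rows(html):
    """Return (title, views) pairs, one per item that has both fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("div.item-ia"):
        title = item.select_one("div.ttl")
        # First <nobr> inside the views block, i.e. the full count
        views = item.select_one("div.hidden-tiles.views nobr")
        if title is None or views is None:
            continue  # skip blocks that lack either field
        rows.append((title.get_text(strip=True), views.get_text(strip=True)))
    return rows


rows = extract_rows(SAMPLE_HTML)

# Written to an in-memory buffer for illustration; for a real file, use
# open("titles-views.csv", "w", newline="", encoding="utf-8") instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "views"])
writer.writerows(rows)
print(buffer.getvalue())
```

Because each row is built from a single item block, a title can never be matched with the view count of a different item, and the csv module quotes values such as 1,056 that contain a comma.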