Karim Bn Abdlaziz

Reputation: 57

Web scraping info and writing it to a CSV file

I need to parse info from an HTML page with Python (BeautifulSoup or Scrapy), then write it to a CSV file. The relevant info is the file names and the number of times each file has been viewed in my account.

Relevant HTML for the view counts:

<div class="hidden-tiles views C C1">
      <nobr class="hidden-xs">num </nobr>
      <nobr class="hidden-sm hidden-md hidden-lg">num</nobr>
</div>

Relevant HTML for file names:

<div class="ttl">
       {filename}
</div>

What I was able to do so far:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://archive.org/details/%40kareem76?&sort=-publicdate&page=2")
page.content  # raw HTML bytes of the response
nbr = BeautifulSoup(page.content, 'html.parser')
nbr.find_all('div', class_='hidden-tiles views C C1')

Upvotes: 0

Views: 214

Answers (2)

dabingsou

Reputation: 2469

Here is another possible solution, using the simplified_scrapy library:

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
url = 'https://archive.org/details/@kareem76?&sort=-publicdate&page=2'
html = req.get(url)
doc = SimplifiedDoc(html)
blocks = doc.selects('div.results>div.item-ia').notContains(['mobile-header','hidden-tiles','collection-ia'],attr='class')
for block in blocks:
  nums = block.selects('div.hidden-tiles views C C1>nobr>text()')
  title = block.select('div.ttl>text()')
  print(title, nums[0], nums[1])

Result:

ننتصر او ننتصر من اجل الربيع العربي المنصف المرزوقي 1,056 1.1K
الرحلة مذكرات آدمي المنصف المرزوقي ط.مزيدة و منقحة 874 874
الثورة التونسية المجيدة، بنية ثورة وصيرورتها من خلال يومياتها عزمي بشارة الطبعة الثانية 469 469
The Case For Impeachment Allan J. Lichtman 65 65
CONTRAT ASSURANCE CREDIT MACRON ALLIANZ 137 137
...
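
The snippet above only prints the rows, while the question asks for a CSV file. A minimal sketch of that last step with the standard-library csv module follows. The helper name rows_to_csv is hypothetical, and the sample rows are copied from the result above (keeping only the first of the two view counts). Note that a value such as 1,056 contains a comma, so the writer quotes it.

```python
import csv
import io


def rows_to_csv(rows, out):
    """Write (title, views) pairs to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["title", "views"])  # header row
    writer.writerows(rows)               # fields containing commas get quoted


# Sample rows copied from the printed result above (first view count only).
rows = [
    ("The Case For Impeachment Allan J. Lichtman", "65"),
    ("CONTRAT ASSURANCE CREDIT MACRON ALLIANZ", "137"),
    ("ننتصر او ننتصر من اجل الربيع العربي المنصف المرزوقي", "1,056"),
]

# An in-memory buffer keeps the sketch free of side effects.
buf = io.StringIO()
rows_to_csv(rows, buf)
csv_text = buf.getvalue()
```

To write a real file instead, pass a handle opened with open("titles-views.csv", "w", newline="", encoding="utf-8"); the explicit UTF-8 encoding is an assumption made here so that the Arabic titles are written the same way on every platform.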

Upvotes: 3

sentence

Reputation: 8923

This code should do the job:

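import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download the account page
html = requests.get("https://archive.org/details/@kareem76").text

soup = BeautifulSoup(html, 'html.parser')
# File names: text of every <div class="ttl">
titles = [i.text.strip() for i in soup.find_all('div', class_='ttl')]
# View counts: first <nobr> inside every views <div>
views = [i.find('nobr').text.strip() for i in soup.find_all('div', class_='hidden-tiles views C C1')]

# Both lists must have the same length, otherwise pandas raises a ValueError
df = pd.DataFrame({'titles': titles,
                   'views': views})

# Write the table to a CSV file without the index column
df.to_csv("titles-views.csv",
          mode='w',
          index=False,
          header=True)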

and you get (just an excerpt):

[image: excerpt of the resulting titles-views.csv]

Upvotes: 2
