Isak

Reputation: 545

Python, BeautifulSoup code seems to work, but no data in the CSV?

I have about 500 HTML files in a directory, and I want to extract data from them and save the results in a CSV file.

The code I'm using doesn't raise any errors and seems to scan all the files, but the resulting CSV is empty except for the header row.

I'm fairly new to Python and I'm clearly doing something wrong. I hope someone out there can help!

from bs4 import BeautifulSoup
import csv
import urllib2
import os

def processData( pageFile ):
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)

    metaData = soup.find_all('div class="item_details"')
    priceData = soup.find_all('div class="price_big"')


    # define where we will store info
    vendors = []
    shipsfroms = []
    shipstos = []
    prices = []

    for html in metaData:
        text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "")
        vendors.append(text.split("vendor:")[1].split("ships from:")[0].strip())
        shipsfroms.append(text.split("ships from:")[1].split("ships to:")[0].strip())
        shipstos.append(text.split("ships to:")[1].strip())

    for price in priceData:
        prices.append(BeautifulSoup(str(price)).get_text().encode("utf-8").strip())

    csvfile = open('drugs.csv', 'ab')
    writer = csv.writer(csvfile)

    for shipsto, shipsfrom, vendor, price in zip(shipstos, shipsfroms, vendors, prices):
        writer.writerow([shipsto, shipsfrom, vendor, price])

    csvfile.close()

dir = "drugs"

csvFile = "drugs.csv"

csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["Vendors", "ShipsTo", "ShipsFrom", "Prices"])
csvfile.close()

fileList = os.listdir(dir)

totalLen = len(fileList)
count = 1

for htmlFile in fileList:
    path = os.path.join(dir, htmlFile)
    processData(path)
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..." 
    count = count + 1

I suspect that I'm telling BeautifulSoup to look in the wrong part of the HTML, but I can't see what it should be instead. Here's an excerpt of the HTML with the info I need:

</div>
<div class="item" style="overflow: hidden;">
  <div class="item_image" style="width: 180px; height: 125px;" id="image_255"><a   href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt" style="display: block;  width: 180px; height: 125px;"></a></div>
  <div class="item_body">
    <div class="item_title"><a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt">200mg High Quality DMT</a></div>
    <div class="item_details">
      vendor: <a href="https://silkroad6ownowfk.onion.to/users/ringo-deathstarr">ringo deathstarr</a><br>
      ships from: United States<br>
      ships to: Worldwide
    </div>
  </div>
  <div class="item_price">
    <div class="price_big">฿0.031052</div>
    <a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt#shipping">add to cart</a>
  </div>
</div>

Disclaimer: the information is for a research project about online drug trade.

Upvotes: 0

Views: 178

Answers (1)

Sabuj Hassan

Reputation: 39443

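The arguments you pass to find_all are wrong. It takes the tag name and the attribute filter as separate arguments; a string like 'div class="item_details"' is treated as a tag name, so nothing matches and your lists stay empty. Here is a working example: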

metaData = soup.find_all("div", {"class":"item_details"})
priceData = soup.find_all("div", {"class":"price_big"})
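To see the difference in isolation, here is a minimal sketch. It assumes Python 3 and the built-in "html.parser" backend (neither is in the question, which uses Python 2), and the sample markup is trimmed from the excerpt in the question:

```python
from bs4 import BeautifulSoup

# Sample markup trimmed from the question's excerpt.
html = """
<div class="item_body">
  <div class="item_details">
    vendor: <a href="#">ringo deathstarr</a><br>
    ships from: United States<br>
    ships to: Worldwide
  </div>
</div>
<div class="item_price">
  <div class="price_big">0.031052</div>
</div>
"""

# Parser choice is an assumption; the question does not name one.
soup = BeautifulSoup(html, "html.parser")

# The whole string is treated as a tag *name*, so nothing matches.
wrong = soup.find_all('div class="item_details"')

# Tag name and attribute filter passed separately.
right = soup.find_all("div", {"class": "item_details"})

# Equivalent keyword form; note the trailing underscore in class_.
same = soup.find_all("div", class_="item_details")

print(len(wrong), len(right), len(same))
```

An empty result from find_all is not an error, which is why the script ran silently and only the header row reached the CSV.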

You can find more about its usage in the BeautifulSoup documentation.
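Once the lookups return matches, the rest of the extraction could look roughly like the sketch below. It is written for Python 3 with "html.parser", and extract_rows is a hypothetical helper name, not something from your code. It writes to an in-memory buffer so it can be run without touching drugs.csv:

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup trimmed from the question's excerpt.
html = """
<div class="item_body">
  <div class="item_details">
    vendor: <a href="#">ringo deathstarr</a><br>
    ships from: United States<br>
    ships to: Worldwide
  </div>
</div>
<div class="item_price">
  <div class="price_big">฿0.031052</div>
</div>
"""


def extract_rows(page):
    """Return one [vendor, ships_to, ships_from, price] row per item (hypothetical helper)."""
    soup = BeautifulSoup(page, "html.parser")
    details = soup.find_all("div", {"class": "item_details"})
    prices = soup.find_all("div", {"class": "price_big"})

    rows = []
    for detail, price in zip(details, prices):
        # Collapse newlines and indentation into single spaces.
        text = " ".join(detail.get_text().split())
        vendor = text.split("vendor:")[1].split("ships from:")[0].strip()
        ships_from = text.split("ships from:")[1].split("ships to:")[0].strip()
        ships_to = text.split("ships to:")[1].strip()
        # Same column order as the header row below.
        rows.append([vendor, ships_to, ships_from, price.get_text().strip()])
    return rows


rows = extract_rows(html)

# In-memory stand-in for the CSV file.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Vendors", "ShipsTo", "ShipsFrom", "Prices"])
writer.writerows(rows)
print(buf.getvalue())
```

The rows here are written in the same order as the header ("Vendors", "ShipsTo", "ShipsFrom", "Prices"); your code writes shipsto, shipsfrom, vendor, price, which does not line up with that header. When writing to a real file in Python 3, open it with newline="" and an explicit encoding such as "utf-8" so the ฿ sign is stored correctly.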

Upvotes: 1
