Simon Kiely

Reputation: 6040

Extracting particular element in HTML file and inserting into CSV

I have an HTML table stored in a file. I want to take the value of each td in the table that has a particular attribute, like so:

<td describedby="grid_1-1" ... >Value for CSV</td>
<td describedby="grid_1-1" ... >Value for CSV2</td>
<td describedby="grid_1-1" ... >Value for CSV3</td>
<td describedby="grid_1-2" ... >Value for CSV4</td>

and I want to put it into a CSV file, with each new value taking up a new line in the CSV.

So for the file above, the CSV produced would be:

Value for CSV
Value for CSV2
Value for CSV3

Value for CSV4 would be ignored because its attribute is describedby="grid_1-2", not "grid_1-1".

I have tried the code below, but no matter what I try there is (a) a blank line between each printed line and (b) a comma separating each character.

So the output looks more like:

V,a,l,u,e,f,o,r,C,S,V,

V,a,l,u,e,f,o,r,C,S,V,2

What silly thing have I done now?

Thanks :)

import csv
import os
from bs4 import BeautifulSoup

with open("C:\\Users\\ADMIN\\Desktop\\test.html", 'r') as orig_f:
    soup = BeautifulSoup(orig_f.read())
    results = soup.findAll("td", {"describedby":"grid_1-1"})
    with open('C:\\Users\\ADMIN\\Desktop\\Deploy.csv', 'wb') as fp:
        a = csv.writer(fp, delimiter=',')
        for result in results :
            a.writerows(result)

Upvotes: 0

Views: 129

Answers (2)

Vivek Sable

Reputation: 10213

Use the lxml and csv modules.

  1. Use lxml's xpath() method to get the text of every td whose describedby attribute has the value grid_1-1.
  2. Open the CSV file in write mode.
  3. Write each value as a row in the CSV file with the writerow() method.

code:

from lxml import etree
import csv

content = """
<body>
<td describedby="grid_1-1">Value for CSV</td>
<td describedby="grid_1-1">Value for CSV2</td>
<td describedby="grid_1-1">Value for CSV3</td>
<td describedby="grid_1-2">Value for CSV4</td>
</body>
"""

# Parse the markup and select the text of each matching td.
root = etree.fromstring(content)
values = root.xpath("//td[@describedby='grid_1-1']/text()")

# 'wb' is the Python 2 mode for csv files.
with open('/home/vivek/Desktop/output.csv', 'wb') as fp:
    writer = csv.writer(fp, delimiter=',')
    for value in values:
        # Wrap the string in a list so it is written as a single field.
        writer.writerow([value])

output:

Value for CSV
Value for CSV2
Value for CSV3
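
On Python 3 the same idea might look like the sketch below. It rests on a few assumptions: the 'wb' mode above suits Python 2, whereas Python 3's csv module expects a text-mode file opened with newline=''. The standard library's xml.etree.ElementTree is swapped in for lxml only so the example has no third-party dependency, and the helper names and the temporary output path are hypothetical.

```python
import csv
import os
import tempfile
import xml.etree.ElementTree as ET

content = """
<body>
<td describedby="grid_1-1">Value for CSV</td>
<td describedby="grid_1-1">Value for CSV2</td>
<td describedby="grid_1-1">Value for CSV3</td>
<td describedby="grid_1-2">Value for CSV4</td>
</body>
"""


def extract_cells(markup, grid_id="grid_1-1"):
    """Return the text of every td whose describedby equals grid_id."""
    root = ET.fromstring(markup)
    query = ".//td[@describedby='%s']" % grid_id
    return [td.text for td in root.findall(query)]


def write_single_column(values, path):
    """Write one value per row; newline='' lets csv control line endings."""
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp)
        for value in values:
            # A one-element list keeps the whole string in a single field.
            writer.writerow([value])


values = extract_cells(content)
# Hypothetical output location, chosen only for this illustration.
out_path = os.path.join(tempfile.gettempdir(), "output_sketch.csv")
write_single_column(values, out_path)
```

This only works because the sample markup is well-formed XML; a real HTML file would probably still need lxml's HTML parser or BeautifulSoup.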

Upvotes: 1

Padraic Cunningham

Reputation: 180391

If result is a string inside a list, you need to wrap it in another list, because writerows expects an iterable of iterables; otherwise it iterates over the string itself:

a.writerows([result])  # wrap in a list

In your case you should use writerow and extract the text from each td tag in results:

  a.writerow([result.text]) # write the text from td element

You have all the td tags in your results list, so you just need to extract the text from each one with .text.
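
To see the difference in isolation, here is a small sketch that writes to an in-memory buffer instead of a file. It assumes Python 3; rows_written is a hypothetical helper, and a plain string stands in for the td text.

```python
import csv
import io


def rows_written(write):
    """Run write(writer) against an in-memory buffer and return the CSV lines."""
    buf = io.StringIO()
    write(csv.writer(buf, lineterminator="\n"))
    return buf.getvalue().splitlines()


text = "Value"  # stands in for result.text

# writerows treats the string as a sequence of rows,
# so every character ends up on its own line.
per_char_rows = rows_written(lambda w: w.writerows(text))

# writerow given the bare string treats each character as a field,
# which produces the comma-separated characters from the question.
per_char_fields = rows_written(lambda w: w.writerow(text))

# writerow given a one-element list writes the whole string as one field.
whole_value = rows_written(lambda w: w.writerow([text]))

print(per_char_rows)    # ['V', 'a', 'l', 'u', 'e']
print(per_char_fields)  # ['V,a,l,u,e']
print(whole_value)      # ['Value']
```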

Upvotes: 3
