Reputation: 207

Python beautiful soup merging rows

here is the html

<table>
<tr>
<td class="break">mono</td>
</tr>
<tr>
<td>c1</td>
<td>c2</td>
<td>c3</td>
</tr>
<tr>
<td>c11</td>
<td>c22</td>
<td>c33</td>
</tr>
<tr>
<td class="break">dono</td>
</tr>
<tr>
<td>d1</td>
<td>d2</td>
<td>d3</td>
</tr>
<tr>
<td>d11</td>
<td>d22</td>
<td>d33</td>
</tr>
</table>

Now I want output like this in a csv file:

mono c1 c2 c3
mono c11 c22 c33
dono d1 d2 d3
dono d11 d22 d33

But I am getting output like this:

mono
c1 c2 c3
c11 c22 c33
dono
d1 d2 d3
d11 d22 d33

Here is my code:

import codecs
from bs4 import BeautifulSoup
with codecs.open('dump.csv', "w", encoding="utf-8") as csvfile:


    f = open("input.html","r")

    soup = BeautifulSoup(f)
    t = soup.findAll('table')
    for table in t:
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                csvfile.write(str(td.find(text=True)))
                csvfile.write(",")
            csvfile.write("\n")

Please help me resolve this issue. Thanks.

Edit:

Here is some more detail. I need the text of each section heading (mono, dono, etc.) to be added to the start of the rows that follow it.

The rule is: until I encounter a new "break" class, the text inside the current "break" cell should be added to every tr below it.

Upvotes: 3

Answers (4)

abarnert

Reputation: 365707

Since your new question is effectively an entirely different question from the original, here's an entirely different answer:

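for table in t:
    header = ''  # text of the most recent "break" cell; empty until one is seen
    rows = table.findAll('tr')
    for row in rows:
        cols = row.findAll('td')
        if 'break' in cols[0].get('class', []):
            # A "break" row: remember its text for the rows that follow
            header = cols[0].text
        else:
            # A regular row: print the remembered header, then the cells
            print(header, ' '.join(col.text for col in cols))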

I'm assuming that a row will contain either exactly one "break" column or one or more regular columns. If those assumptions aren't true, the code can be modified.

Also, if the generator expression in the join function confuses you, the same thing can be rewritten as an explicit loop: print the header; then for each column, print that column; then print a newline.
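
A minimal sketch of that explicit-loop version, written for Python 3. To keep it runnable without an HTML file, plain strings stand in for the col.text values, and format_row is a hypothetical helper name, not something from the code above.

```python
def format_row(header, cells):
    """Build one output line the long way, without a generator expression.

    `cells` stands in for [col.text for col in cols]; using plain strings
    is an assumption made so the sketch runs without BeautifulSoup.
    """
    parts = [header]          # start with the header
    for cell in cells:        # then add each column in turn
        parts.append(cell)
    return ' '.join(parts)    # print() supplies the trailing newline


print(format_row('mono', ['c1', 'c2', 'c3']))
```

Collecting the pieces in a list and joining once keeps the spacing identical to the one-line version.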

Since you asked for an explanation of 'break' in cols[0].get('class', []), I'll break it down.

cols is a list of the BS4 Tag objects for every td node in the current tr node.
cols[0] is the first one.
cols[0].get('class', []) treats the Tag object as a dictionary, as described in the docs, and calls the familiar get(key, defaultvalue) method on it.
- In BS4 (unlike BS3), looking up a multi-valued attribute such as class returns a list of values rather than a single string. Where BS3 would return 'foo bar' for <td class='foo bar'>, BS4 returns ['foo', 'bar']. How a repeated attribute such as <td class='foo' class='bar'> is resolved may depend on the parser, so don't rely on it.
Putting it all together, cols[0].get('class', []) will be ['break'] for the <td class='break'> case, and [] for all of the other cases in your sample input.
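
A quick way to check this yourself, assuming BS4 is installed. The 'html.parser' backend and the one-row sample table are my choices for the sketch; the original code does not name a parser.

```python
from bs4 import BeautifulSoup

# One td with class="break" and one td with no class attribute at all.
html = '<table><tr><td class="break">mono</td><td>c1</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# One entry per td: the list of class names, or the [] default when the
# td has no class attribute.
classes = [td.get('class', []) for td in soup.find_all('td')]
print(classes)
```

This should print [['break'], []], which is why the membership test 'break' in ... works for both kinds of cell without a separate None check.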

As mentioned above, I'm assuming that a row will either be exactly 1 "break" column, or 1 or more regular columns. You can see where I'm making use of those assumptions in the code. But if any of those assumptions are broken, you haven't told us enough to know what you want to do in those cases.

If you have any rows with no columns, cols[0] will raise an IndexError. But you have to decide what to do in that case. Should it do nothing? Print just the header? Change to a state where nothing gets printed until we see a header row? Whatever you decide, it should be easy to code.

If you have any rows with a header followed by normal columns, the normal columns will be ignored. If you have any headers that aren't the first column in a row, they will be treated like normal values. If you have multiple headers in the same row, all but the first will be ignored. And so on. In each case, this may or may not be what you want. But you have to decide what you want before you can write the code.
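
Here is one way to pin those decisions down, as a sketch rather than the definitive answer: skip rows with no td at all, treat a row as a header only when its first td has the "break" class, and write the result with the csv module. The function name merge_rows, the 'html.parser' choice, the sample HTML, and the skip-empty-rows policy are all assumptions on my part.

```python
import csv
import io

from bs4 import BeautifulSoup


def merge_rows(html):
    """Return data rows with the most recent "break" text prepended."""
    soup = BeautifulSoup(html, 'html.parser')
    merged = []
    for table in soup.find_all('table'):
        header = ''  # assumption: empty until the first "break" row
        for row in table.find_all('tr'):
            cols = row.find_all('td')
            if not cols:
                continue  # assumption: rows without any td are skipped
            if 'break' in cols[0].get('class', []):
                header = cols[0].get_text(strip=True)
            else:
                cells = [col.get_text(strip=True) for col in cols]
                merged.append([header] + cells)
    return merged


sample = """
<table>
<tr><td class="break">mono</td></tr>
<tr><td>c1</td><td>c2</td><td>c3</td></tr>
<tr></tr>
<tr><td class="break">dono</td></tr>
<tr><td>d1</td><td>d2</td><td>d3</td></tr>
</table>
"""

out = io.StringIO()  # stands in for the real CSV file
csv.writer(out, lineterminator='\n').writerows(merge_rows(sample))
print(out.getvalue())
```

Returning a list of rows instead of printing keeps the parsing separate from the output, so a different policy for the edge cases only touches merge_rows.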

Upvotes: 3

abarnert

Reputation: 365707

If you want to run all the rows in the table together, why not just ignore the rows?

for table in t:
    cols = table.findAll('td')
    for td in cols:
        csvfile.write(str(td.find(text=True)))
        csvfile.write(",")
    csvfile.write("\n")

Half the reason to use BeautifulSoup instead of a strict parser is to let you play loose with the structure (the other half is to let you deal with people who played loose while generating the structure). So, why go row by row and then try to ignore the row-by-rowness when you can just go column by column?

You'd be much better off using the csv module than trying to format it manually, but that's a separate issue.
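
As a rough sketch of the difference: csv.writer handles the commas, the line endings, and the quoting of any value that itself contains a comma. The cell values below are made up for illustration, and io.StringIO stands in for the real output file.

```python
import csv
import io

# Made-up cell values; the second one contains a comma on purpose.
cells = ['c1', 'c2, with a comma', 'c3']

buf = io.StringIO()  # stands in for the opened CSV file
writer = csv.writer(buf, lineterminator='\n')
writer.writerow(cells)  # one call writes the whole row

# Hand-rolled formatting, for comparison: the embedded comma is not
# quoted, and a trailing comma is left at the end of the line.
manual = ''.join(cell + ',' for cell in cells) + '\n'

print(buf.getvalue(), end='')
print(manual, end='')
```

The csv line reads back as three fields, while the hand-rolled line would be read back as five.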

Upvotes: 1

Blender

Reputation: 298166

Use the built-in csv module for working with CSV files. It's much easier than manually doing it.

As for your problem, this is happening because your csvfile.write('\n') is indented too far, so the data is written just like it appears in the table. Make a generator instead and it should work:

import csv
from bs4 import BeautifulSoup

def get_fields(soup):
    for td in soup.find_all('td'):
        yield td.get_text().strip()

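with open('csvfile.csv', 'w', newline='') as csvfile:  # newline='' is what the csv docs recommend on Python 3
    writer = csv.writer(csvfile)

    with open('input.html', 'r') as handle:
        soup = BeautifulSoup(handle, 'html.parser')  # name the parser explicitly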

    fields = list(get_fields(soup))

    writer.writerow(fields)

Upvotes: 2

Alex L

Reputation: 8925

Have you tried un-indenting csvfile.write("\n") so that it occurs at the end of the table loop, not the tr loop?

Upvotes: 1
