Kronen

Reputation: 466

Data parsing from a table with BeautifulSoup

I'm new to BeautifulSoup and I've been struggling with data parsing from a table:

<table id="data">
    <tr>
      <td class="random.data"></td>
      <td class="name"></td>
      <td class="values"></td> <!-- 0 -->
      <td class="values"></td> <!-- 1 -->
      <td class="values"></td> <!-- 2 -->
      <td class="values"></td> <!-- 3 -->
    </tr>
    <tr>
      <td class=".random_data"></td>
      <td class="name"></td>
      <td class="values"></td> <!-- 0 -->
      <td class="values"></td> <!-- 1 -->
      <td class="values"></td> <!-- 2 -->
      <td class="values"></td> <!-- 3 -->
    </tr>
</table>

I want to create a list of dictionaries like this pseudocode:

content = []
for tr in trs:
    info = {
        'name': tr.getChildren('.name').getText(),
        'value1': tr.getChildren('.values', 0).getText(),  # the first value from values
        'value3': tr.getChildren('.values', 3).getText() # the fourth value from values
    }
    content.append(info)

I've tried several approaches but failed to translate this into BeautifulSoup. Any help or hints?

Upvotes: 1

Views: 99

Answers (1)

alecxe

Reputation: 473853

The idea is to iterate over the table rows and, for every row, find the name cell by its class name, collect all cells with the values class, and pick the desired values by index:

from bs4 import BeautifulSoup

data = """
<table id="data">
    <tr>
      <td class="random.data"></td>
      <td class="name">test1</td>
      <td class="values">0</td>
      <td class="values">1</td>
      <td class="values">2</td>
      <td class="values">3</td>
    </tr>
    <tr>
      <td class=".random_data"></td>
      <td class="name">test2</td>
      <td class="values">0</td> 
      <td class="values">1</td> 
      <td class="values">2</td> 
      <td class="values">3</td>
    </tr>
</table>
"""

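# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")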

data = []
for row in soup.select("table#data tr"):
    name = row.find("td", class_="name").get_text(strip=True)
    values = row.find_all("td", class_="values")

    data.append({
        "name": name,
        "value1": values[0].get_text(strip=True),
        "value3": values[3].get_text(strip=True)
    })

print(data)

Prints:

[
    {'name': 'test1', 'value1': '0', 'value3': '3'},
    {'name': 'test2', 'value1': '0', 'value3': '3'}
]
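
If some rows might have fewer than four values cells, indexing with values[3] raises an IndexError. Below is a minimal sketch of a more defensive variant that uses CSS selectors throughout. The helper name parse_rows, its wanted mapping, and the shortened sample markup are my own assumptions for illustration, not part of the question; select_one requires a reasonably recent bs4 (4.4 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row deliberately has only one values cell.
html = """
<table id="data">
    <tr>
      <td class="name">test1</td>
      <td class="values">0</td>
      <td class="values">1</td>
      <td class="values">2</td>
      <td class="values">3</td>
    </tr>
    <tr>
      <td class="name">test2</td>
      <td class="values">0</td>
    </tr>
</table>
"""


def parse_rows(markup, wanted=None):
    """Return one dict per row; cells that are missing become None."""
    # Map output key -> index into the row's values cells (naming taken from the question).
    wanted = wanted or {"value1": 0, "value3": 3}
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for tr in soup.select("table#data tr"):
        name_cell = tr.select_one("td.name")
        if name_cell is None:
            continue  # skip rows without a name cell, e.g. header rows
        values = tr.select("td.values")
        info = {"name": name_cell.get_text(strip=True)}
        for key, idx in wanted.items():
            # Guard the index instead of letting a short row raise IndexError.
            info[key] = values[idx].get_text(strip=True) if idx < len(values) else None
        rows.append(info)
    return rows


result = parse_rows(html)
print(result)
```

Here the short second row yields None for value3 instead of crashing. Whether None, an empty string, or skipping the row is the right fallback depends on what you do with the data afterwards.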

Upvotes: 1
