Mark Brown

Reputation: 914

Beautiful Soup: getting contents of all <tag> in xml-ish file

I've got an xml-ish file I'm trying to parse with BeautifulSoup. It contains an unknown number of tags nested within another tag. Things go swimmingly, but only for the first tag I extract from the set of nested tags. This isn't really html or xml, but close...

Given the format:

<data>
<type>
    <type_attribute_1>1</type_attribute_1>
    <type_attribute_2>2</type_attribute_2>
</type>
<type>
    <type_attribute_1>3</type_attribute_1>
    <type_attribute_2>4</type_attribute_2>
</type>
</data>

How might I extract the values of type_attribute_1 and type_attribute_2 for both type tags and assign each to its own variable, i.e. "Type_1_attribute_1", "Type_1_attribute_2", "Type_2_attribute_1" & "Type_2_attribute_2"?

I'm using code like this, but it only works on the first <type> located within the <data>:

Type_1_Attribute_1 = soup.data.type.type_attribute_1.text
Type_1_Attribute_2 = soup.data.type.type_attribute_2.text

UPDATE

Phrasing what I'm looking for a little differently may help. Since I don't know how many Type siblings there are, I can't declare each variable name (such as Type_1_Attribute_1) by hand. Instead I'd like to tack "_1", "_2", "_3"... onto "Type" for each sibling, i.e.
Assuming:

# Collect every <type> tag (tag names are lowercase in the sample data)
Types = soup.select('type')
parseables = len(Types)
for i in range(0, parseables):
    j = i + 1  # 1-based index of the current <type>
    Type = Types[i]
    Attribute_1 = Type.type_attribute_1.text
    print(Attribute_1)

This prints the value of Attribute_1 for each Type. How would I build the name "Type_j_Attribute_1" for each one, with j's value filled in?

Upvotes: 2

Views: 1706

Answers (1)

Learner

Reputation: 5302

What about this-

from bs4 import BeautifulSoup as bs

data  = """<data>
<type>
    <type_attribute_1>1</type_attribute_1>
    <type_attribute_2>2</type_attribute_2>
</type>
<type>
    <type_attribute_1>3</type_attribute_1>
    <type_attribute_2>4</type_attribute_2>
</type>
</data>"""

soup = bs(data,'lxml')

Type_1_Attribute_1 = [i.text.strip() for i in soup.select('type_attribute_1')]
Type_1_Attribute_2 = [i.text.strip() for i in soup.select('type_attribute_2')]

print filter(bool,Type_1_Attribute_1)
print filter(bool,Type_1_Attribute_2)

Output-

[u'1', u'3']
[u'2', u'4']
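
The lists above collect values across all <type> tags, so a value is tied to its parent only by position. Here is a minimal sketch of keeping the values grouped per <type> instead. It swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so it runs without extra packages, and it assumes the input is well-formed XML, as the sample is. The names types and attrs are illustrative only.

```python
import xml.etree.ElementTree as ET

data = """<data>
<type>
    <type_attribute_1>1</type_attribute_1>
    <type_attribute_2>2</type_attribute_2>
</type>
<type>
    <type_attribute_1>3</type_attribute_1>
    <type_attribute_2>4</type_attribute_2>
</type>
</data>"""

root = ET.fromstring(data)

# One dict per <type> element, keyed by the child tag name.
types = [
    {child.tag: (child.text or "").strip() for child in type_el}
    for type_el in root.findall("type")
]

# enumerate(..., start=1) gives the 1-based sibling number j.
for j, attrs in enumerate(types, start=1):
    for name, value in attrs.items():
        print("Type_%d_%s = %s" % (j, name, value))
```

Each entry in types then carries both attributes of one <type>, however many siblings there are.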

EDIT: I don't follow why you need separate variable names. When you loop over the list, the loop variable itself takes each value in turn, e.g.

for i in Type_1_Attribute_1:
    print(i)  # i is itself a variable; it takes the next value on each iteration

Prints-

1
3

So if you need to use every item from that list, just iterate over it and pass each item to a function, as I passed i to the print function.
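
If names like "Type_1_Attribute_1" are still wanted, one common approach is to use them as dictionary keys rather than creating variables dynamically. This is a rough sketch, with the values hard-coded to match the output shown earlier; attribute_1_values and named are hypothetical names.

```python
# Values as produced by the list comprehension in the answer (hard-coded here).
attribute_1_values = ["1", "3"]

# Build the "Type_<j>_Attribute_1" names as dictionary keys, with j starting at 1.
named = {
    "Type_%d_Attribute_1" % j: value
    for j, value in enumerate(attribute_1_values, start=1)
}

print(named["Type_1_Attribute_1"])
print(named["Type_2_Attribute_1"])
```

The dictionary grows with the number of <type> siblings, so nothing has to be declared in advance.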

Upvotes: 2
