mr.M

Reputation: 899

Merge child nodes of similar parent nodes (XML, Python)

I have the following xml file:

<root>
    <article_date>09/09/2013
    <article_time>1
        <article_name>aaa1</article_name>
        <article_link>1aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa2</article_name>
        <article_link>2aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa3</article_name>
        <article_link>3aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa4</article_name>
        <article_link>4aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa5</article_name>
        <article_link>5aaaaaaa</article_link>
    </article_time>
    </article_date>
</root>

I would like to transform it to the following file:

<root>
    <article_date>09/09/2013
    <article_time>1
        <article_name>aaa1+aaa3+aaa5</article_name>
        <article_link>1aaaaaaa+3aaaaaaa+5aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa2+aaa4</article_name>
        <article_link>2aaaaaaa+4aaaaaaa</article_link>
    </article_time>
    </article_date>
</root>

How can I do it in python?

My approach to this task is the following:

1) loop through the article_time tags
2) build a dictionary whose key is the article_time value (either 0 or 1)
3) for each key in this dictionary, find all child nodes (article_name and article_link) and append them

Based on that, I wrote the following code (PS: I am currently struggling with adding elements to the dictionary, but I will overcome this issue):

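import os
import xml.etree.ElementTree as et

def parse():
    list_of_unique_timestamps = []
    text_to_merge = ""
    # et.parse does not expand "~", so expand it explicitly
    tree = et.parse(os.path.expanduser("~/Documents/test1.xml"))
    root = tree.getroot()
    for children in root:
        print(children.tag, children.text)
        for child in children:
            print(child.tag, int(child.text))
            # remember each distinct article_time value once
            if child.text not in list_of_unique_timestamps:
                list_of_unique_timestamps.append(child.text)
    print(list_of_unique_timestamps)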

Upvotes: 3

Views: 1964

Answers (2)

alecxe

Reputation: 473903

Here's a solution using xml.etree.ElementTree from the Python standard library.

The idea is to gather the items into a defaultdict(list) keyed by the article_time text value:

from collections import defaultdict
import xml.etree.ElementTree as ET

data = """<root>
    <article_date>09/09/2013
    <article_time>1
        <article_name>aaa1</article_name>
        <article_link>1aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa2</article_name>
        <article_link>2aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa3</article_name>
        <article_link>3aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa4</article_name>
        <article_link>4aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa5</article_name>
        <article_link>5aaaaaaa</article_link>
    </article_time>
    </article_date>
</root>
"""

tree = ET.fromstring(data)

root = ET.Element('root')
article_date = ET.SubElement(root, 'article_date')
article_date.text = tree.find('.//article_date').text

data = defaultdict(list)
for article_time in tree.findall('.//article_time'):
    text = article_time.text.strip()
    name = article_time.find('./article_name').text
    link = article_time.find('./article_link').text
    data[text].append((name, link))

for time_value, items in data.items():
    article_time = ET.SubElement(article_date, 'article_time')
    article_name = ET.SubElement(article_time, 'article_name')
    article_link = ET.SubElement(article_time, 'article_link')

    article_time.text = time_value
    article_name.text = '+'.join(name for (name, _) in items)
    article_link.text = '+'.join(link for (_, link) in items)

print(ET.tostring(root, encoding='unicode'))

prints (prettified):

<root>
    <article_date>09/09/2013
        <article_time>1
            <article_name>aaa1+aaa3+aaa5</article_name>
            <article_link>1aaaaaaa+3aaaaaaa+5aaaaaaa</article_link>
        </article_time>
        <article_time>0
            <article_name>aaa2+aaa4</article_name>
            <article_link>2aaaaaaa+4aaaaaaa</article_link>
        </article_time>
    </article_date>
</root>

As you can see, the result is exactly what you were aiming for.
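
If you would rather modify the parsed tree in place (for example, to write it back to a file afterwards), the same grouping idea could be sketched as below. This is only a sketch under a few assumptions: the function name merge_article_times and the compact sample string are made up for illustration, grouping is done separately inside each article_date, and the order of the merged nodes follows the first appearance of each time value.

```python
import xml.etree.ElementTree as ET


def merge_article_times(root):
    """Merge <article_time> siblings that share the same text, per <article_date>."""
    # findall returns a list, so it is safe to modify the tree while looping
    for article_date in root.findall('.//article_date'):
        groups = {}  # time value -> list of (name, link), in first-seen order
        for article_time in article_date.findall('article_time'):
            key = (article_time.text or '').strip()
            name = article_time.findtext('article_name', default='')
            link = article_time.findtext('article_link', default='')
            groups.setdefault(key, []).append((name, link))
            article_date.remove(article_time)  # drop the original node
        for key, items in groups.items():
            merged = ET.SubElement(article_date, 'article_time')
            merged.text = key
            ET.SubElement(merged, 'article_name').text = '+'.join(n for n, _ in items)
            ET.SubElement(merged, 'article_link').text = '+'.join(l for _, l in items)
    return root


# Hypothetical, shortened sample input (three entries instead of five)
sample = (
    "<root><article_date>09/09/2013"
    "<article_time>1<article_name>aaa1</article_name>"
    "<article_link>1aaaaaaa</article_link></article_time>"
    "<article_time>0<article_name>aaa2</article_name>"
    "<article_link>2aaaaaaa</article_link></article_time>"
    "<article_time>1<article_name>aaa3</article_name>"
    "<article_link>3aaaaaaa</article_link></article_time>"
    "</article_date></root>"
)

merged_root = merge_article_times(ET.fromstring(sample))
print(ET.tostring(merged_root, encoding='unicode'))
```

To save the result, ET.ElementTree(merged_root).write(path) should do; if the path starts with "~", expand it with os.path.expanduser first, since ElementTree does not do that for you.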

Upvotes: 2

bgschiller

Reputation: 2127

I'll write as much as I have time (and knowledge) for, but I'm making this a community wiki so other folks can help.

I would suggest using the xml or BeautifulSoup libraries for this. I'll use BeautifulSoup because I can't get xml to work for some reason right now.

First, let's get set up:

>>> import bs4
>>> soup = bs4.BeautifulSoup('''<root>
...     <article_date>09/09/2013
...     <article_time>1
...         <article_name>aaa1</article_name>
...         <article_link>1aaaaaaa</article_link>
...     </article_time>
...     <article_time>0
...         <article_name>aaa2</article_name>
...         <article_link>2aaaaaaa</article_link>
...     </article_time>
...     <article_time>1
...         <article_name>aaa3</article_name>
...         <article_link>3aaaaaaa</article_link>
...     </article_time>
...     <article_time>0
...         <article_name>aaa4</article_name>
...         <article_link>4aaaaaaa</article_link>
...     </article_time>
...     <article_time>1
...         <article_name>aaa5</article_name>
...         <article_link>5aaaaaaa</article_link>
...     </article_time>
... </root>''')

This just produces an internal representation of your xml. We can use the find_all method to grab all the article times.

>>> children = soup.find_all('article_time')
>>> children
[<article_time>1
        <article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>, <article_time>0
        <article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>, <article_time>1
        <article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>, <article_time>0
        <article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>, <article_time>1
        <article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>]

The next thing to do is define a key for how we define 'similar' parent nodes. Let's write a key function that specifies which part of each child to look at. We'll do some poking around to learn about the structure of each child first.

>>> children[0].contents
[u'1\n        ', <article_name>aaa1</article_name>, u'\n', <article_link>1aaaaaaa</article_link>, u'\n']
>>> children[0].contents[0]
u'1\n        '
>>> int(children[0].contents[0])
1
>>> def key(child):
...     return int(child.contents[0])
...
>>> key(children[0])
1
>>> key(children[1])
0

Okay. Now we can take advantage of python's itertools.groupby function, which will group together all the children with the same key (we need to sort first). We will use the newly defined key function to specify how to sort, and what defines a group.

>>> children = sorted(children, key=key)
>>> import itertools
>>> groups = itertools.groupby(children, key)

groups is an iterator: like a list, but we can only iterate through it once. Let's take a look at what it contains, even though looking at the data consumes it and we will have to recreate it later. Luckily, it's pretty easy to recreate.

>>> for k, g in groups:
...     print k, ':\t', list(g)
...
0 : [<article_time>0
        <article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>, <article_time>0
        <article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>]
1 : [<article_time>1
        <article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>, <article_time>1
        <article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>, <article_time>1
        <article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>]

Okay, so k is the key that was used to produce the group, and g iterates over the article_times that matched k.
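
To make the two caveats above concrete (sort first, and only one pass), here is a small sketch that uses plain integers in place of the article_time tags. The values list is just the sequence of keys in document order; nothing here depends on bs4.

```python
import itertools

values = [1, 0, 1, 0, 1]  # the article_time keys in document order

# Without sorting, groupby starts a new group every time the key changes,
# so equal keys that are not adjacent end up in separate groups.
unsorted_groups = [(k, list(g)) for k, g in itertools.groupby(values)]

# After sorting, equal keys are adjacent and each key yields one group.
sorted_groups = [(k, list(g)) for k, g in itertools.groupby(sorted(values))]

# The groupby object can only be consumed once.
groups = itertools.groupby(sorted(values))
first_pass = [k for k, _ in groups]
second_pass = [k for k, _ in groups]  # the iterator is already exhausted

print(unsorted_groups)
print(sorted_groups)
print(first_pass, second_pass)
```

So if you need the groups more than once, turn each g into a list while you loop and keep those lists around.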

Sorry, that's all I have time for at the moment. Hopefully this is enough to get you started.
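
For anyone picking this up from here, one possible way to finish is sketched below. Note that it swaps BeautifulSoup for xml.etree.ElementTree so that it only needs the standard library; the groupby step itself is the same idea as above. The xml_text sample is shortened, and the merged dictionary is just an illustrative way to hold the result, not something from the question.

```python
import itertools
import xml.etree.ElementTree as ET

# Shortened sample input, assumed to have the same shape as the question's file
xml_text = (
    "<root><article_date>09/09/2013"
    "<article_time>1<article_name>aaa1</article_name>"
    "<article_link>1aaaaaaa</article_link></article_time>"
    "<article_time>0<article_name>aaa2</article_name>"
    "<article_link>2aaaaaaa</article_link></article_time>"
    "<article_time>1<article_name>aaa3</article_name>"
    "<article_link>3aaaaaaa</article_link></article_time>"
    "<article_time>0<article_name>aaa4</article_name>"
    "<article_link>4aaaaaaa</article_link></article_time>"
    "</article_date></root>"
)


def key(article_time):
    # The key is the text directly inside <article_time>; int() ignores
    # surrounding whitespace such as the newline in a pretty-printed file.
    return int(article_time.text)


tree = ET.fromstring(xml_text)
# sorted() is stable, so document order is kept within each key
children = sorted(tree.iter('article_time'), key=key)

merged = {}
for k, g in itertools.groupby(children, key):
    members = list(g)  # materialise the group before groupby moves on
    merged[k] = {
        'article_name': '+'.join(m.findtext('article_name') for m in members),
        'article_link': '+'.join(m.findtext('article_link') for m in members),
    }

print(merged)
```

One thing to be aware of: because of the sort, the groups come out ordered by key (0 before 1), which is not the order shown in the question's desired output. If that order matters, a dictionary keyed by first appearance may be the simpler route.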

Upvotes: 1
