Reputation: 899
I have the following xml file:
<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>
</article_date>
</root>
I would like to transform it to the following file:
<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1+aaa3+aaa5</article_name>
<article_link>1aaaaaaa+3aaaaaaa+5aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2+aaa4</article_name>
<article_link>2aaaaaaa+4aaaaaaa</article_link>
</article_time>
</article_date>
</root>
How can I do it in Python?
My approach to this task is the following: 1) loop through the article_time tags 2) build a dictionary whose key is either 0 or 1 and whose value is the list of matching elements 3) for each entry in this dictionary, find all child nodes (article_name and article_link) and append them
Based on that, I wrote the following code (PS: I am currently struggling with adding elements to the dictionary, but I will overcome this issue):
def parse():
    list_of_unique_timestamps=[]
    text_to_merge=""
    tree=et.parse("~/Documents/test1.xml")
    root=tree.getroot()
    for children in root:
        print children.tag, children.text
        for child in children:
            print (child.tag,int(child.text))
            if not child.text in list_of_unique_timestamps:
                list_of_unique_timestamps.append(child.text)
    print list_of_unique_timestamps
Upvotes: 3
Views: 1964
Reputation: 473903
Here's a solution using xml.etree.ElementTree from the Python standard library.
The idea is to gather items into a defaultdict(list) keyed by the article_time text value:
from collections import defaultdict
import xml.etree.ElementTree as ET
data = """<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>
</article_date>
</root>
"""
tree = ET.fromstring(data)
root = ET.Element('root')
article_date = ET.SubElement(root, 'article_date')
article_date.text = tree.find('.//article_date').text
# Group (name, link) pairs by the text of their article_time element.
grouped = defaultdict(list)
for article_time in tree.findall('.//article_time'):
    text = article_time.text.strip()
    name = article_time.find('./article_name').text
    link = article_time.find('./article_link').text
    grouped[text].append((name, link))
# Emit one merged article_time per distinct time value.
for time_value, items in grouped.items():
    article_time = ET.SubElement(article_date, 'article_time')
    article_name = ET.SubElement(article_time, 'article_name')
    article_link = ET.SubElement(article_time, 'article_link')
    article_time.text = time_value
    article_name.text = '+'.join(name for (name, _) in items)
    article_link.text = '+'.join(link for (_, link) in items)
print(ET.tostring(root, encoding='unicode'))
prints (prettified):
<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1+aaa3+aaa5</article_name>
<article_link>1aaaaaaa+3aaaaaaa+5aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2+aaa4</article_name>
<article_link>2aaaaaaa+4aaaaaaa</article_link>
</article_time>
</article_date>
</root>
The result matches the output you were aiming for.
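One caveat: the snippet above pools every article_time in the document under a single article_date, and it parses a string rather than a file. If the real file has several article_date elements, a per-date variant might look like the sketch below. The function names merge_article_times and merge_file are made up for illustration, and the code assumes Python 3 and the same tag names as in the question.

```python
import os
import xml.etree.ElementTree as ET
from collections import defaultdict


def merge_article_times(source_root):
    """Return a new <root> with article_time merged within each article_date."""
    new_root = ET.Element('root')
    for date in source_root.findall('article_date'):
        new_date = ET.SubElement(new_root, 'article_date')
        new_date.text = date.text
        grouped = defaultdict(list)  # time text -> [(name, link), ...]
        for article_time in date.findall('article_time'):
            key = (article_time.text or '').strip()
            name = article_time.findtext('article_name', default='')
            link = article_time.findtext('article_link', default='')
            grouped[key].append((name, link))
        for key, items in grouped.items():
            new_time = ET.SubElement(new_date, 'article_time')
            new_time.text = key
            ET.SubElement(new_time, 'article_name').text = '+'.join(n for n, _ in items)
            ET.SubElement(new_time, 'article_link').text = '+'.join(l for _, l in items)
    return new_root


def merge_file(path):
    # ET.parse does not expand "~" on its own, so a path such as
    # "~/Documents/test1.xml" has to go through expanduser first.
    tree = ET.parse(os.path.expanduser(path))
    return merge_article_times(tree.getroot())
```

Calling merge_file("~/Documents/test1.xml") would then return the merged root element, which can be serialized with ET.tostring as before.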
Upvotes: 2
Reputation: 2127
I'll write as much as I have time (and knowledge), but I'm making this a community wiki so other folks can help.
I would suggest using xml or BeautifulSoup libraries for this. I'll use BeautifulSoup because I can't get xml to work for some reason right now.
First, let's get set up:
>>> import bs4
>>> soup = bs4.BeautifulSoup('''<root>
... <article_date>09/09/2013
... <article_time>1
... <article_name>aaa1</article_name>
... <article_link>1aaaaaaa</article_link>
... </article_time>
... <article_time>0
... <article_name>aaa2</article_name>
... <article_link>2aaaaaaa</article_link>
... </article_time>
... <article_time>1
... <article_name>aaa3</article_name>
... <article_link>3aaaaaaa</article_link>
... </article_time>
... <article_time>0
... <article_name>aaa4</article_name>
... <article_link>4aaaaaaa</article_link>
... </article_time>
... <article_time>1
... <article_name>aaa5</article_name>
... <article_link>5aaaaaaa</article_link>
... </article_time>
... </root>''')
This just produces an internal representation of your XML. We can use the find_all method to grab all the article times.
>>> children = soup.find_all('article_time')
>>> children
[<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>, <article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>, <article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>, <article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>, <article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>]
The next thing to do is decide what makes two parent nodes 'similar'. Let's write a key function that specifies which part of each child to look at. We'll do some poking around to learn about the structure of each child first.
>>> children[0].contents
[u'1\n ', <article_name>aaa1</article_name>, u'\n', <article_link>1aaaaaaa</article_link>, u'\n']
>>> children[0].contents[0]
u'1\n '
>>> int(children[0].contents[0])
1
>>> def key(child):
... return int(child.contents[0])
...
>>> key(children[0])
1
>>> key(children[1])
0
Okay. Now we can take advantage of Python's itertools.groupby function, which will group together all the children with the same key (we need to sort first, because groupby only groups adjacent items). We will use the newly defined key function to specify both how to sort and what defines a group.
>>> children = sorted(children, key=key)
>>> import itertools
>>> groups = itertools.groupby(children, key)
groups is an iterator: like a list, but we can only iterate through it once. Let's take a look at what makes it up, even though that means we will have to recreate it later. (We only get a single pass, so by looking at the data we use it up. Luckily, it's pretty easy to recreate.)
>>> for k, g in groups:
... print k, ':\t', list(g)
...
0 : [<article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>, <article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>]
1 : [<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>, <article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>, <article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>]
Okay, so k specifies what key was used to produce the group, and g is a sequence of the article_time elements that matched k.
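The two groupby properties mentioned above (sort first, single pass) are easy to check on plain integers. This is just a small standalone sketch and has nothing to do with the XML itself; the variable names are arbitrary.

```python
import itertools

values = [1, 0, 1, 0, 1]

# Without sorting, groupby starts a new group every time the key changes.
unsorted_groups = [(k, list(g)) for k, g in itertools.groupby(values)]

# After sorting, equal keys are adjacent, so there is one group per key.
sorted_groups = [(k, list(g)) for k, g in itertools.groupby(sorted(values))]

# The groupby object is an iterator: a second pass finds nothing left.
groups = itertools.groupby(sorted(values))
first_pass = [k for k, _ in groups]
second_pass = [k for k, _ in groups]

print(unsorted_groups)          # [(1, [1]), (0, [0]), (1, [1]), (0, [0]), (1, [1])]
print(sorted_groups)            # [(0, [0, 0]), (1, [1, 1, 1])]
print(first_pass, second_pass)  # [0, 1] []
```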
Sorry, that's all I have time for at the moment. Hopefully this is enough to get you started.
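For anyone picking this up: one possible way to carry the groupby idea through to the merged names and links is sketched below. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs without extra installs, uses Python 3, and works on a shortened copy of the sample data. Treat it as a rough sketch under those assumptions rather than the finished answer.

```python
import itertools
import xml.etree.ElementTree as ET

# Shortened sample with the same tag names as in the question.
xml_text = """<root><article_date>09/09/2013
<article_time>1<article_name>aaa1</article_name><article_link>1aaaaaaa</article_link></article_time>
<article_time>0<article_name>aaa2</article_name><article_link>2aaaaaaa</article_link></article_time>
<article_time>1<article_name>aaa3</article_name><article_link>3aaaaaaa</article_link></article_time>
</article_date></root>"""


def key(child):
    # Same idea as the key function above: the leading text of article_time.
    return int(child.text.strip())


source = ET.fromstring(xml_text)
# sorted() is stable, so document order is kept within each key.
children = sorted(source.iter('article_time'), key=key)

merged = {}  # key -> (joined names, joined links)
for k, group in itertools.groupby(children, key):
    group = list(group)  # materialize once; each group is single pass too
    names = '+'.join(c.findtext('article_name') for c in group)
    links = '+'.join(c.findtext('article_link') for c in group)
    merged[k] = (names, links)

print(merged)
```

Note that the groups come out in sorted key order (0 before 1), which differs from the order in the desired output; building the new tree from merged is then the same SubElement work as in the ElementTree answer.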
Upvotes: 1