Reputation: 362
I have a file that contains text as well as some xml content dumped into it. It looks something like this :
The authentication details : <id>70016683</id><password>password@123</password>
The next step is to send the request.
The request : <request><id>90016133</id><password>password@3212</password></request>
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
I am using a python program to parse this file. I would like to replace the xml part with a place holder : xml_obj. The output should look something like this :
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
At the same time I would also like to extract the replaced xml text and store it in a list. The list should contain None if the line doesn't have an xml object.
xml_tag = re.search(r"<\w*>",line)
if xml_tag:
start_position = xml_tag.start()
xml_word = xml_tag.group()[:1]+'/'+xml_tag.group()[1:]
xml_pattern = r'{}'.format(xml_word)
stop_position = re.search(xml_pattern,line).stop()
But this code retrieves the start and stop positions for only one xml tag and it's content for the first line and the entire format for the last line ( in the input file ). I would like to get all xml content irrespective of the xml structure and also replace it with 'xml_obj'.
Any advice would be helpful. Thanks in advance.
Edit :
I also want to apply the same logic to files that look like this :
The authentication details : ID <id>70016683</id> Password <password>password@123</password> Authentication details complete
The next step is to send the request.
The request : <request><id>90016133</id><password>password@3212</password></request> Request successful
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
The above files may have more than one xml object in a line.
They may also have some plain text after the xml part.
Upvotes: 2
Views: 85
Reputation: 24928
The following is a a little convoluted, but assuming that your actual text is correctly represented by the sample in your question, try this:
txt = """[your sample text above]"""
lines = txt.splitlines()
entries = []
new_txt = ''
for line in lines:
entry = (line.replace(' <',' xxx<',1).split('xxx'))
if len(entry)==2:
entries.append(entry[1])
entry[1]="xml_obj"
line=''.join(entry)
else:
entries.append('none')
new_txt+=line+'\n'
for entry in entries:
print(entry)
print('---')
print(new_txt)
Output:
<id>70016683</id><password>password@123</password>
none
<request><id>90016133</id><password>password@3212</password></request>
<Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
---
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
Upvotes: 1