Reputation: 581
I am trying to compare the text of all instances of a particular tag in two XML files. The OCR engine I am using outputs an XML file with all the OCR characters in an <OCRCharacters>...</OCRCharacters> tag.
I am using Python 2.7.11 and Beautiful Soup 4 (bs4). From the terminal, I call my Python program with two XML file names as arguments.
I want to extract all the strings in the <OCRCharacters> tag for each file, compare them line by line with difflib, and write a new file with the differences.
I use $ python parse_xml_file.py file1.xml file2.xml to call the program from the terminal.
The code below opens each file and prints each string in the <OCRCharacters> tag. How should I convert the objects made with bs4 to strings that I can use with difflib? I am open to better ways (using Python) to do this.
import sys
with open(sys.argv[1], "r") as f1:
    xml_doc_1 = f1.read()
with open(sys.argv[2], "r") as f2:
    xml_doc_2 = f2.read()
from bs4 import BeautifulSoup
soup1 = BeautifulSoup(xml_doc_1, 'xml')
soup2 = BeautifulSoup(xml_doc_2, 'xml')
print("#####################",sys.argv[1],"#####################")
for tag in soup1.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp1 = repr(tag.string)
    print(temp1)
print("#####################",sys.argv[2],"#####################")
for tag in soup2.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp2 = repr(tag.string)
Upvotes: 2
Views: 3142
Reputation: 15461
You can try this:
import sys
import difflib
from bs4 import BeautifulSoup
# One list of <OCRCharacters> strings per input file.
text = [[], []]
for i, arg in enumerate(sys.argv[1:3]):
    # Close each file as soon as it has been parsed.
    with open(arg, "r") as f:
        soup = BeautifulSoup(f.read(), 'xml')
    for tag in soup.find_all('OCRCharacters'):
        # get_text() returns the tag's text as a plain string.
        text[i].append(tag.get_text())
# Compare the tags pairwise, in document order.
# zip() stops at the shorter list, so unmatched trailing tags are ignored.
d = difflib.Differ()
for first_string, second_string in zip(text[0], text[1]):
    diff = d.compare(first_string.splitlines(), second_string.splitlines())
    print('\n'.join(diff))
With xml1.xml:
<node>
    <OCRCharacters>text1_1</OCRCharacters>
    <OCRCharacters>text1_2</OCRCharacters>
    <OCRCharacters>Same Value</OCRCharacters>
</node>
and xml2.xml:
<node>
    <OCRCharacters>text2_1</OCRCharacters>
    <OCRCharacters>text2_2</OCRCharacters>
    <OCRCharacters>Same Value</OCRCharacters>
</node>
The output will be:
- text1_1
?     ^

+ text2_1
?     ^

- text1_2
?     ^

+ text2_2
?     ^

  Same Value
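Since the question also asks for the differences to be written to a new file, here is a minimal sketch of one way to do that. To keep it dependency-free it swaps bs4 for the standard library's xml.etree.ElementTree, and it uses difflib.unified_diff on the two whole lists of strings instead of Differ on each pair, so tags that exist in only one file still show up. The function names (ocr_strings, diff_ocr, write_diff) and the output file name are made up for illustration, and it assumes the <OCRCharacters> elements are not in an XML namespace.

```python
import difflib
import io
import xml.etree.ElementTree as ET


def ocr_strings(xml_text, tag="OCRCharacters"):
    """Return the text of every <tag> element, in document order."""
    root = ET.fromstring(xml_text)
    # itertext() also picks up text nested inside child elements.
    return ["".join(el.itertext()) for el in root.iter(tag)]


def diff_ocr(xml_a, xml_b, name_a="file1.xml", name_b="file2.xml"):
    """Return unified-diff lines between the two lists of OCR strings."""
    return list(difflib.unified_diff(
        ocr_strings(xml_a), ocr_strings(xml_b),
        fromfile=name_a, tofile=name_b, lineterm=""))


def write_diff(path_a, path_b, out_path):
    """Hypothetical helper: diff two XML files and write the result to out_path."""
    # Read as bytes so the XML parser can honour any encoding declaration.
    with io.open(path_a, "rb") as fa, io.open(path_b, "rb") as fb:
        lines = diff_ocr(fa.read(), fb.read(), path_a, path_b)
    with io.open(out_path, "w", encoding="utf-8") as out:
        out.write(u"\n".join(lines) + u"\n")
```

Calling write_diff("file1.xml", "file2.xml", "ocr_diff.txt") would then produce the diff file. If you would rather stay with bs4, the same idea should work by building the two lists with tag.get_text() and passing them to difflib.unified_diff.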
Upvotes: 2