Britt

Reputation: 581

Using diff with beautiful soup objects

I am trying to compare the text of all instances of a particular tag in two XML files. The OCR engine I am using outputs an XML file with all the OCR characters in an <OCRCharacters>...</OCRCharacters> tag.

I am using Python 2.7.11 and Beautiful Soup 4 (bs4). From the terminal, I call my Python program with two XML file names as arguments.

I want to extract all the strings in the <OCRCharacters> tag for each file, compare them line by line with difflib, and write a new file with the differences.

I use $ python parse_xml_file.py file1.xml file2.xml to call the program from the terminal.

The code below opens each file and prints each string in the <OCRCharacters> tag. How should I convert the objects made with bs4 to strings that I can use with difflib? I am open to better ways (using Python) to do this.

import sys

with open(sys.argv[1], "r") as f1:
    xml_doc_1 = f1.read()

with open(sys.argv[2], "r") as f2:
    xml_doc_2 = f2.read()

from bs4 import BeautifulSoup
soup1 = BeautifulSoup(xml_doc_1, 'xml')
soup2 = BeautifulSoup(xml_doc_2, 'xml')

print("#####################",sys.argv[1],"#####################")
for tag in soup1.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp1 = repr(tag.string)
    print(temp1)
print("#####################",sys.argv[2],"#####################")    
for tag in soup2.find_all('OCRCharacters'):
    print(repr(tag.string))
    temp2 = repr(tag.string)

Upvotes: 2

Views: 3142

Answers (1)

SLePort

Reputation: 15461

You can try this:

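import sys
import difflib
from bs4 import BeautifulSoup

# text[0] holds the strings from the first file, text[1] those from the second
text = [[], []]

for i, arg in enumerate(sys.argv[1:3]):
    # Close each file as soon as it has been read
    with open(arg, "r") as f:
        soup = BeautifulSoup(f.read(), 'xml')

    # get_text() returns the tag's contents as a plain string
    for tag in soup.find_all('OCRCharacters'):
        text[i].append(tag.get_text())

d = difflib.Differ()
# Compare the tags pairwise, in document order
for first_string, second_string in zip(text[0], text[1]):
    diff = d.compare(first_string.splitlines(), second_string.splitlines())
    # The parenthesized form works under both Python 2 and Python 3
    print('\n'.join(diff))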

With xml1.xml:

<node>
  <OCRCharacters>text1_1</OCRCharacters>
  <OCRCharacters>text1_2</OCRCharacters>
  <OCRCharacters>Same Value</OCRCharacters>
</node>

and xml2.xml:

<node>
  <OCRCharacters>text2_1</OCRCharacters>
  <OCRCharacters>text2_2</OCRCharacters>
  <OCRCharacters>Same Value</OCRCharacters>
</node>

The output will be:

- text1_1
?     ^

+ text2_1
?     ^

- text1_2
?     ^

+ text2_2
?     ^

  Same Value
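
The question also asks for the differences to be written to a new file and is open to other approaches, so here is a minimal sketch of one. It is not the code above: it swaps Beautiful Soup for the standard library's xml.etree.ElementTree and uses difflib.unified_diff instead of Differ. The function names ocr_strings and write_diff and the output name diff.txt are made up for illustration. It assumes Python 3 and that the XML files declare no namespace (with a namespace, the tag name passed to iter() would need the namespace prefix).

```python
import difflib
import sys
import xml.etree.ElementTree as ET


def ocr_strings(path, tag="OCRCharacters"):
    """Return the text of every <tag> element in the file as plain strings."""
    root = ET.parse(path).getroot()
    # iter() walks the whole tree in document order, root element included.
    return ["".join(el.itertext()) for el in root.iter(tag)]


def write_diff(path1, path2, out_path):
    """Write a unified diff of the two files' OCR strings to out_path."""
    lines1 = ocr_strings(path1)
    lines2 = ocr_strings(path2)
    # lineterm="" because the extracted strings carry no trailing newline.
    diff = difflib.unified_diff(
        lines1, lines2, fromfile=path1, tofile=path2, lineterm=""
    )
    with open(out_path, "w", encoding="utf-8") as out:
        for line in diff:
            out.write(line + "\n")


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: python diff_ocr.py file1.xml file2.xml diff.txt
    write_diff(sys.argv[1], sys.argv[2], sys.argv[3])
```

Here each <OCRCharacters> string is treated as one line of the diff, so a tag that is missing from one file shows up as an added or removed line rather than shifting every later pair, which is what happens with zip.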

Upvotes: 2
