Reputation: 344
I need to find text within an HTML doc. The doc is a generated report, and the text isn't wrapped in any HTML tags of its own. I need to find the text "test". I have tried the code lines at the end of this post without any luck. A sample of the HTML doc is included below. Also, if possible, I would then like to append the name on the same line as "test" to the end of the "NAME3" line, after "BILL". The names on the right are dynamic and change all the time. The labels in the left column are static and don't change. So the final result would be:
<END RESULT>
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>
test......... DOUG
NAME2........... HENRY
NAME3... BILL, DOUG
NAME4...... BOB
test......... ALLAN
NAME2........... MICHAEL
NAME3... MITCHELL, ALLAN
NAME4...... TOM
</pre>
</body>
</html>
<SAMPLE CODE>
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>
test......... DOUG
NAME2........... HENRY
NAME3... BILL
NAME4...... BOB
test......... ALLAN
NAME2........... MICHAEL
NAME3... MITCHELL
NAME4...... TOM
</pre>
</body>
</html>
result = soup.find(text = "test")
result = soup.find(text = 'test')
result = soup.find_all(text = "test")
result = soup.find_all(text = 'test')
Upvotes: 1
Views: 103
Reputation: 24930
If I understand you correctly, you are probably looking for something like this:
from bs4 import BeautifulSoup as bs

namepage = """[your sample code above, fixed - font wasn't closed]"""
soup = bs(namepage, 'lxml')
result = soup.find('font')
names = result.text.strip()
newnames = ''
for name in names.splitlines():
    # Remember the name on the most recent "test" line.
    if "test" in name:
        target = name.split('. ')[1]
    # Append that name to the following "NAME3" line.
    if "NAME3" in name:
        name += ", " + target
    newnames += '\n' + name
# Write the rebuilt block back into the <font> tag.
result.string.replace_with(newnames + '\n')
soup
Output:
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size="-4">
test......... DOUG
NAME2........... HENRY
NAME3... BILL, DOUG
NAME4...... BOB
test......... ALLAN
NAME2........... MICHAEL
NAME3... MITCHELL, ALLAN
NAME4...... TOM
</font>
</pre>
</body>
</html>
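As for why the original soup.find(text = "test") calls came back empty: when you pass a plain string, Beautiful Soup only matches a text node whose entire content equals that string, and here the whole report is one multi-line text node. Passing a compiled regular expression matches nodes that merely contain "test". Below is a rough sketch of that approach, not a definitive solution. The helper name merge_test_into_name3 is made up for illustration, I'm assuming each label starts its line as in your sample, and I use the built-in "html.parser" and the string= argument (the newer name for text=) purely as a convenience.

```python
import re
from bs4 import BeautifulSoup

# Sample report from the question, with the <font> tag closed.
REPORT = """<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>
test......... DOUG
NAME2........... HENRY
NAME3... BILL
NAME4...... BOB
test......... ALLAN
NAME2........... MICHAEL
NAME3... MITCHELL
NAME4...... TOM
</font>
</pre>
</body>
</html>"""


def merge_test_into_name3(block):
    """Hypothetical helper: append each "test" name to the next "NAME3" line."""
    merged = []
    pending = None  # name taken from the most recent "test" line
    for line in block.split("\n"):
        stripped = line.strip()
        if stripped.startswith("test"):
            # Drop the label, then the run of dots and spaces after it.
            pending = stripped[len("test"):].lstrip(". ")
        elif stripped.startswith("NAME3") and pending is not None:
            line = f"{line.rstrip()}, {pending}"
            pending = None
        merged.append(line)
    return "\n".join(merged)


soup = BeautifulSoup(REPORT, "html.parser")

# Exact match: no text node is equal to "test", so this finds nothing.
exact = soup.find(string="test")

# Pattern match: every text node that contains "test" somewhere.
nodes = soup.find_all(string=re.compile("test"))

# Rewrite each matching text node in place.
for node in nodes:
    node.replace_with(merge_test_into_name3(str(node)))

print(soup.find("font").get_text())
```

Searching by pattern also means you don't have to know which tag holds the report, so the same loop should still work if the generator wraps the text differently.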
Upvotes: 1