Reputation: 404
I have a 'cell' variable. Please note it is NOT an .htm or .html file; it is the content of a cell from an .xlsx file. The text in it contains many links (only 2 are shown here as an example), and they all need to be replaced.
There is also a txt file with the original links and their replacements. After parsing the txt file we have 2 lists:
is_list - the links that should be removed
should_be_list - the links that should take the place of the removed ones.
So:
import re
cell = b'<div> <h2>About Us</h2> <div> <img alt="Image title" src="[findr-path]4c8d7faa-73f0-4acd-a8e4-5dc02b5501a3"> </div> <p>A Caring Home Care Services started in 2007 in Southwestern Louisiana. Our Mission is to provide quality Homemodel. </p> <p>Below is a list of services you will provide as a Franchisee </p> <ul> <li>Apartment and Home Cleaning</li> <li>Chef Services</li> <li>Handyman and Remodeling Services</li> <li>In-Home Non-Medical Elderly Care</li> <li>Interior Decorator</li> <li>Lawn Care Services</li> </ul> <div> <img alt="Image title" src="[findr-path]2b408b1a-3ea8-446e-9856-6421d9a3c562"> </div> <p>If you are an Entrepreneur and looking to get in the Home care Industry, then A Caring Home Care today, and we will mail you out our Franchisee Information Booklet. Come join our winning TEAM.</em></p> </div>'
is_list = ['<img alt="Image title" src="[findr-path]4c8d7faa-73f0-4acd-a8e4-5dc02b5501a3">',
'<img alt="Image title" src="[findr-path]2b408b1a-3ea8-446e-9856-6421d9a3c562"> ']
should_be_list = ['<img alt="another title" src="[findr-path]image_1_2.jpg">',
'<img alt="other other title" src="[findr-path]image_2_5.jpg"> ']
If I try to use replace, I get this error:
for i in range(2):
    cell.replace(is_list[i], should_be_list[i])
print (cell)
"""
Traceback (most recent call last):
File "I:\15.py", line 11, in <module>
cell.replace(is_list[i], should_be_list[i])
TypeError: 'str' does not support the buffer interface
"""
If I try to use re.sub, I get this error:
for i in range(2):
    result = re.sub(is_list[i], should_be_list[i], cell)
print (cell)
"""
Traceback (most recent call last):
File "I:\15.py", line 24, in <module>
result = re.sub(is_list[i], should_be_list[i], cell)
File "c:\Python34\lib\re.py", line 179, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "c:\Python34\lib\re.py", line 294, in _compile
p = sre_compile.compile(pattern, flags)
File "c:\Python34\lib\sre_compile.py", line 568, in compile
p = sre_parse.parse(p, flags)
File "c:\Python34\lib\sre_parse.py", line 760, in parse
p = _parse_sub(source, pattern, 0)
File "c:\Python34\lib\sre_parse.py", line 370, in _parse_sub
itemsappend(_parse(source, state))
File "c:\Python34\lib\sre_parse.py", line 516, in _parse
raise error("bad character range")
sre_constants.error: bad character range
"""
Please help. How can I do this replacement?
Upvotes: 0
Views: 751
Reputation: 77357
Encode the search and replacement strings to bytes and use those. Your cell is a bytes object, so the arguments passed to replace must be bytes as well. I'm choosing ascii because I don't know enough about how the original text files and embedded URLs are encoded. There are several ways to deal with URL encodings (the hostname tends to be encoded differently from the path and query), and I think I'll avoid touching that third rail here.
is_list_b = [item.encode('ascii') for item in is_list]
should_be_list_b = [item.encode('ascii') for item in should_be_list]
...
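For completeness, here is a minimal sketch of the whole loop, assuming ascii is good enough for your data. The helper name replace_links and its encoding parameter are made up for illustration, and the cell is shortened to a single link. Note that replace returns a new bytes object instead of changing cell in place, so the result has to be assigned back.

```python
def replace_links(cell, old_links, new_links, encoding="ascii"):
    """Return a copy of `cell` (bytes) with each old link swapped for its replacement.

    `replace_links` and `encoding` are illustrative names, not from the question.
    """
    for old, new in zip(old_links, new_links):
        # bytes.replace only accepts bytes arguments, and it returns a new
        # object rather than modifying `cell`, so keep the returned value.
        cell = cell.replace(old.encode(encoding), new.encode(encoding))
    return cell


# Shortened sample data in the same shape as the question's variables.
cell = b'<div> <img alt="Image title" src="[findr-path]4c8d7faa-73f0-4acd-a8e4-5dc02b5501a3"> </div>'
is_list = ['<img alt="Image title" src="[findr-path]4c8d7faa-73f0-4acd-a8e4-5dc02b5501a3">']
should_be_list = ['<img alt="another title" src="[findr-path]image_1_2.jpg">']

result = replace_links(cell, is_list, should_be_list)
print(result)
```

If you would rather stay with re.sub, the pattern would need to go through re.escape first: the [findr-path] part is parsed as a character class, and the r-p inside it is a reversed range, which is where "bad character range" comes from. For fixed strings like these, plain replace is simpler.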
Upvotes: 1