Reputation: 959
So I'm going through a text and I need to replace a bunch of CIDs (characters that were not readable when I scraped them). I need to replace every "cid:###" with the correct character. The issue that I'm currently running into is that some CIDs are wrapped in <s></s> and there is no space between <s>(cid:131)</s> and the next word.
So, when I use replace, nothing happens when I try to replace <s>(cid:131)</s> with ▪. When I try to replace cid:131 with ▪, I get <s>▪</s>. I'm trying to get rid of the <s></s> for this specific case only (<s></s> is found in other places in the document and I don't want to replace those).
Doesn't change anything:
csv_of_table = csv_of_table.replace('<s>(cid:131)</s>', '▪', regex=True)
Only changes the part with cid:131:
csv_of_table = csv_of_table.replace('cid:131', '▪', regex=True)
Upvotes: 4
Views: 126
Reputation: 2579
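You can use the ? quantifier to mark a group as optional, meaning it can appear 0 or 1 times. Here the <s>( prefix and the )</s> suffix are each optional, so the pattern matches the CID with or without the wrapper. Note that \d+ matches any CID number; put 131 in its place to target just that one.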
csv_of_table = csv_of_table.replace(r"(<s>\()?cid:\d+(\)</s>)?", "▪", regex=True)
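For what it's worth, the first attempt in the question most likely fails because, with regex=True, the unescaped parentheses in '<s>(cid:131)</s>' are read as a capture group rather than literal characters, so the pattern looks for <s>cid:131</s>. Below is a minimal sketch of one way to map each CID to its own character using plain `re`, assuming the CIDs always appear as (cid:N) with parentheses. The names `CID_MAP`, `CID_PATTERN` and `replace_cids` are made up for illustration, and only the 131 → ▪ pair comes from the question.

```python
import re

# Hypothetical mapping from CID number to the intended character.
# Only 131 -> "▪" comes from the question; 127 -> "•" is illustrative.
CID_MAP = {"131": "▪", "127": "•"}

# Try the wrapped form "<s>(cid:N)</s>" first, then a bare "(cid:N)".
# The parentheses around cid:N are escaped so they match literally.
CID_PATTERN = re.compile(r"<s>\(cid:(\d+)\)</s>|\(cid:(\d+)\)")

def replace_cids(text, cid_map=CID_MAP):
    def substitute(match):
        # Exactly one of the two groups is set, depending on the branch.
        number = match.group(1) or match.group(2)
        # Leave unknown CIDs untouched instead of guessing a character.
        return cid_map.get(number, match.group(0))
    return CID_PATTERN.sub(substitute, text)

sample = "<s>(cid:131)</s>First item and <s>struck text</s> with (cid:127) and (cid:999)"
print(replace_cids(sample))
```

Other <s></s> pairs are left alone because the pattern only matches the tags when they directly wrap a CID. If a single CID is all you need, passing the escaped pattern r'<s>\(cid:131\)</s>' to replace with regex=True should work along the same lines.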
Upvotes: 1