Reputation: 59
I have a text string, and have identified a set of words which I want to wrap with []. I have stored these words in an array, and have also stored the index position of their first and last characters in each object as well.
How can I append [] to either side of these words in Python?
Here is an example of a text string I am extracting words from:
"The SARs were leaked to the Buzzfeed website and shared with the International Consortium of Investigative Journalists (ICIJ). Panorama led the research for the BBC as part of a global probe. The ICIJ led the reporting of the Panama Papers and Paradise Papers leaks - secret files detailing the offshore activities of the wealthy and the famous. Fergus Shiel, from the consortium, said the FinCEN Files are an insight into what banks know about the vast flows of dirty money across the globe… [The] system that is meant to regulate the flows of tainted money is broken. The leaked SARs had been submitted to the US Financial Crimes Enforcement Network, or FinCEN between 2000 and 2017 and cover transactions worth about $2 trillion. FinCEN said the leak could impact US national security, risk investigations, and threaten the safety of those who file the reports. But last week it announced proposals to overhaul its anti-money laundering programmes. The UK also unveiled plans to reform its register of company information to clamp down on fraud and money laundering.The investment scam that HSBC was warned about was called WCM777. It led to the death of investor Reynaldo Pacheco, who was found under water on a wine estate in Napa, California, in April 2014. Police say he had been bludgeoned with rocks. He signed up to the scheme and was expected to recruit other investors. The promise was everyone would get rich. A woman Mr Pacheco, 44, introduced lost about $3,000. That led to the killing by men hired to kidnap him. He literally was trying to… make people's lives better, and he himself was scammed, and conned, and he unfortunately paid for it with his life,said Sgt Chris Pacheco (no relation), one of the officers who investigated the killing. Reynaldo, he said, was murdered for being a victim in a Ponzi scheme."
Here is an example of what the array of words I am looking to append square brackets to looks like:
[('Buzzfeed', 28, 36, 'ORG'), ('International Consortium of Investigative Journalists', 61, 118, 'ORG'), ('Panorama', 127, 135, 'ORG'), ('BBC', 161, 164, 'ORG'), ('Panama Papers', 222, 239, 'ORG'), ('Fergus Shiel', 346, 358, 'PERSON'), ('Files', 397, 402, 'PRODUCT'), ('US Financial Crimes Enforcement Network', 608, 651, 'ORG'), ('FinCEN', 733, 739, 'ORG'), ('US', 767, 769, 'GPE'), ('last week', 869, 878, 'DATE'), ('UK', 956, 958, 'GPE'), ('HSBC', 1094, 1098, 'ORG'), ('Reynaldo Pacheco', 1167, 1183, 'PERSON'), ('Napa', 1231, 1235, 'GPE'), ('California', 1237, 1247, 'GPE'), ('April 2014', 1252, 1262, 'DATE'), ('Mr Pacheco', 1431, 1441, 'PERSON'), ('44', 1443, 1445, 'DATE'), ('Sgt Chris Pacheco', 1677, 1694, 'PERSON')]
Upvotes: 0
Views: 3604
Reputation: 18148
If you can make certain assumptions about your data, then here's a very simple version which is probably what my first attempt would look like:
text = "The SARs were leaked..."
keywords_indexed = [('Buzzfeed', 28, 36, 'ORG'), ...]
# Construct a set of keywords that we want to bracket
words_to_bracket = set(k[0] for k in keywords_indexed)
# Replace every instance of a word-to-be-bracketed
bracketed_text = text
for word in words_to_bracket:
    bracketed_text = bracketed_text.replace(word, "[{}]".format(word))
print(bracketed_text)
Pros: It's simple, easy to understand, maintainable.
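Cons: It ignores the stored index positions and brackets every occurrence of each keyword, and when one keyword contains another (for example 'US' and 'US Financial Crimes Enforcement Network') the result depends on the order in which the set is iterated. It's also quite inefficient, but that may not matter unless you're handling very large chunks of text and have to do it fast.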
Only you can decide which tradeoffs to make. Just wanted to offer you a nice, clean version to choose from!
Output of above code on the OP's sample input:
The SARs were leaked to the [Buzzfeed] website and shared with the [International Consortium of Investigative Journalists] (ICIJ). [Panorama] led the research for the [BBC] as part of a global probe. The ICIJ led the reporting of the [Panama Papers] and Paradise Papers leaks - secret files detailing the offshore activities of the wealthy and the famous. [Fergus Shiel], from the consortium, said the [FinCEN] [Files] are an insight into what banks know about the vast flows of dirty money across the globe… [The] system that is meant to regulate the flows of tainted money is broken. The leaked SARs had been submitted to the [US] Financial Crimes Enforcement Network, or [FinCEN] between 2000 and 2017 and cover transactions worth about $2 trillion. [FinCEN] said the leak could impact [US] national security, risk investigations, and threaten the safety of those who file the reports. But [last week] it announced proposals to overhaul its anti-money laundering programmes. The [UK] also unveiled plans to reform its register of company information to clamp down on fraud and money laundering.The investment scam that [HSBC] was warned about was called WCM777. It led to the death of investor [Reynaldo Pacheco], who was found under water on a wine estate in [Napa], [California], in [April 2014]. Police say he had been bludgeoned with rocks. He signed up to the scheme and was expected to recruit other investors. The promise was everyone would get rich. A woman [Mr Pacheco], [44], introduced lost about $3,000. That led to the killing by men hired to kidnap him. He literally was trying to… make people's lives better, and he himself was scammed, and conned, and he unfortunately paid for it with his life,said [Sgt Chris Pacheco] (no relation), one of the officers who investigated the killing. Reynaldo, he said, was murdered for being a victim in a Ponzi scheme.
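If you want to keep the replace-by-name approach but not depend on the order in which keywords are processed, one option is a single regular-expression pass. This is only a sketch: the function name bracket_keywords and the toy sample are made up for illustration, and it still brackets every occurrence and does no word-boundary checking.

```python
import re


def bracket_keywords(text, keywords):
    """Wrap every occurrence of each keyword in [] in a single pass."""
    # Try longer keywords first so 'US Treasury' wins over 'US'.
    ordered = sorted(set(keywords), key=len, reverse=True)
    if not ordered:
        # An empty pattern would match at every position, so bail out early.
        return text
    # re.escape keeps characters such as '.' or '(' from acting as regex syntax.
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    return pattern.sub(lambda m: "[" + m.group(0) + "]", text)


# Toy input (hypothetical, not the OP's text)
sample = "The US Treasury and the US met."
print(bracket_keywords(sample, ["US", "US Treasury"]))
# The [US Treasury] and the [US] met.
```

Because the text is scanned once, a keyword that has already been bracketed is never matched again by a shorter keyword inside it.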
Upvotes: 0
Reputation: 10624
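The following should work, where l is your original list and t is your text:
# Convert the tuples to lists so the indices can be updated in place
l = [list(i) for i in l]
# Assumes l is ordered by start index, as in the question
for i in range(len(l)):
    x1, x2 = l[i][1], l[i][2]
    t = t[:x1] + '[' + t[x1:x2] + ']' + t[x2:]
    # Two characters were inserted, so shift every later entry by 2
    for k in range(i+1, len(l)):
        l[k][1] += 2
        l[k][2] += 2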
This gives the following output:
"The SARs were leaked to the [Buzzfeed] website and shared with [the International Consortium of Investigative Journalists] (ICIJ). [Panorama] led the research for the [BBC] as part of a global probe. The ICIJ led the reporting of [the Panama Papers] and Paradise Papers leaks - secret files detailing the offshore activities of the wealthy and the famous. [Fergus Shiel], from the consortium, said the FinCEN [Files] are an insight into what banks know about the vast flows of dirty money across the globe… [The] system that is meant to regulate the flows of tainted money is broken. The leaked SARs had been submitted to [the US Financial Crimes Enforcement Network], or FinCEN between 2000 and 2017 and cover transactions worth about $2 trillion. [FinCEN] said the leak could impact [US] national security, risk investigations, and threaten the safety of those who file the reports. But [last week] it announced proposals to overhaul its anti-money laundering programmes. The [UK] also unveiled plans to reform its register of company information to clamp down on fraud and money laundering.The investment scam that [HSBC] was warned about was called WCM777. It led to the death of investor [Reynaldo Pacheco], who was found under water on a wine estate in [Napa], [California], in [April 2014]. Police say he had been bludgeoned with rocks. He signed up to the scheme and was expected to recruit other investors. The promise was everyone would get rich. A woman [Mr Pacheco], [44], introduced lost about $3,000. That led to the killing by men hired to kidnap him. He literally was trying to… make people's lives better, and he himself was scammed, and conned, and he unfortunately paid for it with his life,said [Sgt Chris Pacheco] (no relation), one of the officers who investigated the killing. Reynaldo, he said, was murdered for being a victim in a Ponzi scheme."
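The same left-to-right idea can be written without rewriting every later entry on each step, by keeping a running count of how many characters have been inserted so far. A minimal sketch, assuming the spans do not overlap; the function name bracket_spans_with_offset and the toy sample are hypothetical:

```python
def bracket_spans_with_offset(text, spans):
    """Insert [] around each (word, start, end, label) span, left to right."""
    shift = 0  # number of characters inserted so far
    for _word, start, end, _label in sorted(spans, key=lambda s: s[1]):
        # Move the original indices by the brackets already inserted.
        start += shift
        end += shift
        text = text[:start] + "[" + text[start:end] + "]" + text[end:]
        shift += 2  # one '[' and one ']' were just added
    return text


# Toy input (hypothetical, not the OP's text)
sample = "Alice met Bob in Paris."
spans = [("Alice", 0, 5, "PERSON"), ("Bob", 10, 13, "PERSON"), ("Paris", 17, 22, "GPE")]
print(bracket_spans_with_offset(sample, spans))
# [Alice] met [Bob] in [Paris].
```

The original list is left untouched, and sorting by start index means the input order no longer matters.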
Upvotes: 0
Reputation: 147166
If you sort your list of phrases (I've called it words) in reverse order of start index, you can insert the [ and ] around each phrase in a loop. You need to work backwards because each insertion shifts the indexes of all subsequent characters in the string:
for w in sorted(words, key=lambda x:-x[1]):
    text = text[:w[1]] + '[' + text[w[1]:w[2]] + ']' + text[w[2]:]
print(text)
Output:
The SARs were leaked to the [Buzzfeed] website and shared with [the International Consortium of Investigative Journalists] (ICIJ). [Panorama] led the research for the [BBC] as part of a global probe. The ICIJ led the reporting of [the Panama Papers] and Paradise Papers leaks - secret files detailing the offshore activities of the wealthy and the famous. [Fergus Shiel], from the consortium, said the FinCEN [Files] are an insight into what banks know about the vast flows of dirty money across the globe… [The] system that is meant to regulate the flows of tainted money is broken. The leaked SARs had been submitted to [the US Financial Crimes Enforcement Network], or FinCEN between 2000 and 2017 and cover transactions worth about $2 trillion. [FinCEN] said the leak could impact [US] national security, risk investigations, and threaten the safety of those who file the reports. But [last week] it announced proposals to overhaul its anti-money laundering programmes. The [UK] also unveiled plans to reform its register of company information to clamp down on fraud and money laundering.The investment scam that [HSBC] was warned about was called WCM777. It led to the death of investor [Reynaldo Pacheco], who was found under water on a wine estate in [Napa], [California], in [April 2014]. Police say he had been bludgeoned with rocks. He signed up to the scheme and was expected to recruit other investors. The promise was everyone would get rich. A woman [Mr Pacheco], [44], introduced lost about $3,000. That led to the killing by men hired to kidnap him. He literally was trying to… make people's lives better, and he himself was scammed, and conned, and he unfortunately paid for it with his life,said [Sgt Chris Pacheco] (no relation), one of the officers who investigated the killing. Reynaldo, he said, was murdered for being a victim in a Ponzi scheme.
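For long texts with many phrases, a variant that avoids rebuilding the whole string on every insertion is to collect the pieces in one forward pass and join them at the end. This is a sketch under the assumption that spans do not overlap (it raises an error if they do); the function name bracket_spans and the toy sample are made up for illustration:

```python
def bracket_spans(text, spans):
    """Return text with [] around each (word, start, end, label) span."""
    pieces = []
    cursor = 0  # index of the first character not yet copied
    for _word, start, end, _label in sorted(spans, key=lambda s: s[1]):
        if start < cursor:
            raise ValueError("overlapping span at index {}".format(start))
        pieces.append(text[cursor:start])            # untouched text before the span
        pieces.append("[" + text[start:end] + "]")   # the bracketed span
        cursor = end
    pieces.append(text[cursor:])  # whatever follows the last span
    return "".join(pieces)


# Toy input (hypothetical, not the OP's text)
sample = "Alice met Bob in Paris."
spans = [("Alice", 0, 5, "PERSON"), ("Bob", 10, 13, "PERSON"), ("Paris", 17, 22, "GPE")]
print(bracket_spans(sample, spans))
# [Alice] met [Bob] in [Paris].
```

All slices are taken from the original string, so no index ever needs adjusting.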
Upvotes: 2