Reputation: 11218
Is it possible to do this example using List Comprehensions:
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
'consectetur adipiscing elit',
'test Nulla lectus ligula',
'imperdiet at porttitor quis',
'smth commodo eget tortor',
'Orci varius natoque penatibus et magnis dis parturient montes']
for s in a:
    b = [el.replace(s, '') for el in b]
What I want is to delete specific words from a list of sentences. I can do it with a loop, but I suppose there is a one-line solution.
I tried something like:
b = [[el.replace(s,'') for el in b] for s in a ]
but it goes wrong
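To see why the nested version goes wrong: it builds one full copy of b per word in a, each with only that word removed, instead of applying the removals one after another. A minimal sketch of the difference, using a shortened sample list (the names nested and looped are just for illustration):

```python
a = ['test', 'smth']
b = ['test Lorem ipsum', 'smth commodo eget', 'test smth both']

# Nested comprehension: one inner list per word in `a`,
# each inner list removes only that single word.
nested = [[el.replace(s, '') for el in b] for s in a]

# Loop: every pass works on the result of the previous pass,
# so the removals accumulate.
looped = b
for s in a:
    looped = [el.replace(s, '') for el in looped]

print(len(nested))  # one inner list per word in `a`
print(nested[0])    # only 'test' removed, 'smth' still present
print(looped)       # both words removed
```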
I got a lot of quality answers, but now I have one more complication: what if I want to remove combinations of words?
a = ['test', 'smth commodo']
Thank you for all the answers! I ran a speed test on all the solutions and here is the result. Each figure is the mean of 100 runs (except the last case, which takes too long to repeat).
Solution   | b=10 a=2  | b=9000 a=2 | b=9000 a=100 | b=45k a=500
-----------+-----------+------------+--------------+------------
COLDSPEED  | 0.0000206 | 0.0311071  | 0.0943433    | 4.5012770
Jean Fabre | 0.0000871 | 0.1722340  | 0.2635452    | 5.2981001
Jpp        | 0.0000212 | 0.0474531  | 0.0464369    | 0.2450547
Ajax       | 0.0000334 | 0.0303891  | 0.5262040    | 11.6994496
Daniel     | 0.0000167 | 0.0162156  | 0.1301132    | 6.9071504
Kasramvd   | 0.0000120 | 0.0084146  | 0.1704623    | 7.5648351
We can see that Jpp's solution is the fastest on the larger inputs, BUT we can't use it: it's the only one that doesn't work on combinations of words (I've already written to him and hope he will improve his answer!). So it looks like @cᴏʟᴅsᴘᴇᴇᴅ's solution is the fastest on the big data sets.
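For anyone who wants to repeat this kind of measurement, here is a rough sketch of taking the mean of 100 runs with timeit. The sample data and the loop_solution wrapper are assumptions made for illustration, not the exact setup used for the figures above:

```python
import timeit

a = ['test', 'smth']
# Hypothetical sample data; scale the multiplier to change len(b).
b = ['test Lorem ipsum dolor sit amet',
     'smth commodo eget tortor'] * 5

def loop_solution(words, sentences):
    """The plain loop from the question, wrapped so it can be timed."""
    for s in words:
        sentences = [el.replace(s, '') for el in sentences]
    return sentences

runs = 100
# timeit returns the total time for `runs` calls, so divide for the mean.
total = timeit.timeit(lambda: loop_solution(a, b), number=runs)
mean_seconds = total / runs
print('{:.7f}'.format(mean_seconds))
```

Wrapping each candidate in a function with the same signature makes it easy to time them all against the same data.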
Upvotes: 7
Views: 3412
Reputation: 71451
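Another possibility is to join all the word combinations and then replace \s with | to build the pattern for re.sub. Note that this turns every space into an alternation, so 'smth commodo' becomes the two separate alternatives smth and commodo: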
import re
b = ['test Lorem ipsum dolor sit amet',
'consectetur adipiscing elit',
'test Nulla lectus ligula',
'imperdiet at porttitor quis',
'smth commodo eget tortor',
'Orci varius natoque penatibus et magnis dis parturient montes']
a = ['test', 'smth commodo']
replaced_strings = [re.sub(re.sub(r'\s', '|', ' '.join(a)), '', i) for i in b]
Output:
[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', '  eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']
To remove additional whitespace, apply an additional pass:
new_data = [re.sub(r'^\s+', '', i) for i in replaced_strings]
Output:
['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', 'Nulla lectus ligula', 'imperdiet at porttitor quis', 'eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']
Upvotes: 1
Reputation: 140178
If the list of words to remove is big, matching against an ORed regular expression (like "\btest\b|\bsmth\b") can be slow (O(n)): the regex engine tests the first word, then the second, and so on.
I suggest you use a replacement function with a set for word lookup. Return the word itself if it is not found, else return an empty string to remove the word:
a = {'test', 'smth'}
b = ['test Lorem ipsum dolor sit amet',
'consectetur adipiscing elit',
'test Nulla lectus ligula',
'imperdiet at porttitor quis',
'smth commodo eget tortor',
'Orci varius natoque penatibus et magnis dis parturient montes']
import re
result = [re.sub(r"\b(\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in b]
print(result)
[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', ' commodo eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']
Now if your list of "words" to replace contains strings made of 2 words, this method doesn't work, because \w doesn't match spaces. A second pass could be done with the set of 2-word entries:
a = {'lectus ligula', 'porttitor quis'}
and injecting the result into a similar filter, but with an explicit 2-word match:
result = [re.sub(r"\b(\w+ ?\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in result]
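So 2 passes, but if the list of words is huge, it's still faster than an exhaustive regex. Be aware that matches don't overlap: the pattern consumes words pairwise from left to right, so a 2-word entry is only removed when it lines up with that pairing. In ' Nulla lectus ligula', for example, the regex matches 'Nulla lectus' first, so 'lectus ligula' is never tested.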
Upvotes: 3
Reputation: 111
You may be looking for this:
[el.replace(a[0],'').replace(a[1],'') for el in b]
And if you want to remove spaces as well then use strip()
[el.replace(a[0],'').replace(a[1],'').strip() for el in b]
Hope this helps...
Upvotes: 0
Reputation: 57
You could use map and a regular expression.
import re
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
'consectetur adipiscing elit',
'test Nulla lectus ligula',
'imperdiet at porttitor quis',
'smth commodo eget tortor',
'Orci varius natoque penatibus et magnis dis parturient montes']
pattern=r'('+r'|'.join(a)+r')'
b=list(map(lambda x: re.sub(pattern,r'',x).strip(),b))
Upvotes: 1
Reputation: 107287
A pure functional approach (mostly for educational purposes) is to use partial and reduce from the functools module, along with map, to apply the replacer function to your list of strings.
In [47]: from functools import partial, reduce
In [48]: f = partial(reduce, lambda x, y: x.replace(y + ' ', ''), a)
In [49]: list(map(f, b))
Out[49]:
['Lorem ipsum dolor sit amet',
'consectetur adipiscing elit',
'Nulla lectus ligula',
'imperdiet at porttitor quis',
'commodo eget tortor',
'Orci varius natoque penatibus et magnis dis parturient montes']
Also, if the number of items in a is not very large, there's nothing wrong with repeating replace() multiple times. In this case, a fast and straightforward way is to chain two replace calls as follows:
In [54]: [line.replace(a[0] + ' ', '').replace(a[1] + ' ', '') for line in b]
Out[54]:
['Lorem ipsum dolor sit amet',
'consectetur adipiscing elit',
'Nulla lectus ligula',
'imperdiet at porttitor quis',
'commodo eget tortor',
'Orci varius natoque penatibus et magnis dis parturient montes']
Upvotes: 1
Reputation: 402493
There's nothing wrong with what you have, but if you want to clean things up a bit and performance isn't important, then compile a regex pattern and call sub inside a loop.
>>> import re
>>> p = re.compile(r'\b({})\b'.format('|'.join(a)))
>>> [p.sub('', text).strip() for text in b]
['Lorem ipsum dolor sit amet',
'consectetur adipiscing elit',
'Nulla lectus ligula',
'imperdiet at porttitor quis',
'commodo eget tortor',
'Orci varius natoque penatibus et magnis dis parturient montes'
]
Details
Your pattern will look something like this:
\b # word-boundary - remove if you also want to replace substrings
(
test # word 1
| # regex OR pipe
smth # word 2 ... you get the picture
)
\b # end with another word boundary - again, remove for substr replacement
And this is the compiled regex pattern matcher:
>>> p
re.compile(r'\b(test|smth)\b', re.UNICODE)
Another consideration is whether the words you are removing contain characters that the regex engine would interpret specially rather than treat as literals. These are regex metacharacters, and you can escape them while building your pattern using re.escape.
p = re.compile(r'\b({})\b'.format(
'|'.join([re.escape(word) for word in a]))
)
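For the follow-up about combinations of words, the same escaped pattern should cope with multi-word entries, since a space inside an alternative is matched literally. A minimal sketch, assuming a = ['test', 'smth commodo'] from the question; sorting longest-first is an extra precaution added here (not part of the original answer) so that a longer phrase is tried before any shorter entry it begins with:

```python
import re

a = ['test', 'smth commodo']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']

# Longest entries first, so a phrase wins over a shorter entry it starts with.
words = sorted(a, key=len, reverse=True)

# Escape each entry, then OR them together between word boundaries.
p = re.compile(r'\b(?:{})\b'.format('|'.join(re.escape(w) for w in words)))

result = [p.sub('', text).strip() for text in b]
print(result)
```

Here 'smth commodo eget tortor' becomes 'eget tortor', while a line containing only 'smth' or only 'commodo' would be left alone.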
Of course, keep in mind that with larger data and more replacements, regex and string replacements both become slow. Consider using something better suited to large operations, like flashtext.
Upvotes: 5
Reputation: 164673
This is an alternative way using set, str.join, str.split and str.strip.
a_set = set(a)
# Build a flat list of strings: blank out matching words, then re-join.
b = [' '.join([word if word not in a_set else ''
               for word in item.split()]).strip()
     for item in b]
# ['Lorem ipsum dolor sit amet',
#  'consectetur adipiscing elit',
#  'Nulla lectus ligula',
#  'imperdiet at porttitor quis',
#  'commodo eget tortor',
#  'Orci varius natoque penatibus et magnis dis parturient montes']
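Since this approach looks at one word at a time, an entry like 'smth commodo' never matches. One possible extension, sketched here under the assumption that entries are compared as whitespace-separated tokens (remove_phrases is a hypothetical helper, not part of the answer above), is to store each entry as a tuple of words and try the longest window first at every position:

```python
def remove_phrases(sentence, phrases):
    """Drop every word or multi-word phrase in `phrases` from `sentence`."""
    # Each phrase becomes a tuple of its words, e.g. ('smth', 'commodo').
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=0)

    words = sentence.split()
    kept = []
    i = 0
    while i < len(words):
        # Try the longest possible window first, then shorter ones.
        for n in range(max_len, 0, -1):
            if tuple(words[i:i + n]) in phrase_set:
                i += n          # skip the whole matched phrase
                break
        else:
            kept.append(words[i])   # no phrase starts here, keep the word
            i += 1
    return ' '.join(kept)


a = ['test', 'smth commodo']
b = ['test Lorem ipsum dolor sit amet',
     'smth commodo eget tortor',
     'test Nulla lectus ligula']
result = [remove_phrases(item, a) for item in b]
print(result)
```

Lookups stay set-based, so the cost per position depends on the longest phrase length rather than on the number of entries.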
Upvotes: 2