Mikhail_Sam

Reputation: 11218

Replace strings using List Comprehensions

Is it possible to write this example as a single list comprehension?

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor', 
     'Orci varius natoque penatibus et magnis dis parturient montes']


for s in a:
    b = [el.replace(s,'') for el in b]

What I want is to delete specific words from a list of sentences. I can do it with a loop, but I suppose a one-line solution is possible.

I tried something like:

b = [[el.replace(s,'') for el in b] for s in a ]

but it gives the wrong result.
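For readers wondering why the nested attempt fails, here is a small sketch with a shortened `b` (the shortened sentences are my own, chosen only to keep the output readable). The outer comprehension starts from the original `b` for every word, so it builds one partially cleaned list per word instead of feeding each result into the next replacement:

```python
a = ['test', 'smth']
b = ['test Lorem ipsum', 'smth commodo eget', 'consectetur elit']

# Nested comprehension: one inner list per word in a, each built from the ORIGINAL b.
nested = [[el.replace(s, '') for el in b] for s in a]
# nested[0] has only 'test' removed, nested[1] has only 'smth' removed.

# The loop version rebinds the list, so every word is removed from the same list.
looped = b
for s in a:
    looped = [el.replace(s, '') for el in looped]

print(nested)
print(looped)
```

So the loop carries state between iterations, which a plain nested comprehension cannot do without a helper such as `functools.reduce` or a combined regex.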


I got a lot of quality answers, but now I have one more complication: what if I want to remove a combination of words?

a = ['test', 'smth commodo']

Thank you for all the answers! I ran a speed test on all the solutions, and here is the result. Each figure is the mean of 100 runs (except the last one, which takes too long to wait for).

                      b=10 a=2   |  b=9000 a=2 | b=9000 a=100 | b=45k a=500
---------------------------------+-------------+--------------+---------------
COLDSPEED solution:   0.0000206  |  0.0311071  |  0.0943433   |  4.5012770
Jean Fabre solution:  0.0000871  |  0.1722340  |  0.2635452   |  5.2981001
Jpp solution:         0.0000212  |  0.0474531  |  0.0464369   |  0.2450547
Ajax solution:        0.0000334  |  0.0303891  |  0.5262040   | 11.6994496
Daniel solution:      0.0000167  |  0.0162156  |  0.1301132   |  6.9071504
Kasramvd solution:    0.0000120  |  0.0084146  |  0.1704623   |  7.5648351

We can see that jpp's solution is the fastest, BUT we can't use it: it is the only one that doesn't work on combinations of words (I have already written to him and hope he will improve his answer!). So it looks like @cᴏʟᴅsᴘᴇᴇᴅ's solution is the fastest on big data sets.
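For anyone who wants to repeat this kind of comparison, here is a minimal sketch of a timing harness built on `timeit`. It is not the script used for the table above; the helper names (`loop_replace`, `regex_replace`, `mean_time`) and the repetition factor for `b` are assumptions made for illustration.

```python
import re
import timeit

a = ['test', 'smth']
# Repeat the sample sentences to get a larger list (the factor 100 is arbitrary).
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'smth commodo eget tortor'] * 100

def loop_replace(words, sentences):
    # The original loop from the question.
    for w in words:
        sentences = [s.replace(w, '') for s in sentences]
    return sentences

def regex_replace(words, sentences):
    # One compiled alternation, applied once per sentence.
    p = re.compile('|'.join(map(re.escape, words)))
    return [p.sub('', s) for s in sentences]

def mean_time(func, runs=100):
    # Mean wall-clock seconds per call over `runs` calls.
    return timeit.timeit(lambda: func(a, b), number=runs) / runs

for func in (loop_replace, regex_replace):
    print(func.__name__, mean_time(func))
```

The absolute numbers depend on the machine and Python version, so only the relative ordering is worth comparing.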

Upvotes: 7

Views: 3412

Answers (7)

Ajax1234

Reputation: 71451

Another possibility is to join all the words with spaces and then replace each whitespace character (\s) with | to build an alternation pattern for re.sub:

import re
b = ['test Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'test Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'smth commodo eget tortor', 
 'Orci varius natoque penatibus et magnis dis parturient montes']
a = ['test', 'smth commodo']
replaced_strings = [re.sub(re.sub(r'\s', '|', ' '.join(a)), '', i) for i in b]

Output:

[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', '  eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']

To remove the leftover leading whitespace, apply a second pass:

new_data = [re.sub(r'^\s+', '', i) for i in replaced_strings]

Output:

['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', 'Nulla lectus ligula', 'imperdiet at porttitor quis', 'eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']
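One thing worth checking with this approach: replacing whitespace with | also splits multi-word entries, so 'smth commodo' is treated as the two separate words 'smth' and 'commodo'. The sketch below compares that pattern with one built by joining the entries directly; the names `split_pattern` and `phrase_pattern` and the test sentence are assumptions for illustration.

```python
import re

a = ['test', 'smth commodo']

# Pattern as built in the answer: every space becomes an alternation bar.
split_pattern = re.sub(r'\s', '|', ' '.join(a))
print(split_pattern)  # test|smth|commodo

# Alternative: keep each entry whole and escape it.
phrase_pattern = '|'.join(re.escape(p) for p in a)

text = 'imperdiet commodo quis'
removed_alone = re.sub(split_pattern, '', text)   # 'commodo' is removed on its own
kept_alone = re.sub(phrase_pattern, '', text)     # unchanged: 'smth commodo' is absent
print(repr(removed_alone), repr(kept_alone))
```

Which behaviour is right depends on whether 'commodo' should be removed when it appears without 'smth'.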

Upvotes: 1

Jean-François Fabre

Reputation: 140178

If the list of words to remove is big, an ORed regular expression (like "\btest\b|\bsmth\b") can be slow: the regex engine tests the first word, then the second, and so on, so each match attempt is O(n) in the number of words.

I suggest you use a replacement function using a set for word lookup. Return the word itself if not found, else return nothing to remove the word:

a = {'test', 'smth'}
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']

import re

result = [re.sub(r"\b(\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in b]

print(result)

[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', ' commodo eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']

Now if your list of "words" to replace contains strings composed of 2 words, this method doesn't work, because \w doesn't match spaces. A second pass could be done on the list of "words" made of 2 words:

a = {'lectus ligula', 'porttitor quis'}

and feeding the previous result into a similar filter, but with an explicit 2-word match:

result = [re.sub(r"\b(\w+ ?\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in result]

That makes 2 passes, but if the list of words is huge, it's still faster than an exhaustive regex.
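A caveat worth testing before relying on the 2-word pass: re.sub consumes the text left to right in non-overlapping matches, so the words are effectively grouped into pairs from the start of the string. A phrase is only found when it lines up with one of those pairs. The sketch below illustrates this; the helper name `drop_pairs` and the sample strings are assumptions for illustration.

```python
import re

phrases = {'lectus ligula'}

def drop_pairs(text):
    # Same 2-word filter as above, wrapped in a function.
    return re.sub(r"\b(\w+ ?\w+)\b",
                  lambda m: "" if m.group(1) in phrases else m.group(1),
                  text)

# Pairs are 'test Nulla' and 'lectus ligula': the phrase is found and removed.
aligned = drop_pairs('test Nulla lectus ligula')

# Pairs are 'Nulla lectus' and then 'ligula' alone: the phrase is never seen.
shifted = drop_pairs('Nulla lectus ligula')

print(repr(aligned), repr(shifted))
```

If phrases can start at any word, an alternation built from the escaped phrases is probably the safer choice, at the cost of the lookup speed discussed above.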

Upvotes: 3

Abdul Quddus

Reputation: 111

You may be looking for this:

[el.replace(a[0],'').replace(a[1],'') for el in b]

And if you want to remove spaces as well then use strip()

[el.replace(a[0],'').replace(a[1],'').strip() for el in b]

Hope this helps...

Upvotes: 0

Daniel

Reputation: 57

You could use map and a regular expression.

import re
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor', 
     'Orci varius natoque penatibus et magnis dis parturient montes']

pattern=r'('+r'|'.join(a)+r')'
b=list(map(lambda x: re.sub(pattern,r'',x).strip(),b))

Upvotes: 1

Kasravnd

Reputation: 107287

A purely functional approach (mostly for education's sake) is to use the partial and reduce functions from the functools module, along with map, to apply the replacer function to your list of strings.

In [47]: from functools import partial, reduce

In [48]: f = partial(reduce, lambda x, y: x.replace(y + ' ', ''), a)

In [49]: list(map(f, b))
Out[49]: 
['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes']

Also, if the number of items in a is not very large, there's nothing wrong with repeating replace() multiple times. In this case, a fast and straightforward way is to chain two replace calls as follows:

In [54]: [line.replace(a[0] + ' ', '').replace(a[1] + ' ', '') for line in b]
Out[54]: 
['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes']
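For reference, here is the partial/reduce version as a standalone script with its imports. The name `strip_words` is my own. Calling `strip_words(sentence)` is equivalent to `reduce(func, a, sentence)`, so the sentence is the initial value and each word in `a` is applied to it in turn.

```python
from functools import partial, reduce

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'smth commodo eget tortor']

# partial(reduce, func, a)(sentence) == reduce(func, a, sentence)
strip_words = partial(reduce, lambda text, word: text.replace(word + ' ', ''), a)

result = list(map(strip_words, b))
print(result)

# Because the replaced string is word + ' ', a word at the very end of a
# sentence (with no trailing space) is left in place.
tail = strip_words('run the test')
print(tail)
```

If trailing occurrences matter, replacing the bare word and calling strip() afterwards, as in other answers, may be a better fit.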

Upvotes: 1

cs95

Reputation: 402493

There's nothing wrong with what you have, but if you want to clean things up a bit and performance isn't important, then compile a regex pattern and call sub inside a loop.

>>> import re
>>> p = re.compile(r'\b({})\b'.format('|'.join(a)))
>>> [p.sub('', text).strip() for text in b]

['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes'
]

Details
Your pattern will look something like this:

\b    # word-boundary - remove if you also want to replace substrings
(
test  # word 1
|     # regex OR pipe
smth  # word 2 ... you get the picture
)
\b    # end with another word boundary - again, remove for substr replacement

And this is the compiled regex pattern matcher:

>>> p
re.compile(r'\b(test|smth)\b', re.UNICODE)

Another consideration is whether your replacement strings themselves contain regex metacharacters, i.e. characters that the regex engine would interpret specially rather than treat as literals. You can escape them while building your pattern using re.escape.

p = re.compile(r'\b({})\b'.format(
    '|'.join([re.escape(word) for word in a]))
)
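As a sketch of how this extends to multi-word entries: escaping each entry keeps characters like '.' literal, and sorting the entries longest first makes 'smth commodo' win over 'smth' inside the alternation. The helper name `build_pattern` and the sample words and sentences are assumptions for illustration, not part of the question.

```python
import re

def build_pattern(phrases):
    # Longest entries first, so a phrase is tried before its own prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    return re.compile(r'\b(?:{})\b'.format('|'.join(re.escape(p) for p in ordered)))

a = ['smth', 'smth commodo', 'v1.0']
p = build_pattern(a)

texts = ['smth commodo eget tortor', 'release v1.0 notes', 'release v1x0 notes']
cleaned = [p.sub('', t).strip() for t in texts]
print(cleaned)
```

Note that \b only behaves as expected when each entry starts and ends with a word character; entries such as 'c++' would need a different boundary check.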

Of course, keep in mind that with larger data and more replacements, regex and string replacements both become slow. Consider using something more suited to large operations, like flashtext.

Upvotes: 5

jpp

Reputation: 164673

This is an alternative way using set, str.join, str.split and str.strip.

a_set = set(a)

b = [[' '.join([word if word not in a_set else ''
                for word in item.split()]).strip()]
     for item in b]

# [['Lorem ipsum dolor sit amet'],
#  ['consectetur adipiscing elit'],
#  ['Nulla lectus ligula'],
#  ['imperdiet at porttitor quis'],
#  ['commodo eget tortor'],
#  ['Orci varius natoque penatibus et magnis dis parturient montes']]
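If a flat list of strings is wanted rather than a list of one-element lists, a small variant is to drop the extra brackets and filter the unwanted words out instead of replacing them with ''. This is a sketch of that variant, not the answer's original code; the name `flat` is my own.

```python
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'smth commodo eget tortor']

a_set = set(a)

# Keep only the words that are not in the set, then rejoin with single spaces.
flat = [' '.join(word for word in item.split() if word not in a_set)
        for item in b]
print(flat)
```

Because the sentence is split on whitespace, each token is a single word, so a two-word entry such as 'smth commodo' can never match with this approach.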

Upvotes: 2
