Tirbo06

Reputation: 773

Return list of strings if any substring is present in another list

Let's say you have company information like this:

companies = [
    ['zmpEVqsbCUO1aXStxHkSVA', 'palms-car-wash'],
    ['5T0vKfIJWP1xTnxA7fJ17w', 'meat-and-bread'],
    ['C0d5kzUx6C19mLcxQyhxCA', 'alamo-drafthouse-cinema-'],
    ['ch1ercqwoNLpQLxpTb90KQ', 'boston-tea-stop']
]

Let's say you want to exclude a business if any string from a list appears as a substring in its information above:

no_interest = ['museum', 'cinema', 'car']

I have done this (we only look at the 2nd column of every entry):

# KEEPING ONLY RESULTS WHERE WE DO NOT FIND THE SUBSTRINGS
[x for x in companies if (no_interest[0] not in x[1]) & (no_interest[1] not in x[1]) & (no_interest[2] not in x[1])]

# Returns
[['5T0vKfIJWP1xTnxA7fJ17w', 'meat-and-bread'],
 ['ch1ercqwoNLpQLxpTb90KQ', 'boston-tea-stop']]

It seems to work, although I would have expected to need an 'OR' statement instead of an 'AND' (&). To me, 'AND' is cumulative and should only apply when ALL the conditions are met ('museum', 'cinema' and 'car' in the same string).

Why is the 'AND' statement acting like an 'OR'? How can we make this code more pythonic and more efficient?

We only check for 3 substrings here, but in practice we are looking for thousands of them. It would be great not to repeat those conditions and instead use something like all() or any() that returns the results rather than a boolean.

Upvotes: 1

Views: 2107

Answers (2)

hpchavaz

Reputation: 1388

Here is another one using regex. Like Henry Ecker's pandas answer, it assumes that none of the 'no_interest' elements contains a regex special character:

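import re  # the standard-library module is enough; the third-party 'regex' package is not required
pattern = re.compile("|".join(no_interest))
# Keep an entry only when the pattern matches in neither column (drop the c[0] test to check only the 2nd column)
out = [c for c in companies if pattern.search(c[0]) is None and pattern.search(c[1]) is None]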
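If some terms may contain regex special characters, one way to lift that assumption is to escape each term with re.escape before joining. This is only a sketch: it checks the 2nd column only, and the extra term 'tea.stop' is made up for illustration (unescaped, its '.' would match the '-' in 'boston-tea-stop').

```python
import re

companies = [
    ['zmpEVqsbCUO1aXStxHkSVA', 'palms-car-wash'],
    ['5T0vKfIJWP1xTnxA7fJ17w', 'meat-and-bread'],
    ['C0d5kzUx6C19mLcxQyhxCA', 'alamo-drafthouse-cinema-'],
    ['ch1ercqwoNLpQLxpTb90KQ', 'boston-tea-stop'],
]

# 'tea.stop' is a hypothetical term, added only to show the effect of escaping.
no_interest = ['museum', 'cinema', 'car', 'tea.stop']

# Escape every term so it is matched literally, then build one alternation.
pattern = re.compile("|".join(re.escape(term) for term in no_interest))

# Keep the entries whose 2nd column contains none of the literal terms.
out = [c for c in companies if pattern.search(c[1]) is None]
print(out)
```

With escaping, 'boston-tea-stop' is kept because the literal text 'tea.stop' does not occur in it.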

Upvotes: 0

Henry Ecker

Reputation: 35626

Why is the 'AND' statement acting like an 'OR'?

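See: De Morgan's laws. (not A) and (not B) and (not C) is equivalent to not (A or B or C), so keeping an entry when none of the substrings is present is the same as excluding it when any one of them is present.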
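To make the equivalence concrete, here is a small sketch that evaluates both forms for every entry of the question's data. The names a, b, c and results are illustrative only.

```python
companies = [
    ['zmpEVqsbCUO1aXStxHkSVA', 'palms-car-wash'],
    ['5T0vKfIJWP1xTnxA7fJ17w', 'meat-and-bread'],
    ['C0d5kzUx6C19mLcxQyhxCA', 'alamo-drafthouse-cinema-'],
    ['ch1ercqwoNLpQLxpTb90KQ', 'boston-tea-stop'],
]
no_interest = ['museum', 'cinema', 'car']

results = []
for _id, slug in companies:
    # a, b, c: is each unwanted substring present in this slug?
    a, b, c = (word in slug for word in no_interest)
    and_of_nots = (not a) and (not b) and (not c)  # form used in the question
    not_of_ors = not (a or b or c)                 # De Morgan equivalent
    results.append((slug, and_of_nots, not_of_ors))

for slug, and_of_nots, not_of_ors in results:
    print(slug, and_of_nots, not_of_ors)
```

Both columns agree on every row, which is why the AND of negated conditions behaves like a negated OR.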

How can we make this code more pythonic and more efficient?

More pythonic:

One option is to use all with a nested comprehension:

companies = [['zmpEVqsbCUO1aXStxHkSVA', 'palms-car-wash'],
             ['5T0vKfIJWP1xTnxA7fJ17w', 'meat-and-bread'],
             ['C0d5kzUx6C19mLcxQyhxCA', 'alamo-drafthouse-cinema-'],
             ['ch1ercqwoNLpQLxpTb90KQ', 'boston-tea-stop']]

no_interest = ['museum', 'cinema', 'car']

out = [x for x in companies if all(ni not in x[1] for ni in no_interest)]
print(out)

Or with not any:

out = [x for x in companies if not any(ni in x[1] for ni in no_interest)]
print(out)
[['5T0vKfIJWP1xTnxA7fJ17w', 'meat-and-bread'],
 ['ch1ercqwoNLpQLxpTb90KQ', 'boston-tea-stop']]

More efficient:

Use a library like pandas:

import pandas as pd

companies = [['zmpEVqsbCUO1aXStxHkSVA', 'palms-car-wash'],
             ['5T0vKfIJWP1xTnxA7fJ17w', 'meat-and-bread'],
             ['C0d5kzUx6C19mLcxQyhxCA', 'alamo-drafthouse-cinema-'],
             ['ch1ercqwoNLpQLxpTb90KQ', 'boston-tea-stop']]

df = pd.DataFrame(data=companies, columns=['id', 'val'])

no_interest = ['museum', 'cinema', 'car']

out = df[~df['val'].str.contains('|'.join(no_interest))]
print(out)

Output as DataFrame

                       id              val
1  5T0vKfIJWP1xTnxA7fJ17w   meat-and-bread
3  ch1ercqwoNLpQLxpTb90KQ  boston-tea-stop

Output as list

print(out.to_numpy().tolist())
[['5T0vKfIJWP1xTnxA7fJ17w', 'meat-and-bread'],
 ['ch1ercqwoNLpQLxpTb90KQ', 'boston-tea-stop']]

Upvotes: 3
