Steele Farnsworth

Reputation: 893

Pandas: Determine if a string in one column is a substring of a string in another column

Consider these series:

>>> a = pd.Series('abc a abc c'.split())
>>> b = pd.Series('a abc abc a'.split())
>>> pd.concat((a, b), axis=1)
     0    1
0  abc    a
1    a  abc
2  abc  abc
3    c    a

>>> unknown_operation(a, b)
0    False
1     True
2     True
3    False

The desired logic is to determine if the string in the left column is a substring of the string in the right column. pd.Series.str.contains does not accept another Series, and pd.Series.isin checks if the value exists in the other series (not in the same row specifically). I'm interested to know if there's a vectorized solution (not using .apply or a loop), but it may be that there isn't one.

Upvotes: 1

Views: 1033

Answers (3)

fogx

Reputation: 1810

I tested various functions on a randomly generated DataFrame with 1,000,000 rows of 5-character strings.

On my machine, the averages over 3 runs were, from fastest to slowest:

zip: 0.21s
v_find: 0.79s
to_list: 1s
any: 3.55s
apply: 8.6s

Hence, I would recommend using zip:

[x[0] in x[1] for x in zip(df['A'], df['B'])]

or the vectorized find (as proposed by BENY):

np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
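As a quick sanity check, the zip approach can be run against the two series from the question. This is only a minimal sketch: `is_substring` is a hypothetical helper name (not a pandas function), and wrapping the result in a Series with the left-hand index is my own assumption about what is convenient.

```python
import pandas as pd

# Series from the question
a = pd.Series('abc a abc c'.split())
b = pd.Series('a abc abc a'.split())


def is_substring(left: pd.Series, right: pd.Series) -> pd.Series:
    """Row-wise test: is left[i] a substring of right[i]? (hypothetical helper)"""
    # zip pairs values by position, so both Series should have the same length
    return pd.Series(
        [x in y for x, y in zip(left, right)],
        index=left.index,
    )


result = is_substring(a, b)
print(result.tolist())  # [False, True, True, False]
```

Note that zip pairs values by position rather than by index label, so this assumes both columns come from the same DataFrame or are otherwise aligned.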

My test-setup:

import random
import string

import numpy as np
import pandas as pd

n = 1_000_000  # number of rows

def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

A = [generate_string(5) for x in range(n)]
B = [generate_string(5) for y in range(n)]
df = pd.DataFrame({"A": A, "B": B})

# Results use a _res suffix so they don't shadow the built-ins any() and zip()
to_list_res = pd.Series([a in b for a, b in df[['A', 'B']].values.tolist()])

apply_res = df.apply(lambda s: s["A"] in s["B"], axis=1)

v_find_res = np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1

any_res = df["B"].str.split('', expand=True).eq(df["A"], axis=0).any(axis=1) | df["B"].eq(df["A"])

zip_res = [x[0] in x[1] for x in zip(df['A'], df['B'])]
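The timing code itself is not shown above, so here is a rough sketch of how such a comparison might be run with `timeit`. Everything in it is an assumption rather than the original harness: the row count is cut to 10,000 so it finishes quickly, the needles are 2 characters long so that at least some rows can match, and `with_zip` / `with_find` are hypothetical wrapper names.

```python
import random
import string
import timeit

import numpy as np
import pandas as pd

n = 10_000  # assumed; far smaller than the 1,000,000 rows in the test above
random.seed(0)


def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))


df = pd.DataFrame({
    "A": [generate_string(2) for _ in range(n)],  # assumed shorter needles
    "B": [generate_string(5) for _ in range(n)],
})


def with_zip():
    return [x in y for x, y in zip(df["A"], df["B"])]


def with_find():
    haystack = df["B"].to_numpy().astype(str)
    needle = df["A"].to_numpy().astype(str)
    return np.char.find(haystack, needle) != -1


# Both candidates should agree before their timings are compared
assert with_zip() == with_find().tolist()

for func in (with_zip, with_find):
    times = timeit.repeat(func, number=1, repeat=3)
    print(f"{func.__name__}: {sum(times) / len(times):.4f}s (mean of 3 runs)")
```

Absolute numbers will differ by machine and library versions, so only the relative ordering is worth comparing.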

Upvotes: 1

BENY

Reputation: 323226

Let's try NumPy's defchararray functions (exposed publicly as np.char), which are vectorized:

import numpy as np
np.char.find(df['1'].values.astype(str), df['0'].values.astype(str)) != -1
Out[740]: array([False,  True,  True, False])
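For reference, here is a self-contained sketch of the same idea applied to the frame built in the question, where `pd.concat` produces the integer column labels 0 and 1 rather than the strings '0' and '1'. The variable names `haystack`, `needle` and `mask` are mine, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; its columns are the integers 0 and 1
a = pd.Series('abc a abc c'.split())
b = pd.Series('a abc abc a'.split())
df = pd.concat((a, b), axis=1)

# np.char.find returns the lowest index where the substring occurs, or -1
haystack = df[1].to_numpy().astype(str)  # strings to search in
needle = df[0].to_numpy().astype(str)    # strings to search for
mask = np.char.find(haystack, needle) != -1

print(mask)  # [False  True  True False]
```

One thing to keep in mind is that `astype(str)` turns missing values into the literal string 'nan', so columns containing NaN may need to be handled separately.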

Upvotes: 1

Scott Boston

Reputation: 153460

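IIUC (note that this appears to match only when the left value is a single character of the right string, or equal to the whole right string, as in the example data):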

df[1].str.split('', expand=True).eq(df[0], axis=0).any(axis=1) | df[1].eq(df[0])

Output:

0    False
1     True
2     True
3    False
dtype: bool

Upvotes: 1
