Reputation: 2443
I would like to extract data containing a specific character string in another column.
For example, The target extracted is like "another column string + 3 digits" character.
It has error. I would like to get TARGET row.
df = pd.DataFrame({'col1':['xxxx', 'yyyy', 'zzzz'],'col2':['xxxx123','yyyy1234','aaa123']})
col1 | col2
xxxx | xxxx123 <- TARGET
yyyy | yyyy1234 <- Not TARGET
zzzz | aaaa123 <- Not TARGET
This is my code which does not work.
print(df[df['col1'].str.match(df['col2'] + [0-9][0-9][0-9])])
I had tried str.contains and str.match and isin. Probably I don't figure out how to use them.
Please let me know how to do it.
Upvotes: 2
Views: 477
Reputation: 164683
You can filter by the intersection of two Boolean masks:
n = 3 # number of digits
mask1 = pd.to_numeric(df['col2'].str[-n:], errors='coerce').notnull()
mask2 = [col2[:-n] == col1 for col1, col2 in zip(df['col1'], df['col2'])]
df_slice = df[mask1 & mask2]
print(df_slice)
col1 col2
0 xxxx xxxx123
You will likely find regex expensive versus regular str
operations. Currently, Pandas str
methods are also efficient.
df = pd.DataFrame({'col1':['xxxx', 'yyyy', 'zzzz'],'col2':['xxxx123','yyyy1234','aaa123']})
def vai(df):
cond1 = df.col2.str.extract('([A-Za-z]+)\d', expand = False).eq(df.col1)
cond2 = df.col2.str.extract('[A-Za-z](\d{3})$', expand = False)
return cond1 & cond2
def jpp(df):
n = 3 # number of digits
mask1 = pd.to_numeric(df['col2'].str[-n:], errors='coerce').notnull()
mask2 = [col2[:-n] == col1 for col1, col2 in zip(df['col1'], df['col2'])]
return mask1 & mask2
def jpp2(df):
n = 3 # number of digits
mask1 = pd.to_numeric(df['col2'].str[-n:], errors='coerce').notnull()
mask2 = df['col2'].str[:-n] == df['col1']
return mask1 & mask2
df = pd.concat([df]*1000)
assert vai(df).equals(jpp(df)) and vai(df).equals(jpp2(df))
%timeit vai(df) # 17.3 ms per loop
%timeit jpp(df) # 5.4 ms per loop
%timeit jpp2(df) # 8.01 ms per loop
Upvotes: 1
Reputation: 38415
Two pattern matches, and filter the dataframe
cond1 = df.col2.str.extract('([A-Za-z]+)\d', expand = False).eq(df.col1)
cond2 = df.col2.str.extract('[A-Za-z](\d{3})$', expand = False)
df[(cond1) & (cond2)]
col1 col2
0 xxxx [email protected]
Upvotes: 3
Reputation: 1266
You can write a function that performs string matching or uses regular expressions and pass it to apply() method.
Upvotes: 0