Timo Kvamme

Reputation: 2964

Pandas check which substring is in column of strings

I'm trying to write a function that adds a new column to a pandas DataFrame. For each row, it should work out which substring from a given list occurs in a column of strings and use that substring as the value of the new column.

The problem is that the text to find does not appear at the same position in every value of column x.

import pandas as pd

df = pd.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                         "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                   'x1': [4, 5, 6, 8]})

finds = ["m500_0","0_500","m150_0"]

I want to know which entry of finds occurs in each row of df["x"].

I've written a function that works, but it is terribly slow on large datasets:

def pd_create_substring_var(df, new_var_name="new_var", substring_list=("1",), var_ori="x"):
    import re
    # Start with a placeholder; rows without any match keep "na".
    df[new_var_name] = "na"
    cols = list(df.columns)
    # Row-by-row loop: this is the slow part on large frames.
    for ix in range(len(df)):
        for find in substring_list:
            for m in re.finditer(find, df.iloc[ix][var_ori]):
                df.iat[ix, cols.index(new_var_name)] = df.iloc[ix][var_ori][m.start():m.end()]
    return df


df = pd_create_substring_var(df,"t",finds,var_ori="x")

df 
                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0

Upvotes: 0

Views: 1953

Answers (5)

Shehan Ishanka

Reputation: 593

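Try this. It takes the first entry of finds that occurs in each string, or None if none of them does:

df["t"] = df["x"].apply(lambda x: next((i for i in finds if i in x), None))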

Upvotes: 0

Rajat Jain

Reputation: 2022

Use pandas.Series.str.findall, which returns a list of all matches for each row:

df['x'].str.findall("|".join(finds))

0    [m500_0]
1    [m500_0]
2     [0_500]
3    [m150_0]
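
Since str.findall returns a list per row, one more step is needed to get a plain string column. A minimal sketch, assuming only the first match per row is wanted. The re.escape call and the extra "no_match_here" row are additions for illustration and not part of the original answer:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text",
          "no_match_here"],  # extra row (assumption) to show the no-match case
    "x1": [4, 5, 6, 8, 9],
})
finds = ["m500_0", "0_500", "m150_0"]

# Escape each substring so it is matched literally, then join with "|".
pattern = "|".join(re.escape(f) for f in finds)

# findall gives a list of matches per row; .str[0] picks the first one
# and yields NaN where the list is empty.
df["t"] = df["x"].str.findall(pattern).str[0]
```

If a row can contain several of the substrings and all of them matter, keeping the list column from findall is probably the better choice.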

Upvotes: 2

Zinnia Razia

Reputation: 39

I don't know how large your dataset is, but you can use the map function as shown below:

import pandas

def compare(x, finds):
    # Return the first entry of finds contained in x (None if there is none).
    for f in finds:
        if f in x:
            return f

def subset_df_test():
    df = pandas.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                                 "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                           'x1': [4, 5, 6, 8]})

    finds = ["m500_0", "0_500", "m150_0"]
    df['t'] = df['x'].map(lambda x: compare(x, finds))
    print(df)

Upvotes: 1

U13-Forward

Reputation: 71560

Probably not the best way:

df['t'] = df['x'].apply(lambda x: ''.join([i for i in finds if i in x]))

Printing the result:

print(df)

gives:

                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0

And now, just adding to @pythonjokeun's answer, you can do:

df["t"] = df["x"].str.extract("(%s)" % '|'.join(finds))

Or:

df["t"] = df["x"].str.extract("({})".format('|'.join(finds)))

Or:

df["t"] = df["x"].str.extract("(" + '|'.join(finds) + ")")

Upvotes: 1

pythonjokeun

Reputation: 431

Does this accomplish what you need?

finds = ["m500_0", "0_500", "m150_0"]
df["t"] = df["x"].str.extract(f"({'|'.join(finds)})")
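
A slightly more defensive variant of the same idea, offered as a sketch rather than a definitive implementation. The helper name add_substring_column and its parameters are hypothetical. It assumes the substrings should be matched literally (hence re.escape) and that, when one substring is a prefix of another, the longer one should win (hence the sort by length):

```python
import re
import pandas as pd

def add_substring_column(df, finds, source="x", target="t"):
    """Return a copy of df with the first matching substring in a new column."""
    # Longest first, so a longer candidate is tried before any of its prefixes.
    ordered = sorted(finds, key=len, reverse=True)
    # Escape each candidate so characters like "." or "+" are taken literally.
    pattern = "(" + "|".join(re.escape(f) for f in ordered) + ")"
    out = df.copy()
    # expand=False returns a Series for a single capture group;
    # rows without a match get NaN.
    out[target] = out[source].str.extract(pattern, expand=False)
    return out

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]
result = add_substring_column(df, finds)
```

Because the matching is done by the vectorized string methods instead of a Python loop over rows, this should scale much better than the function in the question, although the actual speedup depends on the data.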

Upvotes: 3
