Reputation: 337
I am creating new Boolean columns based on whether or not other columns contain certain string values.
This is what I have tried:
def function(data):
    data["col1"] = 0
    data["col2"] = 0
    data["col3"] = 0
    for i in range(len(data)):
        if ("cond1" in data.loc[i, "old_col1"].lower()) or ("cond2" in data.loc[i, "old_col1"].lower()):
            data.loc[i, "col1"] = 1
        elif ("cond3" in data.loc[i, "old_col1"].lower()) or ("cond4" in data.loc[i, "old_col2"].lower()):
            data.loc[i, "col2"] = 1
        elif ("cond5" in data.loc[i, "old_col1"].lower()) or ("cond6" in data.loc[i, "old_col3"].lower()):
            data.loc[i, "col3"] = 1
function(data)
But it doesn't scale well to larger datasets.
Are there better ways to implement Boolean columns, col1-3, that would perform faster?
Upvotes: 1
Views: 1598
Reputation: 32224
I made an example data frame, as you did not provide one:
col1 col2 col3
0 foo cucumber HogsWatch
1 bar selery hogswatch
2 baz Porcupine Watch Hogs
You can use apply to get a function to work over a whole data frame:

import re

df.apply(lambda x: x.str.contains('A', flags=re.IGNORECASE))
col1 col2 col3
0 False False True
1 True False True
2 True False True
This means that you can produce a new data frame with the Boolean columns and, if you want, join it into the original data frame:
bool_df = df.apply(lambda x: x.str.contains('A', flags=re.IGNORECASE))
df = df.merge(bool_df, left_index=True, right_index=True, suffixes=['', '_bool'])
col1 col2 col3 col1_bool col2_bool col3_bool
0 foo cucumber HogsWatch False False True
1 bar selery hogswatch True False True
2 baz Porcupine Watch Hogs True False True
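As an aside, since bool_df shares the index of df, DataFrame.join should give the same result in one call; a minimal equivalent sketch (rsuffix resolves the clashing column names):

df = df.join(bool_df, rsuffix='_bool')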
Of course you can make more complicated regexes in str.contains, such as
df.apply(lambda x: x.str.contains('A|O', flags=re.IGNORECASE))
col1 col2 col3
0 True False True
1 True False True
2 True True True
I notice that you have different conditions for each column. This is also achievable with this methodology; it is a bit more complicated, but still fast.
First, for each column, we extract the actual matching strings into a data frame and reduce it to one Boolean per row:
conditions = {"col1": ["ar", "f"], "col2": ["er", "c"], "col3": ["Hog", " "]}
for col_name, strings in conditions.items():
    regex = "(" + ")|(".join(strings) + ")"
    df_cond = df[col_name].str.extract(regex, flags=re.IGNORECASE).notnull()
    df[col_name + '_matches'] = df_cond.any(axis=1)
Produces
col1 col2 col3 col1_matches col2_matches col3_matches
0 foo cucumber HogsWatch True True True
1 bar selery hogswatch True True True
2 baz Porcupine Watch Hogs False True True
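For comparison, the same per-column matching can be sketched with str.contains and a joined alternation pattern, assuming the same example frame and conditions dict as above; it should produce the same Booleans in one step per column:

import re

import pandas as pd

# Same example frame and conditions as above
df = pd.DataFrame({
    "col1": ["foo", "bar", "baz"],
    "col2": ["cucumber", "selery", "Porcupine"],
    "col3": ["HogsWatch", "hogswatch", "Watch Hogs"],
})
conditions = {"col1": ["ar", "f"], "col2": ["er", "c"], "col3": ["Hog", " "]}

for col_name, strings in conditions.items():
    # re.escape keeps each substring literal inside the alternation
    pattern = "|".join(re.escape(s) for s in strings)
    df[col_name + "_matches"] = df[col_name].str.contains(pattern, flags=re.IGNORECASE)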
Upvotes: 1
Reputation: 937
You should do something like this:
data["col1"] = 0
data["col2"] = 0
data["col3"] = 0
data.loc[data["old_col1"].str.lower().isin(["cond1", "cond2"]), 'col1'] = 1
data.loc[data["old_col2"].str.lower().isin(["cond3", "cond4"]), 'col2'] = 1
data.loc[data["old_col3"].str.lower().isin(["cond5", "cond6"]), 'col3'] = 1
Upvotes: 0
Reputation: 441
This can be simplified as:

data['col1'] = (data['old_col1'].str.lower() == 'cond1') | (data['old_col1'].str.lower() == 'cond2')
or
data['col1'] = data['old_col1'].str.lower().isin(['cond1', 'cond2'])
Upvotes: 0