Reputation: 559
Suppose I have a DataFrame such as:
   col1 col2
0     1    A
1     2    B
2     6    A
3     5    C
4     9    C
5     3    A
6     5    B
And multiple lists such as:
list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]
I can update the value of col2 depending on whether the value of col1 is included in a list, for example:
for i in list_1:
    df.loc[df.col1 == i, 'col2'] = 'A'
for i in list_2:
    df.loc[df.col1 == i, 'col2'] = 'B'
for i in list_3:
    df.loc[df.col1 == i, 'col2'] = 'C'
However, this is very slow. With a DataFrame of 30,000 rows and each list containing approximately 5,000-10,000 items, it can take a long time to run, especially compared to other pandas operations. Is there a better (faster) way of doing this?
Upvotes: 5
Views: 2343
Reputation: 12068
I would suggest iterating through your lists with a dictionary using conditional updating:
import numpy as np

# Map each replacement value to the list of col1 values that should receive it
col_dict = {
    "A": [1, 2, 4],
    "B": [3, 8],
    "C": [5, 6, 7, 9],
}
# Iterate and update
for key, value in col_dict.items():
    # key is the new col2 value; value is the lookup list
    df["col2"] = np.where(df["col1"].isin(value), key, df["col2"])
There is a concern about overwriting values, since a row can technically match multiple lists. With the loop above, each pass overwrites the previous one, so the last matching key in the dictionary wins, which may not be what you want.
If rows cannot match more than one key, consider an incremental approach: keep a running index of the still-unmatched rows and update it as you proceed, so that each iteration scans fewer rows than the one before.
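The running index of unmatched rows could be sketched roughly as follows. This is only an illustration using the sample data from the question; the names `unmatched`, `hit` and `hit_index` are made up for the example, and it assumes the DataFrame index has no duplicate labels. Note that here the first matching key wins, because matched rows are never revisited.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 6, 5, 9, 3, 5],
    "col2": ["A", "B", "A", "C", "C", "A", "B"],
})
col_dict = {"A": [1, 2, 4], "B": [3, 8], "C": [5, 6, 7, 9]}

# Boolean mask of rows that no list has claimed yet (hypothetical helper name).
unmatched = pd.Series(True, index=df.index)

for label, values in col_dict.items():
    if not unmatched.any():
        break  # every row already has a label
    # Test membership only on the rows that are still unmatched.
    hit = df.loc[unmatched, "col1"].isin(values)
    hit_index = hit[hit].index
    df.loc[hit_index, "col2"] = label
    # Shrink the mask so later iterations skip these rows.
    unmatched.loc[hit_index] = False
```

Whether this actually beats a plain isin pass per list depends on the data; on 30,000 rows the masking overhead may cancel out the savings, so it is worth timing both.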
Upvotes: 1
Reputation: 75080
You can use isin with np.select here:
df['col2'] = np.select(
    [df['col1'].isin(list_1),
     df['col1'].isin(list_2),
     df['col1'].isin(list_3)],
    ['A', 'B', 'C'],
    default=df['col2'],  # keep the existing value when no list matches
)
With map:
d = dict(zip(map(tuple,[list_1,list_2,list_3]),['A','B','C']))
df['col2'] = df['col1'].map({val: v for k,v in d.items() for val in k})
   col1 col2
0     1    A
1     2    A
2     6    C
3     5    C
4     9    C
5     3    B
6     5    C
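One thing worth knowing about np.select is how it treats overlaps and misses: for each row it takes the choice belonging to the first condition that is true, and rows that match nothing get the `default`. The small sketch below uses made-up lists (3 is deliberately in both, 10 is in neither) to show that behaviour; the data is hypothetical and not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 3, 7, 10], "col2": ["x", "x", "x", "x"]})

# Hypothetical lists: 3 appears in both, 10 appears in neither.
list_1 = [1, 3]
list_2 = [3, 7]

conditions = [df["col1"].isin(list_1), df["col1"].isin(list_2)]

# First true condition wins; unmatched rows fall back to the current col2.
df["col2"] = np.select(conditions, ["A", "B"], default=df["col2"])
```

Here the row with col1 == 3 ends up as 'A' (list_1 is checked first) and the row with col1 == 10 keeps 'x'.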
Upvotes: 6
Reputation: 19947
You can first convert the lists to dicts and then map to col1.
d1 = {k: 'A' for k in list_1}
d2 = {k: 'B' for k in list_2}
d3 = {k: 'C' for k in list_3}
# map() yields NaN where a key is missing, so combine_first can fill from the next dict
df['col2'] = (
    df.col1.map(d1)
    .combine_first(df.col1.map(d2))
    .combine_first(df.col1.map(d3))
)
If there are no duplicates across the lists, you can make it even faster by merging them into a single dict:
d = {**{k: 'A' for k in list_1},
     **{k: 'B' for k in list_2},
     **{k: 'C' for k in list_3}}
# Rows whose col1 is in none of the lists keep their current col2 value
df['col2'] = df.col1.map(d).fillna(df.col2)
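Since the single-dict version silently lets later lists overwrite earlier ones, it may be worth checking the "no duplicates" assumption while building the dict. A minimal sketch, using the lists and sample data from the question; the names `labelled` and `lookup` are made up for the example:

```python
import pandas as pd

list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]

labelled = {"A": list_1, "B": list_2, "C": list_3}  # hypothetical helper dict

# Build one value -> label lookup, failing loudly if a value has two labels.
lookup = {}
for label, values in labelled.items():
    for value in values:
        if value in lookup and lookup[value] != label:
            raise ValueError(
                f"{value} is in both {lookup[value]!r} and {label!r}"
            )
        lookup[value] = label

df = pd.DataFrame({
    "col1": [1, 2, 6, 5, 9, 3, 5],
    "col2": ["A", "B", "A", "C", "C", "A", "B"],
})

# Single vectorised lookup; unmatched rows keep their existing col2.
df["col2"] = df["col1"].map(lookup).fillna(df["col2"])
```

Building the lookup is a one-off cost proportional to the total length of the lists, after which the update is a single map call.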
Upvotes: 4