Reputation: 585
I have a data frame (df)
df = pd.DataFrame({'No': [123,234,345,456,567,678], 'text': ['60 ABC','1nHG','KL HG','21ABC','K 200','1g HG'], 'reference':['ABC','HG','FL','','200',''], 'result':['','','','','','']}, columns=['No', 'text', 'reference', 'result'])
No text reference result
0 123 60 ABC ABC
1 234 1nHG HG
2 345 KL HG FL
3 456 21ABC
4 567 K 200 200
5 678 1g HG
and a list with elements
list
['ABC','HG','FL','200','CP1']
Now I have the following coding:
for idx, row in df.iterrows():
for item in list:
if row['text'].strip().endswith(item):
if pd.isnull(row['reference']):
df.at[idx, 'result'] = item
elif pd.notnull(row['reference']) and row['reference'] != item:
df.at[idx, 'result'] = 'wrong item'
if pd.isnull(row['result']):
break
I run through df and the list and check for matches.
Output:
No text reference result
0 123 60 ABC ABC
1 234 1nHG HG
2 345 KL HG FL wrong item
3 456 21ABC ABC
4 567 K 200 200
5 678 1g HG HG
The break instruction is important because otherwise a second element could be found within the list and then this second element would overwrite the content in result.
Now I need another solution because the data frame is huge and for loops are inefficient. Think using apply could work but how?
Thank you!
Upvotes: 1
Views: 1958
Reputation: 164693
Instead of iterating rows, you can iterate your suffixes, which is likely a much smaller iterable. This way, you can take advantage of series-based methods and Boolean indexing.
I've also created an extra series to identify when a row has been updated. The cost of this extra check should be small versus the expense of iterating by row.
L = ['ABC', 'HG', 'FL', '200', 'CP1']
df['text'] = df['text'].str.strip()
null = df['reference'].eq('')
df['updated'] = False
for item in L:
ends = df['text'].str.endswith(item)
diff = df['reference'].ne(item)
m1 = ends & null & ~df['updated']
m2 = ends & diff & ~null & ~df['updated']
df.loc[m1, 'result'] = item
df.loc[m2, 'result'] = 'wrong item'
df.loc[m1 | m2, 'updated'] = True
Result:
No text reference result updated
0 123 60 ABC ABC False
1 234 1nHG HG False
2 345 KL HG FL wrong item True
3 456 21ABC ABC True
4 567 K 200 200 False
5 678 1g HG HG True
You can drop the final column, but you may find it useful for other purposes.
Upvotes: 2