Reputation: 21
Let's say I have a list
lst = ["fi", "ap", "ko", "co", "ex"]
and we have this series
Explanation
a "fi doesn't work correctly"
b "apples are cool"
c "this works but translation is ko"
and I'm looking to get something like this:
Explanation Explanation Extracted
a "fi doesn't work correctly" "fi"
b "apples are cool" "N/A"
c "this works but translation is ko" "ko"
Upvotes: 0
Views: 3598
Reputation: 13582
Option 1
Assuming that one wants to extract the exact string in the list lst
one can start by creating a regex
regex = f'\\b({"|".join(lst)})\\b'
where \b
is the word boundary (beginning or end of a word) that indicates the word is not followed by additional characters, or with characters before. So, considering that one has the string ap
in the list lst
, if one has the word apple
in the dataframe, that won't be considered.
And then, using pandas.Series.str.extract
, and, to make it case insensitive, use re.IGNORECASE
import re
df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool NaN
2 3 this works but translation is ko ko
Option 2
One can also use pandas.Series.apply
with a custom lambda function.
df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))
[Out]:
ID Explanation Explanation Extracted
0 1 fi doesn't work correctly fi
1 2 cap ples are cool N/A
2 3 this works but translation is ko ko
Notes:
.lower()
is to make it case insensitive.
.split()
is one way to prevent that even though ap
is in the list, the string apple
doesn't appear in the Explanation Extracted
column.
Upvotes: 0
Reputation: 11321
With a dataframe like
df = pd.DataFrame(
{"Explanation": ["fi doesn't co work correctly",
"apples are cool",
"this works but translation is ko"]},
index=["a", "b", "c"]
)
you can use .str.extract()
to do
lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)
to get
Explanation Explanation Extracted
a fi doesn't co work correctly fi
b apples are cool NaN
c this works but translation is ko ko
The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
looks for an occurrence of one of the lst
items either at the beginning with withespace afterwards, in the middle with whitespace before and after, or at the end with withespace before. str.extract()
extracts the capture group (the part in the middle in ()
). Without a match the return is NaN
.
If you want to extract multiple matches, you could use .str.findall()
and then ", ".join
the results:
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
Alternative without regex:
df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)
If you only want to match at the beginning or end of the sentences, then replace the first part with:
df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
(splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...
Upvotes: 1
Reputation: 851
apply
function of Pandas might be helpful
def extract_explanation(dataframe):
custom_substring = ["fi", "ap", "ko", "co", "ex"]
substrings = dataframe['explanation'].split(" ")
explanation = "N/A"
for string in substrings:
if string in custom_substring:
explanation = string
return explanation
df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)
The catch here is assumption of only one explanation, but it can be converted into a list, if multiple explanations are expected.
Upvotes: 0
Reputation: 416
I think this solves your problem.
import pandas as pd
lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])
extracted =[]
for index, row in df.iterrows():
tempList =[]
rowSplit = row['Explanation'].split(" ")
for val in rowSplit:
if val in lst:
tempList.append(val)
if len(tempList)>0:
extracted.append(','.join(tempList))
else:
extracted.append('N/A')
df['Explanation Extracted'] = extracted
Upvotes: 0