Reputation: 21

How to extract strings from a list in a column in a python pandas dataframe?

Let's say I have a list

lst = ["fi", "ap", "ko", "co", "ex"]

and we have this series

       Explanation 

a      "fi doesn't work correctly" 
b      "apples are cool" 
c      "this works but translation is ko"

and I'm looking to get something like this:

        Explanation                         Explanation Extracted

a      "fi doesn't work correctly"          "fi"
b      "apples are cool"                    "N/A"
c      "this works but translation is ko"   "ko"

Upvotes: 0

Answers (4)

Gonçalo Peres

Reputation: 13582

Option 1

Assuming that one wants to extract the exact string in the list lst one can start by creating a regex

regex = f'\\b({"|".join(lst)})\\b'

where \b is the word boundary (beginning or end of a word) that indicates the word is not followed by additional characters, or with characters before. So, considering that one has the string ap in the list lst, if one has the word apple in the dataframe, that won't be considered.

And then, using pandas.Series.str.extract, and, to make it case insensitive, use re.IGNORECASE

import re

df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)

[Out]:
   ID                       Explanation Explanation Extracted
0   1         fi doesn't work correctly                    fi
1   2                 cap ples are cool                   NaN
2   3  this works but translation is ko                    ko

Option 2

One can also use pandas.Series.apply with a custom lambda function.

df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))

[Out]:
   ID                       Explanation Explanation Extracted
0   1         fi doesn't work correctly                    fi
1   2                 cap ples are cool                   N/A
2   3  this works but translation is ko                    ko

Notes:

.lower() is to make it case insensitive.
.split() is one way to prevent that even though ap is in the list, the string apple doesn't appear in the Explanation Extracted column.

Upvotes: 0

Timus

Reputation: 11321

With a dataframe like

df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"]
)

you can use .str.extract() to do

lst = ["fi", "ap", "ko", "co", "ex"]

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)

to get

                        Explanation Explanation Extracted
a      fi doesn't co work correctly                    fi
b                   apples are cool                   NaN
c  this works but translation is ko                    ko

The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for an occurrence of one of the lst items either at the beginning with withespace afterwards, in the middle with whitespace before and after, or at the end with withespace before. str.extract() extracts the capture group (the part in the middle in ()). Without a match the return is NaN.

If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
    df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)

Alternative without regex:

df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
    matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)

If you only want to match at the beginning or end of the sentences, then replace the first part with:

df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
    (splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...

Upvotes: 1

Uchiha012

Reputation: 851

apply function of Pandas might be helpful

def extract_explanation(dataframe):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    substrings = dataframe['explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation

df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)

The catch here is assumption of only one explanation, but it can be converted into a list, if multiple explanations are expected.

Upvotes: 0

Nipuna Upeksha

Reputation: 416

I think this solves your problem.

import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])

extracted =[] 
for index, row in df.iterrows():
    tempList =[] 
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList)>0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')

df['Explanation Extracted'] = extracted

Upvotes: 0

How to extract strings from a list in a column in a python pandas dataframe?

Answers (4)

Related Questions