How can I find a specific substring in a Pandas DataFrame, and then get the text after it?

Question

So I have a Pandas DataFrame that I am getting from an HTML webpage. The DataFrame has ONLY one column, and that column has no identifying name. I want to find a specific substring within the DataFrame, and then get the text immediately following that substring.

Note: there will NEVER be repeats in the substring search.
E.g., there will NEVER be two instances of School 2:

The dataframe is formatted like this:

School 1: 1 Hour Delay
School 2: 2 Hour Delay
School 3: Closed

I want to be able to search for School 3: and then return the status, whether it be closed, 1 hour delay, or 2 hour delay.

My initial thought was just if "School 3:" in df print("School 3: found"), but I just get an error from that. I'm assuming it's because you can't check for a string like that. If anyone knows how to find a substring and then get the text after it, I would love to know.

cs95 · Accepted Answer

Assuming exactly one row always matches this condition, you can use str.extract:

df.iloc[:,0].str.extract('(?<=School 3: )(.*)', expand=False).dropna().values[0]
# 'Closed'

(Note: if more than one row matches this condition, only the status of the first match is returned.)
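For anyone who wants to try this out, here is a minimal, self-contained sketch. The sample DataFrame is an assumption built from the three rows shown in the question, not the real scraped data:

```python
import pandas as pd

# Assumed sample data copied from the question; the single column has no name.
df = pd.DataFrame(["School 1: 1 Hour Delay",
                   "School 2: 2 Hour Delay",
                   "School 3: Closed"])

# The lookbehind (?<=School 3: ) anchors the match right after the prefix,
# and the capture group (.*) grabs the rest of that row's text.
# Rows that do not contain the prefix come back as NaN.
matches = df.iloc[:, 0].str.extract(r"(?<=School 3: )(.*)", expand=False)

# Drop the non-matching rows and take the first remaining value.
status = matches.dropna().values[0]
print(status)
```

Selecting the column by position with iloc[:, 0] sidesteps the fact that the column has no identifying name.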

Otherwise, if it is possible that nothing matches, you will need a try-except:

import numpy as np

try:
    status = (df.iloc[:, 0]
                .str.extract('(?<=School 3: )(.*)', expand=False)
                .dropna()
                .values[0])
except (IndexError, ValueError):
    # No row matched, so the array is empty and [0] raises IndexError.
    status = np.nan
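If you need to look up more than one school, the same idea could be wrapped in a small helper. This is only a sketch: the function name text_after and the sample data are hypothetical, and it checks for an empty result instead of catching the exception. re.escape is used so that a prefix containing regex metacharacters is matched literally.

```python
import re

import numpy as np
import pandas as pd


def text_after(df, prefix):
    """Return the text following `prefix` in the first column, or NaN if no row matches.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Escape the prefix so it is treated as literal text inside the lookbehind.
    pattern = "(?<=" + re.escape(prefix) + ")(.*)"
    found = df.iloc[:, 0].str.extract(pattern, expand=False).dropna()
    # An empty result means nothing matched.
    return found.iloc[0] if not found.empty else np.nan


# Assumed sample data copied from the question.
df = pd.DataFrame(["School 1: 1 Hour Delay",
                   "School 2: 2 Hour Delay",
                   "School 3: Closed"])

print(text_after(df, "School 3: "))
print(text_after(df, "School 9: "))
```

Note that the prefix has to include the trailing space; otherwise the captured status starts with a space.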
