Reputation: 65
So I have a Pandas dataframe that I am getting from an html webpage. The dataframe is ONLY 1 column and that column has no identifying name. I want to find a specific substring from within the dataframe, and then get the text immediately following that substring.
Note: there will NEVER be repeats in the substring search.
Eg: there will NEVER be 2 instances of School 2:
The dataframe is formatted like this:
School 1: 1 Hour Delay
School 2: 2 Hour Delay
School 3: Closed
I want to be able to search for School 3: and then return the status, whether it be closed, 1 hour delay, or 2 hour delay.
My initial thought was just if "School 3:" in df print("School 3: found")
But I just get an error from that, I'm assuming because you can't just check for a string like that. If anyone knows how to find a substring and then get the text after it I would love to know.
Upvotes: 1
Views: 550
Reputation: 402852
Assuming exactly one row always matches this condition, you can use str.extract
:
df.iloc[:,0].str.extract('(?<=School 3: )(.*)', expand=False).dropna().values[0]
# 'Closed'
(Note: if more than one row matches this condition, only the status of the first match is returned.)
Otherwise, if it is possible nothing matches, you will need a try-except:
try:
status = (df.iloc[:,0]
.str.extract('(?<=School 3: )(.*)', expand=False)
.dropna()
.values[0])
except (IndexError, ValueError):
status = np.nan
Upvotes: 1
Reputation: 10890
Supposed the dataframe looks like
status
0 School 1: 1 Hour Delay
1 School 2: 2 Hour Delay
2 School 3: Closed
you could do
txt = 'School 3'
df.status[df.status.str.contains(txt)].str[len(txt) + 2:] # +2 for skipping ": " after the school name
Result:
2 Closed
Name: status, dtype: object
However, IMO it would be even more clear by firstly splitting the single column which contains two informations in two:
df = df.status.str.split(': ', expand=True)
df.columns = ['school', 'status']
# school status
#0 School 1 1 Hour Delay
#1 School 2 2 Hour Delay
#2 School 3 Closed
then you can simply retrieve the contents of column two via boolean indexing of column one:
txt = 'School 3'
df.status[df.school==txt]
#2 Closed
#Name: status, dtype: object
Upvotes: 0