Reputation: 307
Working on a PDF extraction tool. Say I have the following DataFrame. I don't know the column names or how many columns there are. All I know is that somewhere in this DataFrame I can find the string extract this: xxxx. I need to extract that string.
import numpy as np
import pandas as pd
data = {'these':['Value1', 'padding'], 'are':['Value2', np.nan], 'random':[123, 'dont'], 'names':['extract this: 1236', 'find']}
df = pd.DataFrame(data)
+---------+--------+--------+--------------------+
| these | are | random | names |
+---------+--------+--------+--------------------+
| Value1 | Value2 | 123 | extract this: 1236 |
| padding | nan | dont | find |
+---------+--------+--------+--------------------+
I can get the matching row as an array (as shown below), which I could then clean by removing all non-string elements before searching for the substring, but I don't like this approach. Is there a better way of doing this?
mask = np.column_stack([df[col].str.contains(r"extract this: ", na=False) for col in df])
inv_num_arr = df.loc[mask.any(axis=1)].values[0]
The output should just be the string extract this: 1236
Upvotes: 0
Views: 206
Reputation: 7693
You can use re.search by converting the DataFrame into a string, like this:
import re
re.search(r'extract this:\s\d+', df.to_string()).group(0)
'extract this: 1236'
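Note that re.search returns None when nothing matches, so calling .group(0) directly would raise an AttributeError in that case. As an alternative that does not depend on the rendered text layout, here is a minimal sketch that scans the cells directly and skips non-string values. The helper name find_first and the pattern variable are made up for illustration, and the sample data is the one from the question.

```python
import re

import numpy as np
import pandas as pd

data = {'these': ['Value1', 'padding'], 'are': ['Value2', np.nan],
        'random': [123, 'dont'], 'names': ['extract this: 1236', 'find']}
df = pd.DataFrame(data)

# Assumed pattern: the literal prefix followed by one or more digits.
pattern = re.compile(r'extract this:\s\d+')


def find_first(frame, regex):
    """Return the first regex match found in any string cell, or None.

    Hypothetical helper: walks the cells row by row, ignoring
    non-string values such as numbers and NaN.
    """
    for cell in frame.to_numpy().ravel():
        if isinstance(cell, str):
            match = regex.search(cell)
            if match:
                return match.group(0)
    return None


result = find_first(df, pattern)
print(result)
```

Because the column names are never referenced, this works regardless of how many columns the DataFrame has, and it returns None instead of raising when the string is absent.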
Upvotes: 1