Reputation: 95
In Python I'm trying to extract a single value from a Pandas dataframe. I know exactly what the value contains, I just need to find it anywhere in the dataframe and extract it.
For example, in the dataframe below:
df = pd.DataFrame(
{0: ['BA1234', 'CA:1234', 'DA','DA1234', 'EX DA', 'CA1234'],
1: ['BA1234', 'CA:1234', 'DA','CA1234', 'EX DA', 'CA1234'],
2: ['BA1234', 'CA:1234', 'DA','CA1234', 'EX DA', 'CA1234']})
I want to extract the string containing the two letters 'DA' and exactly 4 digits after it.
I've been trying this using a mask:
mask = pd.DataFrame(np.column_stack([df[col].str.contains('^DA\d{4}', na = False) for col in df]))
Which seems to work:
da_value = df[mask]
da_value
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 DA1234 NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
However, how do I extract the value from the dataframe? Is there a better/easier way of doing this?
Edit: The output I actually want is
da_value = 'DA1234'
Upvotes: 2
Views: 3120
Reputation: 148870
As you want to search the value anywhere in the dataframe, you could reshape the values to make it a single dimension Series:
s = pd.Series(df.values.reshape(len(df) * len(df.columns)))
s = s.loc[s.str.match(r'DA[0-9]{4}')]
if len(s) == 0:
print('Not found')
else:
print(s.iloc[0])
With your example data it just prints
DA1234
Upvotes: 0
Reputation: 3770
if you only want that row in which a string follows the condition, below works
using re.findall
df.loc[df.apply(lambda x: True if re.findall('^DA\d{4}',x[0]) or re.findall('^DA\d{4}',x[1]) or re.findall('^DA\d{4}',x[2]) else False, axis=1)]
Output
0 1 2
3 DA1234 CA1234 CA1234
Updated
df.apply(lambda x: re.findall('^DA\d{4}',' '.join(list(x))), axis=1).any()[0]
Output
'DA1234'
Upvotes: 1
Reputation: 862406
Use DataFrame.stack
first for Series
, then filter by boolean indexing
with Series.str.contains
:
s = df.stack()
a = s[s.str.contains(r'^DA\d{4}', na=False)].tolist()
If need first value from list you can select:
print (a[0])
DA1234
Or general solution if possible no value exist, then default value is added:
print (next(iter(a), 'no match'))
DA1234
Upvotes: 3