Johnson
Johnson

Reputation: 95

Extract a single value from a pandas dataframe

In Python I'm trying to extract a single value from a Pandas dataframe. I know exactly what the value contains, I just need to find it anywhere in the dataframe and extract it.

For example, in the dataframe below:

df = pd.DataFrame(
        {0: ['BA1234', 'CA:1234', 'DA','DA1234', 'EX DA', 'CA1234'],
         1: ['BA1234', 'CA:1234', 'DA','CA1234', 'EX DA', 'CA1234'],
         2: ['BA1234', 'CA:1234', 'DA','CA1234', 'EX DA', 'CA1234']})

I want to extract the string containing the two letters 'DA' and exactly 4 digits after it.

I've been trying this using a mask:

mask = pd.DataFrame(np.column_stack([df[col].str.contains('^DA\d{4}', na = False) for col in df]))

Which seems to work:

da_value = df[mask]

da_value
        0    1    2
0     NaN  NaN  NaN
1     NaN  NaN  NaN
2     NaN  NaN  NaN
3  DA1234  NaN  NaN
4     NaN  NaN  NaN
5     NaN  NaN  NaN

However, how do I extract the value from the dataframe? Is there a better/easier way of doing this?

Edit: The output I actually want is

da_value = 'DA1234'

Upvotes: 2

Views: 3120

Answers (4)

Serge Ballesta
Serge Ballesta

Reputation: 148870

As you want to search the value anywhere in the dataframe, you could reshape the values to make it a single dimension Series:

s = pd.Series(df.values.reshape(len(df) * len(df.columns)))
s = s.loc[s.str.match(r'DA[0-9]{4}')]
if len(s) == 0:
    print('Not found')
else:
    print(s.iloc[0])

With your example data it just prints

DA1234

Upvotes: 0

iamklaus
iamklaus

Reputation: 3770

if you only want that row in which a string follows the condition, below works

using re.findall

df.loc[df.apply(lambda x: True if re.findall('^DA\d{4}',x[0]) or re.findall('^DA\d{4}',x[1]) or re.findall('^DA\d{4}',x[2]) else False, axis=1)]

Output

        0       1       2
3  DA1234  CA1234  CA1234

Updated

df.apply(lambda x: re.findall('^DA\d{4}',' '.join(list(x))), axis=1).any()[0]

Output

'DA1234'

Upvotes: 1

jezrael
jezrael

Reputation: 862406

Use DataFrame.stack first for Series, then filter by boolean indexing with Series.str.contains:

s = df.stack()
a = s[s.str.contains(r'^DA\d{4}', na=False)].tolist()

If need first value from list you can select:

print (a[0])
DA1234

Or general solution if possible no value exist, then default value is added:

print (next(iter(a), 'no match'))
DA1234

Upvotes: 3

anky
anky

Reputation: 75080

You can use df.apply() to apply series.str.contains() along with df.any() over axis=1 to fetch the rows where any column matches the pattern:

df[df.apply(lambda x: x.str.contains(r'^DA\d{4}', na=False)).any(axis=1)]

        0       1       2
3  DA1234  CA1234  CA1234

Upvotes: 2

Related Questions