younggotti
younggotti

Reputation: 822

Extract string between two substrings in a pandas df column

I have the following dataframe:

                                           contract
0  Future(conId=482048803, symbol='ESTX50', lastT...
1  Future(conId=497000453, symbol='XT', lastTrade...
2  Stock(conId=321100413, symbol='SXRS', exchange...
3  Stock(conId=473087271, symbol='ETHEEUR', excha...
4  Stock(conId=80268543, symbol='IJPA', exchange=...
5  Stock(conId=153454120, symbol='EMIM', exchange...
6  Stock(conId=75776072, symbol='SXR8', exchange=...
7  Stock(conId=257200855, symbol='EGLN', exchange...
8  Stock(conId=464974581, symbol='VBTC', exchange...
9  Future(conId=478135706, symbol='ZN', lastTrade...

I want to create a new "symbol" column with all symbols (ESTX50, XT, SXRS...).

In order to extract the substring between "symbol='" and the following single quote, I tried the following:

df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")

but I get a column of NaN. What am I doing wrong? Thanks

Upvotes: 4

Views: 1793

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35686

It looks like that is a column of objects, not strings:

import pandas as pd


class Future:
    def __init__(self, symbol):
        self.symbol = symbol

    def __repr__(self):
        return f'Future(symbol=\'{self.symbol}\')'


df = pd.DataFrame({'contract': [Future(symbol='ESTX50'), Future(symbol='XT')]})

df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")
print(df)

df:

                  contract symbol
0  Future(symbol='ESTX50')    NaN
1      Future(symbol='XT')    NaN

Notice pandas considers strings to be object type so the string accessor is still allowed to attempt to perform operations. However, it cannot extract because these are not strings.

We can either convert to string first with astype:

df['symbol'] = df.contract.astype(str).str.extract(r"symbol='(.*?)'")

df:

                  contract  symbol
0  Future(symbol='ESTX50')  ESTX50
1      Future(symbol='XT')      XT

However, the faster approach is to try to extract the object property:

df['symbol'] = [getattr(x, 'symbol', None) for x in df.contract]

Or with apply (which can be slower than the comprehension)

df['symbol'] = df.contract.apply(lambda x: getattr(x, 'symbol', None))

Both produce:

                  contract  symbol
0  Future(symbol='ESTX50')  ESTX50
1      Future(symbol='XT')      XT

Upvotes: 3

Related Questions