anarchy
anarchy

Reputation: 5184

How to modify a column of strings in pandas using a few conditionals (regex?)

I have a df that looks like this

    Identifier      Ticker      ISIN        Relationship Country .......
0   5.037663e+09    BTGGg.F     D10080162   Supplier     Germany .......
1   4.295870e+09    IVXG.DE     NaN         Supplier     Germany .......
2   5.043321e+09    SAPG.DE     D66992104   Customer     Germany
3   4.295869e+09    BMWG.DE     D12096109   Customer     Germany    
4   4.295870e+09    DTEGn.DE    D2035M136   Customer     Germany
5   4.295870e+09    IFXGn.DE    D35415104   Supplier     Germany
6   4.295869e+09    NSUG.DE     D04312100   Customer     Germany
7   5.000074e+09    EVKn.DE     D2R90Y117   Customer     Germany
8   4.295869e+09    LHAG.DE     D1908N106   Customer     Germany
9   4.295869e+09    MTXGn.DE    D5565H104   Supplier     Germany
10  4.295869e+09    SIEGn.DE    D69671218   Supplier     Germany
11  4.295870e+09    TKAG.DE     D8398Q119   Supplier     Germany
12  5.059963e+09    BNRGn.DE    D12459117   Customer     Germany
13  4.295869e+09    RHMG.DE     D65111102   Supplier     Germany
14  5.001195e+09    GBFG.DE     D11648108   Supplier     Germany
15  4.295869e+09    NXUG.DE     D5650J106   Customer     Germany
16  4.295870e+09    DPWGn.DE    D19225107   Supplier     Germany
17  4.295870e+09    ILM1k.DE    D22430116   Supplier     Germany
18  4.295869e+09    ADSGn.DE    D0066B185   Customer     Germany
19. 5.125125e+12    DBS.SG      D12300523.  Supplier     SG
........................................................................

In the rows where the df['Country'] == 'Germany',

I have 2 functions I want to perform.

function 1: I want to isolate the rows where there is a smaller case letter, any smaller case letter before the ".", I want to drop it, so BTGGg.F would become BTGG.F, 1LM1k.DE would become 1LM1.DE but NXUG.DE would be unaffected.

Using that new dataframe after the first function.

function 2: Then for the rows where there is an upper case G before the ".", I want to drop the G so RHMG.DE would become RHM.DE but 1LM1.DE would be unaffected.

If I wanted to drop just the letter before the ".", the function would be simple like x = x.replace(x[x.find(".")-1],"").

But I don't know how to do this in a dataframe, and I don't know how to apply the conditions I mentioned. Can it be done and if so, how?

I was thinking it might look something like this but this obviously didn't work I already tried it.

df.loc[df['Country'].eq('Germany'),'Ticker'] = df.loc[df['Country'].eq('Germany'),'Ticker'].str.replace((df['Ticker'][df['Ticker'].find(".")-1],"") if df['Ticker'][df['Ticker'].find(".")-1] == '([a-z])')

followed by

df.loc[df['Country'].eq('Germany'),'Ticker'] = df.loc[df['Country'].eq('Germany'),'Ticker'].str.replace((df['Ticker'][df['Ticker'].find(".")-1],"") if df['Ticker'][df['Ticker'].find(".")-1] == 'G')

Here is what the output should look like after the first round, from Country == Germany, remove the first lower case letter preceding "." :

    Identifier      Ticker      ISIN        Relationship Country .......
0   5.037663e+09    BTGG.F      D10080162   Supplier     Germany .......
1   4.295870e+09    IVXG.DE     NaN         Supplier     Germany .......
2   5.043321e+09    SAPG.DE     D66992104   Customer     Germany
3   4.295869e+09    BMWG.DE     D12096109   Customer     Germany    
4   4.295870e+09    DTEG.DE     D2035M136   Customer     Germany
5   4.295870e+09    IFXG.DE     D35415104   Supplier     Germany
6   4.295869e+09    NSUG.DE     D04312100   Customer     Germany
7   5.000074e+09    EVK.DE      D2R90Y117   Customer     Germany
8   4.295869e+09    LHAG.DE     D1908N106   Customer     Germany
9   4.295869e+09    MTXG.DE     D5565H104   Supplier     Germany
10  4.295869e+09    SIEG.DE     D69671218   Supplier     Germany
11  4.295870e+09    TKAG.DE     D8398Q119   Supplier     Germany
12  5.059963e+09    BNRG.DE     D12459117   Customer     Germany
13  4.295869e+09    RHMG.DE     D65111102   Supplier     Germany
14  5.001195e+09    GBFG.DE     D11648108   Supplier     Germany
15  4.295869e+09    NXUG.DE     D5650J106   Customer     Germany
16  4.295870e+09    DPWG.DE     D19225107   Supplier     Germany
17  4.295870e+09    ILM1.DE     D22430116   Supplier     Germany
18  4.295869e+09    ADSG.DE     D0066B185   Customer     Germany
19. 5.125125e+12    DBS.SG      D12300523   Supplier     SG
........................................................................

this is after the second round, from Country == Germany, remove the first upper case "G" preceding the ".":

    Identifier      Ticker      ISIN        Relationship Country .......
0   5.037663e+09    BTG.F       D10080162   Supplier     Germany .......
1   4.295870e+09    IVX.DE      NaN         Supplier     Germany .......
2   5.043321e+09    SAP.DE      D66992104   Customer     Germany
3   4.295869e+09    BMW.DE      D12096109   Customer     Germany    
4   4.295870e+09    DTE.DE      D2035M136   Customer     Germany
5   4.295870e+09    IFX.DE      D35415104   Supplier     Germany
6   4.295869e+09    NSU.DE      D04312100   Customer     Germany
7   5.000074e+09    EVK.DE      D2R90Y117   Customer     Germany
8   4.295869e+09    LHA.DE      D1908N106   Customer     Germany
9   4.295869e+09    MTX.DE      D5565H104   Supplier     Germany
10  4.295869e+09    SIE.DE      D69671218   Supplier     Germany
11  4.295870e+09    TKA.DE      D8398Q119   Supplier     Germany
12  5.059963e+09    BNR.DE      D12459117   Customer     Germany
13  4.295869e+09    RHM.DE      D65111102   Supplier     Germany
14  5.001195e+09    GBF.DE      D11648108   Supplier     Germany
15  4.295869e+09    NXU.DE      D5650J106   Customer     Germany
16  4.295870e+09    DPW.DE      D19225107   Supplier     Germany
17  4.295870e+09    ILM1.DE     D22430116   Supplier     Germany
18  4.295869e+09    ADS.DE      D0066B185   Customer     Germany
19. 5.125125e+12    DBS.SG      D12300523   Supplier     SG
........................................................................

Upvotes: 0

Views: 69

Answers (2)

OhadP
OhadP

Reputation: 81

For dealing with the functions you mentioned, you can use some regex to perform them both

re.sub(r'G?[a-z]?(?=\.)', '', 'HERE_YOUR_COUNTRY_THINGY')

You didn't given more rules

Upvotes: 0

Andrej Kesely
Andrej Kesely

Reputation: 195508

df.loc[df.Country == 'Germany', 'Ticker'] = df[df.Country == 'Germany']['Ticker'].str.replace(r'[a-z](?=\.)', '').str.replace(r'G(?=\.)', '')
print(df)

Prints:

      Identifier   Ticker        ISIN Relationship  Country
0   5.037663e+09    BTG.F   D10080162     Supplier  Germany
1   4.295870e+09   IVX.DE         NaN     Supplier  Germany
2   5.043321e+09   SAP.DE   D66992104     Customer  Germany
3   4.295869e+09   BMW.DE   D12096109     Customer  Germany
4   4.295870e+09   DTE.DE   D2035M136     Customer  Germany
5   4.295870e+09   IFX.DE   D35415104     Supplier  Germany
6   4.295869e+09   NSU.DE   D04312100     Customer  Germany
7   5.000074e+09   EVK.DE   D2R90Y117     Customer  Germany
8   4.295869e+09   LHA.DE   D1908N106     Customer  Germany
9   4.295869e+09   MTX.DE   D5565H104     Supplier  Germany
10  4.295869e+09   SIE.DE   D69671218     Supplier  Germany
11  4.295870e+09   TKA.DE   D8398Q119     Supplier  Germany
12  5.059963e+09   BNR.DE   D12459117     Customer  Germany
13  4.295869e+09   RHM.DE   D65111102     Supplier  Germany
14  5.001195e+09   GBF.DE   D11648108     Supplier  Germany
15  4.295869e+09   NXU.DE   D5650J106     Customer  Germany
16  4.295870e+09   DPW.DE   D19225107     Supplier  Germany
17  4.295870e+09  ILM1.DE   D22430116     Supplier  Germany
18  4.295869e+09   ADS.DE   D0066B185     Customer  Germany
19  5.125125e+12   DBS.SG  D12300523.     Supplier       SG

Upvotes: 1

Related Questions