Zizi96
Zizi96

Reputation: 519

How to replace partial strings within a df column

I have the following table. Some entries within the StockCode column have letters at the end, how do I replace these with a number?

I could make a dictionary to map each letter to a number but I feel there a quicker way to do so.

InvoiceNo  |StockCode |
0   536     85123   
1   536     71053Z  
2   536     84406B  
3   536     22623S  

Upvotes: 1

Views: 101

Answers (2)

Ben
Ben

Reputation: 21705

This should do the trick.

# Make some data
df = pd.DataFrame({
    'InvoiceNo':536,
    'StockCode':['85123', '71053Z', '84406B', '22623S']
})

# Define replacements in a dictionary. (Note the values are strings, not ints)
replacements = {'Z':'1', 'B':'2', 'S':'3'}

#-- Edit per OP's comment ---------------
import string
keys = list(string.ascii_uppercase)
values = [str(i) for i in range(len(keys))]
replacements = dict(zip(keys, values))
#---------------------------------------

# Rebuild Stockcode
# 1) df.StockCode.str.extract('^(\d+)') extracts every sequence of numbers in StockCode, up until the first non-number character
# 2) df.StockCode.str.extract('([A-Z])$') extracts the non-number character at the end of each string
# 3) .replace(replacements).fillna('') makes the replacements and then changes NaN to ''
# 4) adding two series of strings concatenates them
df['StockCode'] = (df.StockCode.str.extract('^(\d+)') + 
                   df.StockCode.str.extract('([A-Z])$').replace(replacements).fillna(''))

print(df)
   InvoiceNo StockCode
0        536     85123
1        536    710531
2        536    844062
3        536    226233

The key to this is about understanding regular expressions.

Upvotes: 1

Ch3steR
Ch3steR

Reputation: 20689

You can use pd.Series.str.replace which is similar to re.sub, you can pass a function to repl param.

from random import randint
from string import ascii_uppercase

mapping = {char:str(randint(0, 9)) for char in ascii_uppercase} 
#{'A': '1', 'B': '3', 'C': '1', 'D': '1', 'E': '4', 'F': '9', 'G': '9', 'H': '5',
#'I': '8', 'J': '0', 'K': '0', 'L': '1', 'M': '5', 'N': '3', 'O': '8', 'P': '5', 
#'Q': '5', 'R': '8', 'S': '9', 'T': '8', 'U': '0', 'V': '8', 'W': '1', 'X': '6', 
#'Y': '7', 'Z': '5'}

def repl(match):
    return ''.join(mapping[char] for char in match.group(0))

df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl)
df

  InvoiceNo StockCode
0        536     85123
1        536    710535  # 'Z' -> 5
2        536    844063  # 'B' -> 3
3        536    226239  # 'S' -> 9

When there is more than one alphabet at the end.

df = pd.DataFrame({
     'InvoiceNo':536,
     'StockCode':['85123', '71053ZAZ', '84406BAR', '22623BIR']
    })
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl)

   InvoiceNo  StockCode
0        536      85123
1        536   71053515  # 'ZAZ' -> '515'
2        536   84406318  # 'BAR' -> '318'
3        536   22623388  # 'BIR' -> '388'

Upvotes: 2

Related Questions