Reputation: 519
I have the following table. Some entries within the StockCode column have letters at the end, how do I replace these with a number?
I could make a dictionary to map each letter to a number but I feel there a quicker way to do so.
InvoiceNo |StockCode |
0 536 85123
1 536 71053Z
2 536 84406B
3 536 22623S
Upvotes: 1
Views: 101
Reputation: 21705
This should do the trick.
# Make some data
df = pd.DataFrame({
'InvoiceNo':536,
'StockCode':['85123', '71053Z', '84406B', '22623S']
})
# Define replacements in a dictionary. (Note the values are strings, not ints)
replacements = {'Z':'1', 'B':'2', 'S':'3'}
#-- Edit per OP's comment ---------------
import string
keys = list(string.ascii_uppercase)
values = [str(i) for i in range(len(keys))]
replacements = dict(zip(keys, values))
#---------------------------------------
# Rebuild Stockcode
# 1) df.StockCode.str.extract('^(\d+)') extracts every sequence of numbers in StockCode, up until the first non-number character
# 2) df.StockCode.str.extract('([A-Z])$') extracts the non-number character at the end of each string
# 3) .replace(replacements).fillna('') makes the replacements and then changes NaN to ''
# 4) adding two series of strings concatenates them
df['StockCode'] = (df.StockCode.str.extract('^(\d+)') +
df.StockCode.str.extract('([A-Z])$').replace(replacements).fillna(''))
print(df)
InvoiceNo StockCode
0 536 85123
1 536 710531
2 536 844062
3 536 226233
The key to this is about understanding regular expressions.
Upvotes: 1
Reputation: 20689
You can use pd.Series.str.replace
which is similar to re.sub
, you can pass a function to repl
param.
from random import randint
from string import ascii_uppercase
mapping = {char:str(randint(0, 9)) for char in ascii_uppercase}
#{'A': '1', 'B': '3', 'C': '1', 'D': '1', 'E': '4', 'F': '9', 'G': '9', 'H': '5',
#'I': '8', 'J': '0', 'K': '0', 'L': '1', 'M': '5', 'N': '3', 'O': '8', 'P': '5',
#'Q': '5', 'R': '8', 'S': '9', 'T': '8', 'U': '0', 'V': '8', 'W': '1', 'X': '6',
#'Y': '7', 'Z': '5'}
def repl(match):
return ''.join(mapping[char] for char in match.group(0))
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl)
df
InvoiceNo StockCode
0 536 85123
1 536 710535 # 'Z' -> 5
2 536 844063 # 'B' -> 3
3 536 226239 # 'S' -> 9
When there is more than one alphabet at the end.
df = pd.DataFrame({
'InvoiceNo':536,
'StockCode':['85123', '71053ZAZ', '84406BAR', '22623BIR']
})
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl)
InvoiceNo StockCode
0 536 85123
1 536 71053515 # 'ZAZ' -> '515'
2 536 84406318 # 'BAR' -> '318'
3 536 22623388 # 'BIR' -> '388'
r"[A-Z]{,3}$"
in regex101Upvotes: 2