Reputation: 61
I am new to Python so apologies if it comes across as a really easy question. I have the following example dataframe below (the real one has thousands of rows)
df = {'City': ['London','Tokyo','London','Paris','Paris','London','Tokyo','Tokyo', 'Paris','Berlin','Berlin','Berlin'],
'Code': ['367','812','367','964','964','BN611','812','Y366','Z167','L715','412','L715']}
df = pd.DataFrame(data=df)
What I am trying to do is substitute all the instances where the code for a city starts with a letter with its numeric equivalent (for example, the code for each line where London appears should always be 367 instead of 367 and BN611).
Is there a way to do this easily without having to write a code for each city separately? The dataframe I have contains a few hundred more cities and I will need to apply this to every one of them. The codes for each city are always the same (for example London always has the code 367 and BN611, etc.)
Thank you
Upvotes: 2
Views: 125
Reputation: 261015
You can mask
the invalid codes and use a groupby.transform
to replace with the valid ones:
# which codes start with a letter?
m = df['Code'].str.match(r'[A-Z]')
# mask them and replace each code by the first valid one per city
df['Code'] = df['Code'].mask(m).groupby(df['City']).transform('first')
output (as new column "Code2" for clarity):
City Code Code2
0 London 367 367
1 Tokyo 812 812
2 London 367 367
3 Paris 964 964
4 Paris 964 964
5 London BN611 367
6 Tokyo 812 812
7 Tokyo Y366 812
8 Paris Z167 964
9 Berlin L715 412
10 Berlin 412 412
11 Berlin L715 412
Upvotes: 2