Reputation: 43
I'm rather new to Python and i'm having some troubles. I have the following dataframe:
import pandas as pd
data = {'v1':('Belgium[country]', 'Antwerp[city]', 'Gent[city]', 'France[country]', 'Paris[city]', 'Marseille[city]', 'Toulouse[city]', 'Spain[country]', 'Madrid[city]')}
df = pd.DataFrame(data)
df
v1
0 Belgium[country]
1 Antwerp[city]
2 Gent[city]
3 France[country]
4 Paris[city]
5 Marseille[city]
6 Toulouse[city]
7 Spain[country]
8 Madrid[city]
Which I want to map to the following format:
v1 v2
0 Belgium[country] Antwerp[city]
1 Belgium[country] Gent[city]
2 France[country] Paris[city]
3 France[country] Marseille[city]
4 France[country] Toulouse[city]
5 Spain[country] Madrid[city]
I found a way to do this using a dictionary, but since I want to maintain the order I'm looking for a way to do this using a list or so.
I tried it both based on the indexes and on the values themselves (specifically [country] and [city]), but i failed with both. Any help is much appreciated!
Upvotes: 4
Views: 101
Reputation: 862591
Solution without groupby
:
#rename columns
df = df.rename(columns={'v1':'v2'})
#get counter
counter= df.v2.str.contains('country').cumsum()
#get mask where are changed country to city
df.insert(0, 'v1', df.loc[counter.ne(counter.shift()), 'v2'])
#forward filling NaN
df.v1 = df.v1.ffill()
#remove rows where v1 == v2
df = df[df.v1.ne(df.v2)].reset_index(drop=True)
print (df)
v1 v2
0 Belgium[country] Antwerp[city]
1 Belgium[country] Gent[city]
2 France[country] Paris[city]
3 France[country] Marseille[city]
4 France[country] Toulouse[city]
5 Spain[country] Madrid[city]
Timings:
In [189]: %timeit (jez(df))
100 loops, best of 3: 2.47 ms per loop
In [191]: %timeit (IanS(df1))
100 loops, best of 3: 5.06 ms per loop
Code for timings:
def jez(df):
df = df.rename(columns={'v1':'v2'})
counter= df.v2.str.contains('country').cumsum()
df.insert(0, 'v1', df.loc[counter.ne(counter.shift()), 'v2'])
df.v1 = df.v1.ffill()
df = df[df.v1.ne(df.v2)].reset_index(drop=True)
return (df)
def IanS(df):
counter = df['v1'].str.contains('country').cumsum()
result = df.groupby(counter).apply(lambda g: g[1:]).reset_index(level=1, drop=True)
result = result.rename(columns={'v1': 'v2'}).reset_index(drop=False)
result['v1'] = result['v1'].replace(df.groupby(counter).first().squeeze())
return (result)
Upvotes: 2
Reputation: 16241
This will work:
counter = df['v1'].str.contains('country').cumsum()
result = df.groupby(counter).apply(lambda g: g[1:]).reset_index(level=1, drop=True)
result = result.rename(columns={'v1': 'v2'}).reset_index(drop=False)
result['v1'] = result['v1'].replace(df.groupby(counter).first().squeeze())
The idea is to add a counter that is incremented for each new country. You can then group by this counter to access the information you need.
Specifically, the first step is to keep only cities (g[1:]
for each group g
). Then do some renaming and reindexing. Finally, use the result from another groupby (giving the country) to replace the values in column v1
.
Upvotes: 2