Reputation: 303
I want to fix a data frame that has the following aspect:
etiqueta | suma |
---|---|
2015-10 | 33 |
Baja California | 12 |
Campeche | 21 |
2015-11 | 12 |
Colima | 6 |
Ciudad de México | 6 |
2015-12 | 30 |
Ciudad de México | 20 |
Quintana Roo | 10 |
To make it look like this?
fecha | Baja California | Campeche | Colima | Ciudad de México | Quintana Roo |
---|---|---|---|---|---|
2015-10 | 12 | 21 | 0 | 0 | 0 |
2015-11 | 0 | 0 | 6 | 6 | 0 |
2015-12 | 0 | 0 | 0 | 20 | 10 |
I already tried regex to create another column with the dates but I'm blocked
Upvotes: 0
Views: 67
Reputation: 26676
#Create column fetcha
by extracting dates. You do this by creating a new group using cumsum.
df['fecha']=df.groupby(df['etiqueta'].str.contains('\d').cumsum())['etiqueta'].apply(lambda x: x.str.extract('(\d{4}\-\d{2})')).fillna(method='ffill')
Once extracted, drop dates from etiqueta
and pivot.
df[~df['etiqueta'].str.contains('-\d')].pivot(index='fecha', columns='etiqueta', values='suma').fillna(0).reset_index()
Following your comments. Looks like you have duplicates in the index.
df['fecha']=df.groupby(df['etiqueta'].str.contains('\d').cumsum())['etiqueta'].apply(lambda x: x.str.extract('(\d{4}\-\d{2})')).fillna(method='ffill')
df2=df[~df['etiqueta'].str.contains('-\d')]
pd.pivot(df2,index='fecha', columns='etiqueta', values='suma').fillna(0).reset_index()
etiqueta fecha Baja California Campeche Ciudad de México Colima \
0 2015-10 12.0 21.0 0.0 0.0
1 2015-11 0.0 0.0 6.0 6.0
2 2015-12 0.0 0.0 20.0 0.0
etiqueta Quintana Roo
0 0.0
1 0.0
2 10.0
Upvotes: 1
Reputation: 323306
Let us do pd.to_datetime
then mask those return NaN
and fill it with ffill
df['new'] = df['etiqueta'].mask(pd.to_datetime(df['etiqueta'], format = '%Y-%m', errors='coerce').isna()).ffill()
out = df.query('etiqueta!=new').pivot_table(index = 'new',columns = 'etiqueta',values= 'suma',fill_value=0)
Out[213]:
etiqueta Baja California Campeche Ciudad de México Colima Quintana Roo
new
2015-10 12 21 0 0 0
2015-11 0 0 6 6 0
2015-12 0 0 20 0 10
Upvotes: 2
Reputation: 35646
Find rows where etiqueta
matches the date pattern YYYY-MM
.
Create Groups based on where those dates appear using cumsum
. Use groupby transform to get the date from each group at the end of every row in that group.
Use a pivot_table
to transition to wide format based on date.
Reset index and cleanup axis names.
import pandas as pd
df = pd.DataFrame({'etiqueta': {0: '2015-10', 1: 'Baja California',
2: 'Campeche', 3: '2015-11', 4: 'Colima',
5: 'Ciudad de México', 6: '2015-12',
7: 'Ciudad de México', 8: 'Quintana Roo'},
'suma': {0: 33, 1: 12, 2: 21, 3: 12, 4: 6, 5: 6, 6: 30,
7: 20, 8: 10}})
# Mask where matches date pattern
m = df['etiqueta'].str.match(r'\d{4}-\d{2}')
# Use Transform to add date to end of every row in each group
# (Date is the first element in each group)
df['fecha'] = df.groupby(m.cumsum())['etiqueta'].transform(lambda g: g.iloc[0])
# Pivot On groups (Excluding Date Rows)
out = df[~m].pivot_table(index='fecha',
columns='etiqueta',
values='suma',
fill_value=0)
# Reset Index and Drop Axis Name
out = out.reset_index().rename_axis(None, axis=1)
# For Display
print(out.to_string())
Out
:
fecha Baja California Campeche Ciudad de México Colima Quintana Roo
0 2015-10 12 21 0 0 0
1 2015-11 0 0 6 6 0
2 2015-12 0 0 20 0 10
Upvotes: 1