I want to fix a data frame that has the following aspect: etiqueta suma 2015-10 33 Baja California 12 Campeche 21 2015-11 12 Colima 6 Ciudad de México 6 2015-12 30 Ciudad de México 20 Quintana Roo 10 To make it look like this? fecha Baja California Campeche Colima Ciudad de México Quintana Roo 2015-10 12 21 0 0 0 2015-11 0 0 6 6 0 2015-12 0 0 0 20 10 I already tried regex to create another column with the dates but I'm blocked

Reputation: 303

How can I clean a column that has dates and variables at the same time on Pandas?

I want to fix a data frame that has the following aspect:

etiqueta	suma
2015-10	33
Baja California	12
Campeche	21
2015-11	12
Colima	6
Ciudad de México	6
2015-12	30
Ciudad de México	20
Quintana Roo	10

To make it look like this?

fecha	Baja California	Campeche	Colima	Ciudad de México	Quintana Roo
2015-10	12	21	0	0	0
2015-11	0	0	6	6	0
2015-12	0	0	0	20	10

I already tried regex to create another column with the dates but I'm blocked

Upvotes: 0

Answers (3)

wwnde

Reputation: 26676

#Create column fetcha by extracting dates. You do this by creating a new group using cumsum.

df['fecha']=df.groupby(df['etiqueta'].str.contains('\d').cumsum())['etiqueta'].apply(lambda x: x.str.extract('(\d{4}\-\d{2})')).fillna(method='ffill')

Once extracted, drop dates from etiqueta and pivot.

df[~df['etiqueta'].str.contains('-\d')].pivot(index='fecha', columns='etiqueta', values='suma').fillna(0).reset_index()

Following your comments. Looks like you have duplicates in the index.

df['fecha']=df.groupby(df['etiqueta'].str.contains('\d').cumsum())['etiqueta'].apply(lambda x: x.str.extract('(\d{4}\-\d{2})')).fillna(method='ffill')


df2=df[~df['etiqueta'].str.contains('-\d')]

pd.pivot(df2,index='fecha', columns='etiqueta', values='suma').fillna(0).reset_index()


etiqueta    fecha  Baja California  Campeche  Ciudad de México  Colima  \
0         2015-10             12.0      21.0               0.0     0.0   
1         2015-11              0.0       0.0               6.0     6.0   
2         2015-12              0.0       0.0              20.0     0.0   

etiqueta  Quintana Roo  
0                  0.0  
1                  0.0  
2                 10.0

Upvotes: 1

BENY

Reputation: 323306

Let us do pd.to_datetime then mask those return NaN and fill it with ffill

df['new'] = df['etiqueta'].mask(pd.to_datetime(df['etiqueta'], format = '%Y-%m', errors='coerce').isna()).ffill()
out = df.query('etiqueta!=new').pivot_table(index = 'new',columns = 'etiqueta',values= 'suma',fill_value=0)
Out[213]: 
etiqueta  Baja California  Campeche  Ciudad de México  Colima  Quintana Roo
new                                                                        
2015-10                12        21                 0       0             0
2015-11                 0         0                 6       6             0
2015-12                 0         0                20       0            10

Upvotes: 2

Henry Ecker

Reputation: 35646

Find rows where etiqueta matches the date pattern YYYY-MM.

Create Groups based on where those dates appear using cumsum. Use groupby transform to get the date from each group at the end of every row in that group.

Use a pivot_table to transition to wide format based on date.

Reset index and cleanup axis names.

import pandas as pd

df = pd.DataFrame({'etiqueta': {0: '2015-10', 1: 'Baja California',
                                2: 'Campeche', 3: '2015-11', 4: 'Colima',
                                5: 'Ciudad de México', 6: '2015-12',
                                7: 'Ciudad de México', 8: 'Quintana Roo'},
                   'suma': {0: 33, 1: 12, 2: 21, 3: 12, 4: 6, 5: 6, 6: 30,
                            7: 20, 8: 10}})

# Mask where matches date pattern
m = df['etiqueta'].str.match(r'\d{4}-\d{2}')

# Use Transform to add date to end of every row in each group
# (Date is the first element in each group)
df['fecha'] = df.groupby(m.cumsum())['etiqueta'].transform(lambda g: g.iloc[0])

# Pivot On groups (Excluding Date Rows)
out = df[~m].pivot_table(index='fecha',
                         columns='etiqueta',
                         values='suma',
                         fill_value=0)

# Reset Index and Drop Axis Name
out = out.reset_index().rename_axis(None, axis=1)

# For Display
print(out.to_string())

Out:

     fecha  Baja California  Campeche  Ciudad de México  Colima  Quintana Roo
0  2015-10               12        21                 0       0             0
1  2015-11                0         0                 6       6             0
2  2015-12                0         0                20       0            10

Upvotes: 1

How can I clean a column that has dates and variables at the same time on Pandas?

Answers (3)

Related Questions