BenL
BenL

Reputation: 97

Systematic heading-to-value data cleaning in Pandas

I have a very wide dataset which consists of hundreds of date-value column pairs - however the heading of the values column contains the reference of the site from which the data is taken. I'd like to take this header as a new "site_name" column before pivoting this data to a long format.

The data for each site is the same 2-column format, so I'd like to be able to apply a solution across the whole dataset at once.

My code below illustrates the problem on a single date-value pair

Note: I've used asterisks to mean I'm describing the column names, rather than quoting them

import pandas as pd

current = pd.DataFrame({"*unnamed_date_column*" : ["2021-10-21", "2021-10-22", "2021-10-23"],
                        "*unique_site_name*"    : [1.1, 1.2, 1.3]})

desired = pd.DataFrame({"date"   : ["2021-10-21", "2021-10-22", "2021-10-23"], 
                        "values" : [1.1, 1.2, 1.3], 
                        "site"   : ["unique_site_name", "unique_site_name", "unique_site_name"]})

Upvotes: 2

Views: 70

Answers (3)

BenL
BenL

Reputation: 97

Solved:

Almost certainly not the optimal way to do this but worked for my 200-ish column data frame:

  • Extract the names of all columns
  • Split these into groups of 2 using np.array_split()
  • Iterate over this list of date-value pairs, extracting the data applying the assign/rename-with-dictionary solve by @mozway
  • Save the altered dataframes into a new dictionary and concatenate
colnames = np.array_split(list(current.columns), len(list(current.columns)) / 2)    

reorg = {}

for i in range(0, length):
        df = current[colnames[i]]
        rename_ind = [0, 1]
        old_names = df.columns[rename_ind]
        new_names = ['date', 'value']
    
        new_df = df.assign(site = df.columns[1])
        new_df.rename(columns = dict(zip(old_names, new_names)), inplace = True)
    
        reorg[i] = new_df

pd.concat(reorg.values())

Upvotes: 0

eshirvana
eshirvana

Reputation: 24613

you can use melt :

desired = current.melt(id_vars=["*unnamed_date_column*"],var_name=['site']).rename(columns ={"*unnamed_date_column*": "date"})

output:

         date                site  value
0  2021-10-21  *unique_site_name*    1.1
1  2021-10-22  *unique_site_name*    1.2
2  2021-10-23  *unique_site_name*    1.3

Upvotes: 1

mozway
mozway

Reputation: 262359

Difficult to know how this will generalize without knowing more examples, but you could try:

desired = (current
 .assign(site=current.columns[-1]) # arbitrarily chose to index from end
 .rename(columns=dict(zip(current.columns, ['date', 'values'])))
)

output:

         date  values                site
0  2021-10-21     1.1  *unique_site_name*
1  2021-10-22     1.2  *unique_site_name*
2  2021-10-23     1.3  *unique_site_name*

Upvotes: 1

Related Questions