Creating Dataframe from inconsistently named columns

Question

I have a pandas.DataFrame that has redundant column names due to the files(.csv) having inconsistent naming. This is resulting in columns with mostly NaN values

Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType
 NaN       NaN     W20848      NaN             Subscriber
 NaN       NaN     W20231      NaN             Subscriber
 NaN       NaN     W00785      NaN             Subscriber
 NaN       NaN     W00126      NaN             Subscriber
 NaN       NaN     W20929      NaN             Casual

Is there a way to create a new column and populate it from multiple columns that have values? And if more than one column is not NaN, can i select which column to pull the value from?

 Bike#   Bikenumber   Bike #   Selected_Num
number1   number2      NaN       number2

I can get this when trying to fill with a single column

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'])
print(sample)

    Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType   Bike_Num
     NaN       NaN     W20848      NaN             Subscriber       W20848
     NaN       NaN     W20231      NaN             Subscriber       W20231
     NaN       NaN     W00785      NaN             Subscriber       W00785
     NaN       NaN     W00126      NaN             Subscriber       W00126
     NaN       NaN     W20929      NaN             Casual           W20929

This fails for

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'], sample['Bikenumber'])

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Stephen Rauch · Accepted Answer

I suggest you solve this at the time of reading in the CSV's instead of trying to untangle them later. One way to do that is to use a small parser for the CSV files before passing them to pandas.

This parser takes an open file handle to the csv, and a dict which maps the desired column names to the various possible synonyms.

Code:

def read_my_csv(file_handle, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synoms = dict(sum([
        [(syn, k) for syn in v] for k, v in column_map.items()], []))

    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and map columns to desired names
    header = next(reader)
    header = [synoms.get(c, c) for c in header]

    # yield the header
    yield header

    # yield the remaining rows
    for row in reader:
        yield row

Test Code:

import pandas as pd
import csv

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

with open("sample.csv", 'rU') as f:
    generator = read_my_csv(f, column_map)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)

Sample.csv:

Bike #,SubscriptionType
W20848,Subscriber
W20231,Subscriber
W00785,Subscriber
W00126,Subscriber
W20929,Casual

Results:

  Bike_Num     Sub_Num
0   W20848  Subscriber
1   W20231  Subscriber
2   W00785  Subscriber
3   W00126  Subscriber
4   W20929      Casual

Solution #2

A cleaner, but not nearly as fun, solution is to rename the columns before doing the concat:

Code:

def fix_column_names(df, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synoms = dict(sum([
        [(syn, k) for syn in v] for k, v in column_map.items()], []))

    # rename columns
    df.columns = [synoms.get(c, c) for c in df.columns]

Test Code:

import pandas as pd
import csv

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

df = pd.read_csv('sample.csv', header=0)
fix_column_names(df, column_map)
print(df)

Creating Dataframe from inconsistently named columns

Answers (1)

Related Questions