Reputation: 411
I have a pandas.DataFrame
that has redundant column names due to the files(.csv) having inconsistent naming. This is resulting in columns with mostly NaN values
Bike # Bikenumber Bike# SubscriberType SubscriptionType
NaN NaN W20848 NaN Subscriber
NaN NaN W20231 NaN Subscriber
NaN NaN W00785 NaN Subscriber
NaN NaN W00126 NaN Subscriber
NaN NaN W20929 NaN Casual
Is there a way to create a new column and populate it from multiple columns that have values? And if more than one column is not NaN
, can i select which column to pull the value from?
Bike# Bikenumber Bike # Selected_Num
number1 number2 NaN number2
I can get this when trying to fill with a single column
sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'])
print(sample)
Bike # Bikenumber Bike# SubscriberType SubscriptionType Bike_Num
NaN NaN W20848 NaN Subscriber W20848
NaN NaN W20231 NaN Subscriber W20231
NaN NaN W00785 NaN Subscriber W00785
NaN NaN W00126 NaN Subscriber W00126
NaN NaN W20929 NaN Casual W20929
This fails for
sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'], sample['Bikenumber'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Upvotes: 1
Views: 54
Reputation: 49812
I suggest you solve this at the time of reading in the CSV's instead of trying to untangle them later. One way to do that is to use a small parser for the CSV files before passing them to pandas
.
This parser takes an open file handle to the csv
, and a dict which maps the desired column names to the various possible synonyms.
Code:
def read_my_csv(file_handle, column_map):
# reverse the column mapping dict to use for synonym lookup
synoms = dict(sum([
[(syn, k) for syn in v] for k, v in column_map.items()], []))
# build csv reader
reader = csv.reader(file_handle)
# get the header, and map columns to desired names
header = next(reader)
header = [synoms.get(c, c) for c in header]
# yield the header
yield header
# yield the remaining rows
for row in reader:
yield row
Test Code:
import pandas as pd
import csv
column_map = {
'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}
with open("sample.csv", 'rU') as f:
generator = read_my_csv(f, column_map)
columns = next(generator)
df = pd.DataFrame(generator, columns=columns)
print(df)
Sample.csv:
Bike #,SubscriptionType
W20848,Subscriber
W20231,Subscriber
W00785,Subscriber
W00126,Subscriber
W20929,Casual
Results:
Bike_Num Sub_Num
0 W20848 Subscriber
1 W20231 Subscriber
2 W00785 Subscriber
3 W00126 Subscriber
4 W20929 Casual
Solution #2
A cleaner, but not nearly as fun, solution is to rename the columns before doing the concat:
Code:
def fix_column_names(df, column_map):
# reverse the column mapping dict to use for synonym lookup
synoms = dict(sum([
[(syn, k) for syn in v] for k, v in column_map.items()], []))
# rename columns
df.columns = [synoms.get(c, c) for c in df.columns]
Test Code:
import pandas as pd
import csv
column_map = {
'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}
df = pd.read_csv('sample.csv', header=0)
fix_column_names(df, column_map)
print(df)
Upvotes: 1