Reputation: 15152
I am trying to fit this data:
[['Manufacturer: Hyundai',
'Model: Tucson',
'Mileage: 258000 km',
'Registered: 07/2019'],
['Manufacturer: Mazda',
'Model: 6',
'Year: 2014',
'Registered: 07/2019']]
to pandas DataFrame.
Not all labels are present in each record, for example some records have 'Mileage' and others don't and vice versa. I have a total of 26 features and very few items have all of them.
I would like to construct pandas DataFrame that will hold features in columns and if feature doesn't exists than content should be 'NaN'.
I have
colnames=['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year'...(all 26 features here)]
df = pd.read_csv("./data/output.csv", sep=",", names=colnames, header=None)
Few first prerequisite columns are giving expected output but when it comes to optional features than missing data causing features after that to turn out under wrong columns. Records are mapped correctly only if all features are present.
I forgot to mention that some features that are missing value also don't have ":" but are present in list. So in this 2 cases:
assignment for both cases should be 'NaN'.
Upvotes: 2
Views: 434
Reputation: 862511
Use nested list comprehension for list of dictionaries and pass to DataFrame
contructor, if same key is missing is added NaN
:
L = [['Manufacturer: Hyundai',
'Model: Tucson',
'Mileage: 258000 km',
'Registered: 07/2019'],
['Manufacturer: Mazda',
'Model: 6',
'Year: 2014',
'Registered: 07/2019']]
df = pd.DataFrame([dict(y.split(':') for y in x) for x in L])
print (df)
Manufacturer Mileage Model Registered Year
0 Hyundai 258000 km Tucson 07/2019 NaN
1 Mazda NaN 6 07/2019 2014
EDIT: You can use .split(maxsplit=1)
for split by first whitespace:
L = [['Manufacturer Hyundai',
'Model Tucson',
'Mileage 258000 km',
'Registered 07/2019'],
['Manufacturer Mazda',
'Model 6',
'Year 2014',
'Registered 07/2019']]
df = pd.DataFrame([dict(y.split(maxsplit=1) for y in x) for x in L])
print (df)
Manufacturer Mileage Model Registered Year
0 Hyundai 258000 km Tucson 07/2019 NaN
1 Mazda NaN 6 07/2019 2014
EDIT:
L = [['Manufacturer Hyundai',
'Model Tucson',
'Mileage 258000 km',
'Registered 07/2019'],
['Manufacturer Mazda',
'Model 6',
'Year 2014',
'Registered 07/2019',
'Additional equipment aaa']]
words2 = ['Additional equipment']
L1 = []
for x in L:
di = {}
for y in x:
for word in words2:
if set(word.split(maxsplit=2)[:2]) < set(y.split()):
i, j, k = y.split(maxsplit=2)
di['_'.join([i, j])] = k
else:
i, j = y.split(maxsplit=1)
di[i] = j
L1.append(di)
df = pd.DataFrame(L1)
print (df)
Additional_equipment Manufacturer Mileage Model Registered Year
0 NaN Hyundai 258000 km Tucson 07/2019 NaN
1 aaa Mazda NaN 6 07/2019 2014
Upvotes: 4