KL_

Reputation: 309

Create DataFrame using lists with missing column data

CONTEXT

I am trying to create a DataFrame and fill in its columns depending on whether the inserted lists contain values for those columns.

Example Data:
Name    Height   Hair Color   Eye Color
Bob     72           Blonde       Blue
George  64                        Green
John                 Brown        Brown

The DataFrame's columns cover every variable I want to record. If a person is missing information for some columns, I'd like to fill in whatever is available.

Sample Data / Code

name = ['Name', 'Bob']     # Each list holds the column name followed by the value.
height = ['Height', '72']  # Is it possible to look up height[0] in the columns and put height[1] there?
eye_color = ['Eye Color', 'Brown']

person = [name, height, eye_color]
columns = ['Name', 'Height', 'Hair Color', 'Eye Color'] 

df = pd.DataFrame(person, columns = columns)

Expected Outcome

Name    Height   Hair Color   Eye Color
Bob     72                    Brown

PROBLEM

I want to pass in a person, fill in the columns for which information is present, and leave the remaining columns blank. I'd also like to append more people to the DataFrame in the same way. Is this possible?

Please let me know if any additional details would help in answering this question!

Upvotes: 0

Views: 336

Answers (3)

wwii

Reputation: 23773

You can make an empty DataFrame and just specify the columns.

In [21]: df = pd.DataFrame(columns=['name','a','b','c'])

In [22]: df
Out[22]: 
Empty DataFrame
Columns: [name, a, b, c]
Index: []

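Then you can append rows as dicts (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this needs an older pandas):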

In [23]: df = df.append({'name':'bob','c':0},ignore_index=True)

In [24]: df
Out[24]: 
  name    a    b  c
0  bob  NaN  NaN  0

In [25]: df = df.append({'name':'geo','b':'foo'},ignore_index=True)

In [26]: df
Out[26]: 
  name    a    b    c
0  bob  NaN  NaN    0
1  geo  NaN  foo  NaN

Multiple rows:

In [32]: more = [{'name':'qq','b':'apples'},
                 {'name':'wildbill','a':'nickels'},
                 {'name':'lastone','b':'potatoes','c':16}]

In [33]: df = df.append(more,ignore_index=True)

In [34]: df
Out[34]: 
       name        a         b    c
0       bob      NaN       NaN    0
1       geo      NaN       foo  NaN
2        qq      NaN    apples  NaN
3  wildbill  nickels       NaN  NaN
4   lastone      NaN  potatoes   16
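On pandas 2.0 and later, where DataFrame.append no longer exists, the same row-by-dict idea can be expressed with pd.concat. The following is a minimal sketch, not the answerer's code; the variable name new_rows is just an illustration.

```python
import pandas as pd

columns = ['name', 'a', 'b', 'c']

# Start with two rows; keys missing from a dict become NaN.
df = pd.DataFrame([{'name': 'bob', 'c': 0},
                   {'name': 'geo', 'b': 'foo'}], columns=columns)

more = [{'name': 'qq', 'b': 'apples'},
        {'name': 'wildbill', 'a': 'nickels'},
        {'name': 'lastone', 'b': 'potatoes', 'c': 16}]

# Turn the new dicts into a frame with the same columns, then concatenate.
new_rows = pd.DataFrame(more, columns=columns)
df = pd.concat([df, new_rows], ignore_index=True)
print(df)
```

ignore_index=True renumbers the rows 0..n-1, which matches what append(..., ignore_index=True) did.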

Or, if the dicts between them cover all the columns, pass the list straight to the constructor:

In [36]: more
Out[36]: 
[{'b': 'apples', 'name': 'qq'},
 {'a': 'nickels', 'name': 'wildbill'},
 {'b': 'potatoes', 'c': 16, 'name': 'lastone'}]

In [37]: pd.DataFrame(more)
Out[37]: 
         a         b     c      name
0      NaN    apples   NaN        qq
1  nickels       NaN   NaN  wildbill
2      NaN  potatoes  16.0   lastone
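If the dicts might not cover every column, passing columns= to the constructor should take care of both the column order and any column that no dict mentions. A small sketch, with an extra column 'd' added here purely as an assumption for illustration:

```python
import pandas as pd

# 'c' and 'd' appear in none of the dicts below; 'd' is a made-up extra column.
columns = ['name', 'a', 'b', 'c', 'd']

more = [{'name': 'qq', 'b': 'apples'},
        {'name': 'wildbill', 'a': 'nickels'}]

# columns= fixes the column order and creates all-NaN columns for missing keys.
df = pd.DataFrame(more, columns=columns)
print(df)
```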

The DataFrame constructor will also consume a generator:

In [3]: more
Out[3]: 
[{'b': 'apples', 'name': 'qq'},
 {'a': 'nickels', 'name': 'wildbill'},
 {'b': 'potatoes', 'c': 16, 'name': 'lastone'}]

In [4]: def f():
   ...:     for d in more:
   ...:         yield d
   ...:         

In [5]: pd.DataFrame(f())
Out[5]: 
         a         b     c      name
0      NaN    apples   NaN        qq
1  nickels       NaN   NaN  wildbill
2      NaN  potatoes  16.0   lastone

There is probably a better way.

Upvotes: 1

David Erickson

Reputation: 16683

Here is a dynamic list comprehension method using the lists you have created in this example:

name = ['Name', 'Bob']
height = ['Height', '72']
eye_color = ['Eye Color', 'Brown']

person = [name, height, eye_color]
columns = ['Name', 'Height', 'Hair Color', 'Eye Color'] 

df = pd.DataFrame([{i:j} for (i,j) in zip([name[0], height[0], eye_color[0]],
                                          [name[1], height[1], eye_color[1]])
                         # iterate over the `columns` list; `df` does not exist yet at this point
                         for col in columns if i == col], columns=columns)
df = df.apply(lambda x: pd.Series(x.dropna().values))
df

    Name    Height  Hair Color  Eye Color
0    Bob        72         NaN      Brown
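Since each inner list is a [column, value] pair, a shorter route may be to let dict() turn the pairs into one record and hand that to the constructor. This is a sketch of an alternative, not the method above; the name record is only illustrative.

```python
import pandas as pd

name = ['Name', 'Bob']
height = ['Height', '72']
eye_color = ['Eye Color', 'Brown']

person = [name, height, eye_color]
columns = ['Name', 'Height', 'Hair Color', 'Eye Color']

# dict() accepts an iterable of two-item sequences: [key, value] pairs.
record = dict(person)

# One record becomes one row; columns the record lacks are filled with NaN.
df = pd.DataFrame([record], columns=columns)
print(df)
```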

Upvotes: 0

noah

Reputation: 2786

Are you open to rethinking what a person object is? If so, consider using a dict for each person, as below. It makes life much easier.

import pandas as pd

columns = ['Name', 'Height', 'Hair Color', 'Eye Color'] 
df = pd.DataFrame(columns = columns)

person = {'Name':['Bob'], 'Height':['72'], 'Eye Color': ['Brown']}
person2 = {'Name':['Sue'], 'Height':['48'], 'Eye Color': ['Blue'], 'Hair Color': ['Blonde']}
person3 = {'Name':['Hank'], 'Height':['74'], 'Hair Color': ['Black']}

# add persons... could loop through (DataFrame.append requires pandas < 2.0)
df = df.append(pd.DataFrame(person))
df = df.append(pd.DataFrame(person2))
df = df.append(pd.DataFrame(person3))
print(df)

   Name Height Hair Color Eye Color
0   Bob     72        NaN     Brown
0   Sue     48     Blonde      Blue
0  Hank     74      Black       NaN

If you don't want to change person, you can also write a simple function to convert it:

def person_to_dict(person):
    person_dict = {}
    for attr in person:
        person_dict[attr[0]]=[attr[1]]
    return person_dict
person = person_to_dict(person)
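Putting the conversion function to work on several people could look like the sketch below. It assumes pandas 2.x, so it uses pd.concat rather than append; the people list is sample data made up to mirror the three persons above.

```python
import pandas as pd

columns = ['Name', 'Height', 'Hair Color', 'Eye Color']

def person_to_dict(person):
    # Each attr is a [column, value] pair; wrap the value in a one-item list.
    return {attr[0]: [attr[1]] for attr in person}

# Hypothetical sample data in the question's list-of-pairs format.
people = [
    [['Name', 'Bob'], ['Height', '72'], ['Eye Color', 'Brown']],
    [['Name', 'Sue'], ['Height', '48'], ['Eye Color', 'Blue'], ['Hair Color', 'Blonde']],
    [['Name', 'Hank'], ['Height', '74'], ['Hair Color', 'Black']],
]

# One single-row frame per person, stacked with a fresh 0..n-1 index.
frames = [pd.DataFrame(person_to_dict(p)) for p in people]
df = pd.concat(frames, ignore_index=True).reindex(columns=columns)
print(df)
```

ignore_index=True avoids the repeated 0 index seen in the output above, and reindex(columns=columns) restores the intended column order.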

Upvotes: 1
