FooBar

Reputation: 16478

Adding a Pandas Dataframe-column to a new dataframe

Using pandas, I have some data that I want to add to my "results" dataframe. That is, I have

naics = someData

Which can look like this

   indnaics  ind1990
89    81393      873

However, it can have more than one row. I want to add these rows to my results dataframe, together with a variable called year. If there is more than one row, all rows should get the same year value. This is what I have tried so far:

for job in jobs:
    df2 =  iGetThisFromJob()
    years = df2.year.unique()
    naics = iGetThisFromJob()
    if len(naics) == 0:
        continue

    for year in years:
        wages = df2.incwage[df2.year == year]
        # Add all the data to results, this is how I try it
        rows = pd.DataFrame([dict(year=year, incwage=mean(wages))])
        # I also want to add the column indnaics from my naics
        rows['naics'] = naics.indnaics
        results = results.append(rows, ignore_index=True)

However, even though naics.indnaics contains values, I cannot add it to the rows object this way: the new column ends up as NaN.

naics.indnaics

Out[1052]: 
89    81393

rows['naics'] = naics.indnaics
rows

Out[1051]: 
        incwage  year naics
0  45853.061224  2002   NaN

If there is anything else that could be improved in my code, please tell me. I'm only beginning to learn pandas.

Thanks!

/edit Expected output:

        incwage  year   naics
0  45853.061224  2002   81393
0  45853.061224  2002   12312

/edit Suggested solution:

index = arange(0, len(naics))
columns = ['year', 'incwage', 'naics']
rows = pd.DataFrame(index=index, columns=columns)
rows.year = year
rows.incwage = mean(wages)
rows.naics = naics.indnaics.values

Upvotes: 0

Views: 9066

Answers (1)

joris

Reputation: 139132

You get a NaN value because the indices do not match: in rows['naics'] = naics.indnaics, rows has index 0 while naics.indnaics has index 89, and assigning a Series to a column aligns on the index labels rather than on position.

You could solve that, for example, by assigning only the values (e.g. naics.indnaics.values). With a toy example:

In [30]: df = pd.DataFrame({'A':[0], 'B':[1]})
In [31]: df
Out[31]: 
   A  B
0  0  1


In [32]: s = pd.Series([2], index=[83])
In [33]: s
Out[33]: 
83    2
dtype: int64

In [35]: df['new_column'] = s
In [36]: df
Out[36]: 
   A  B  new_column
0  0  1         NaN

In [37]: df['new_column'] = s.values
In [38]: df
Out[38]: 
   A  B  new_column
0  0  1           2
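
To make the alignment rule a little more concrete, here is a minimal sketch with a frame of three rows. The names df, s and t and all the values are made up for illustration. It shows that only matching labels are filled in, and that dropping the Series index (here with to_numpy(), which should behave like .values in this case) makes the assignment positional:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2]})        # default index 0, 1, 2
s = pd.Series([10, 20], index=[2, 89])     # only label 2 also exists in df

# Assignment aligns on labels: row 2 receives 10, rows 0 and 1 become NaN,
# and the value stored under label 89 is not used at all.
df["aligned"] = s

# Without an index there is nothing to align on, so values go in by position.
# The lengths must match in that case.
t = pd.Series([7, 8, 9], index=[89, 90, 91])
df["positional"] = t.to_numpy()

print(df)
```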

If you want to add a series that may hold more than one value, there are a couple of options.

One is to first reindex the dataframe to the length of the series:

In [75]: s
Out[75]: 
83    2
84    4
dtype: int64

In [76]: df
Out[76]: 
   A  B
0  0  1

In [77]: df = df.reindex(np.zeros(len(s)))
In [78]: df
Out[78]: 
   A  B
0  0  1
0  0  1

In [79]: df['new_column'] = s.values

In [80]: df
Out[80]: 
   A  B  new_column
0  0  1           2
0  0  1           4

Or, the other way around, add the dataframe to the series (which you first convert to a dataframe):

In [90]: ss = s.to_frame().set_index(np.array([0,0]))
In [91]: ss[df.columns] = df
In [92]: ss
Out[92]: 
   0  A  B
0  2  0  1
0  4  0  1

[2 rows x 3 columns]
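
Applied to the loop in the question, the same idea might look like the sketch below. The contents of naics and df2 are hypothetical stand-ins, since the real data comes from functions that are not shown. It also collects the pieces in a list and combines them with pd.concat, because DataFrame.append was removed in pandas 2.0:

```python
import pandas as pd

# Hypothetical stand-ins for the data described in the question.
naics = pd.DataFrame(
    {"indnaics": [81393, 12312], "ind1990": [873, 873]}, index=[89, 90]
)
df2 = pd.DataFrame(
    {"year": [2002, 2002, 2003], "incwage": [40000.0, 50000.0, 30000.0]}
)

pieces = []
for year, wages in df2.groupby("year")["incwage"]:
    # The scalars are broadcast to the length of the naics array, so every
    # naics row gets the same year and the same mean wage.
    rows = pd.DataFrame(
        {
            "year": year,
            "incwage": wages.mean(),
            # to_numpy() drops the index, so no alignment takes place.
            "naics": naics["indnaics"].to_numpy(),
        }
    )
    pieces.append(rows)

# Concatenate once at the end instead of growing results inside the loop.
results = pd.concat(pieces, ignore_index=True)
print(results)
```

Building each piece from a dict also avoids having to create an empty frame and fill its columns afterwards.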

Upvotes: 2
