Using pandas, I have some data that I want to add to my "results" dataframe. That is, I have
naics = someData
Which can look like this
    indnaics  ind1990
89     81393      873
However, it can have more than one row. I want to add these rows to my results
dataframe, together with a variable called year. If there is more than one row, all rows should get the same year
value. This is what I am trying so far:
for job in jobs:
    df2 = iGetThisFromJob()
    years = df2.year.unique()
    naics = iGetThisFromJob()
    if len(naics) == 0:
        continue
    for year in years:
        wages = df2.incwage[df2.year == year]
        # Add all the data to results, this is how I try it
        rows = pd.DataFrame([dict(year=year, incwage=mean(wages), )])
        # I also want to add the column indnaics from my naics
        rows['naics'] = naics.indnaics
        results = results.append(rows, ignore_index=True)
However, despite naics.indnaics being full, I cannot add it this way to the rows object.
naics.indnaics
Out[1052]:
89    81393

rows['naics'] = naics.indnaics
rows
Out[1051]:
        incwage  year  naics
0  45853.061224  2002    NaN
If there is anything else that is not nice with my code, please tell. I'm only beginning to learn pandas.
Thanks!
/edit Expected output:
        incwage  year  naics
0  45853.061224  2002  81393
0  45853.061224  2002  12312
/edit Suggested solution:
index = arange(0, len(naics))
columns = ['year', 'incwage', 'naics']
rows = pd.DataFrame(index=index, columns=columns)
rows.year = year
rows.incwage = mean(wages)
rows.naics = naics.indnaics.values
Upvotes: 0
Views: 9066
Reputation: 139132
The reason you get a NaN value is that the indices do not match: in rows['naics'] = naics.indnaics, rows has index 0 while naics.indnaics has index 89, and the assignment tries to align the two on their index labels.
You could, for example, solve that by taking only the values (e.g. naics.indnaics.values). With a toy example:
In [30]: df = pd.DataFrame({'A':[0], 'B':[1]})

In [31]: df
Out[31]:
   A  B
0  0  1

In [32]: s = pd.Series([2], index=[83])

In [33]: s
Out[33]:
83    2
dtype: int64

In [35]: df['new_column'] = s

In [36]: df
Out[36]:
   A  B  new_column
0  0  1         NaN

In [37]: df['new_column'] = s.values

In [38]: df
Out[38]:
   A  B  new_column
0  0  1           2
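The same behaviour can be reproduced as a standalone script. This is only a sketch on the toy data above; the variable names aligned, by_position and relabelled are made up for illustration. It also shows a second way around the problem: resetting the index of the series so that its labels match the dataframe's, which only helps when the dataframe has the default 0, 1, 2, ... index.

```python
import pandas as pd

df = pd.DataFrame({"A": [0], "B": [1]})
s = pd.Series([2], index=[83])

# Assignment aligns on index labels: label 83 does not exist in df,
# so the only row of df (label 0) receives NaN.
aligned = df.copy()
aligned["new_column"] = s

# Option 1: strip the labels and assign by position.
by_position = df.copy()
by_position["new_column"] = s.to_numpy()

# Option 2: relabel the series 0..n-1 so its labels match df's default index.
relabelled = df.copy()
relabelled["new_column"] = s.reset_index(drop=True)

print(aligned)
print(by_position)
print(relabelled)
```

Both options assume the series and the dataframe have the same number of rows.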
If you want to add a series that may hold more than one value, there are a couple of options.
One is to reindex the dataframe to the length of the series first:
In [75]: s
Out[75]:
83    2
84    4
dtype: int64

In [76]: df
Out[76]:
   A  B
0  0  1

In [77]: df = df.reindex(np.zeros(len(s)))

In [78]: df
Out[78]:
   A  B
0  0  1
0  0  1

In [79]: df['new_column'] = s.values

In [80]: df
Out[80]:
   A  B  new_column
0  0  1           2
0  0  1           4
Or, the other way around, add the dataframe to the series (which you first convert to a dataframe):
In [90]: ss = s.to_frame().set_index(np.array([0,0]))

In [91]: ss[df.columns] = df

In [92]: ss
Out[92]:
   0  A  B
0  2  0  1
0  4  0  1

[2 rows x 3 columns]
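Applied to the situation in the question, one possible sketch is to build the rows in a single DataFrame call and let pandas broadcast the scalar values. The sample data below is made up to mimic the question (the index labels 89 and 90, the second naics code and the mean wage are placeholders), and the pieces list is a hypothetical name. Note that DataFrame.append was removed in pandas 2.0, so the sketch collects the pieces in a list and calls pd.concat instead.

```python
import pandas as pd

# Placeholder data standing in for the question's naics, year and mean wage.
naics = pd.DataFrame(
    {"indnaics": [81393, 12312], "ind1990": [873, 873]},
    index=[89, 90],
)
year = 2002
mean_wage = 45853.061224

# Passing the codes as a plain array avoids index alignment; the scalar
# year and wage are broadcast to the length of that array.
rows = pd.DataFrame(
    {
        "year": year,
        "incwage": mean_wage,
        "naics": naics["indnaics"].to_numpy(),
    }
)

# In a loop, append each `rows` to a list and concatenate once at the end.
pieces = [rows]
results = pd.concat(pieces, ignore_index=True)
print(results)
```

Concatenating once at the end also avoids copying the growing results dataframe on every iteration.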
Upvotes: 2