Reputation: 135
I have data from a csv that produces a dataframe that looks like the following:
d = {"clf_2007": [20],
"e_2007": [25],
"ue_2007": [17],
"clf_2008": [300],
"e_2008": [20],
"ue_2008": [10]}
df = pd.DataFrame(d)
which produces a data frame (forgive me for not knowing how to properly code that into stackoverflow)
clf_2007 clf_2008 e_2007 e_2008 ue_2007 ue_2008
0 20 300 25 20 17 10
I want to manipulate that data to produce something that looks like this:
clf e ue
2007 20 25 17
2008 300 20 10
2007 and 2008 in the original column names represent dates, but they don't need to be datetime now. I need to merge them with another dataframe that has the same "dates" eventually, but I can figure that out later.
Thus far, I've tried groupbys and I've tried them by string indexes (like str[ :8]) and such, and, outside of it not working, I don't even think groupby is the right tool. I've also tried pd.PeriodIndex, but, again, that doesn't seem like the right tool to me.
Is there a standardized way to do something like this? Or is the brute force way (get it into an excel spreadsheet and just move the data around manually), the only way to get what I'm looking for here?
Upvotes: 0
Views: 34
Reputation: 57105
You can generate a column multiindex:
df.columns = pd.MultiIndex.from_tuples([col.split("_") for col in df])
print(df.columns)
# clf e ue
# 2007 2008 2007 2008 2007 2008
And then stack the table:
df = df.stack()
print(df)
# clf e ue
#0 2007 20 25 17
# 2008 300 20 10
You can optionally flatten the index, too:
df.index = df.index.get_level_values(1)
print(df)
# clf e ue
#2007 20 25 17
#2008 300 20 10
Upvotes: 1
Reputation: 1824
I think this will be a lot easier if you pre-process your data to have three columns: key
, year
and value
. Something like:
rows = []
for k, v in d.iteritems():
key, year = k.split("_")
for val in v:
rows.append({'key': key, 'year': year, 'value': val})
Put those rows into a dataframe, call it dfA
. I'm assuming you might have more than one value for each (key, year)
pair and you want to aggregate them somehow. I'll assume you do that and end up with a dataframe called df
, whose columns are still key
, year
, and value
. At that point, you just need to pivot:
pd.pivot_table(df,index=['year'], columns=['key'])
You end up with multi-indexed rows/columns that you'll want to clean up, but I'll leave that to you.
Upvotes: 1