KyleF

Reputation: 135

grouping and reordering by partial identifiers in python

I have data from a csv that produces a dataframe that looks like the following:

d = {"clf_2007": [20],
    "e_2007": [25],
    "ue_2007": [17],
    "clf_2008": [300],
    "e_2008": [20],
    "ue_2008": [10]}
df = pd.DataFrame(d)

which produces a data frame that looks like this (forgive me for not knowing how to format this properly on Stack Overflow):

   clf_2007     clf_2008    e_2007  e_2008  ue_2007    ue_2008
0   20            300         25       20      17         10

I want to manipulate that data to produce something that looks like this:

        clf     e     ue
2007    20      25    17
2008    300     20    10

2007 and 2008 in the original column names represent dates, but they don't need to be datetime now. I need to merge them with another dataframe that has the same "dates" eventually, but I can figure that out later.

Thus far, I've tried groupby, including grouping by string slices of the column names (like str[:8]), and, outside of it not working, I don't even think groupby is the right tool. I've also tried pd.PeriodIndex, but, again, that doesn't seem like the right tool either.

Is there a standardized way to do something like this? Or is the brute force way (get it into an excel spreadsheet and just move the data around manually), the only way to get what I'm looking for here?

Upvotes: 0

Views: 34

Answers (2)

DYZ

Reputation: 57105

You can generate a column multiindex:

df.columns = pd.MultiIndex.from_tuples([col.split("_") for col in df])
print(df)
#   clf         e        ue
#  2007 2008 2007 2008 2007 2008
# 0  20  300   25   20   17   10

And then stack the table:

df = df.stack()
print(df)
#        clf   e  ue
#0 2007   20  25  17
#  2008  300  20  10

You can optionally flatten the index, too:

df.index = df.index.get_level_values(1)
print(df)
#      clf   e  ue
#2007   20  25  17
#2008  300  20  10

Upvotes: 1

hoyland

Reputation: 1824

I think this will be a lot easier if you pre-process your data to have three columns: key, year and value. Something like:

rows = []
for k, v in d.items():  # d.iteritems() in Python 2
    key, year = k.split("_")
    for val in v:
        rows.append({'key': key, 'year': year, 'value': val})

Put those rows into a dataframe, call it dfA. I'm assuming you might have more than one value for each (key, year) pair and you want to aggregate them somehow. I'll assume you do that and end up with a dataframe called df, whose columns are still key, year, and value. At that point, you just need to pivot:

pd.pivot_table(df, index=['year'], columns=['key'])

You end up with multi-indexed rows/columns that you'll want to clean up, but I'll leave that to you.
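An end-to-end sketch of this approach; the column names key/year/value come from the loop above, and the mean aggregation is pivot_table's default (an assumption about how you'd want duplicates combined):

```python
import pandas as pd

d = {"clf_2007": [20], "e_2007": [25], "ue_2007": [17],
     "clf_2008": [300], "e_2008": [20], "ue_2008": [10]}

# Pre-process into long-format rows: one (key, year, value) per entry.
rows = []
for k, v in d.items():  # d.iteritems() in Python 2
    key, year = k.split("_")
    for val in v:
        rows.append({'key': key, 'year': year, 'value': val})

dfA = pd.DataFrame(rows)

# Pivot years into rows and keys into columns; values are averaged
# (pivot_table's default aggfunc) if a (key, year) pair repeats.
result = pd.pivot_table(dfA, index=['year'], columns=['key'], values='value')
print(result)
# key    clf     e    ue
# year
# 2007  20.0  25.0  17.0
# 2008 300.0  20.0  10.0
```

Note the values come out as floats because of the mean aggregation; cast back with `result.astype(int)` if that matters.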

Upvotes: 1
