Reputation: 4063
In python pandas I have a dataframe
df_aaa:
date data otherdata symbol
2015/1/1 11 12 aaa
2015/2/1 21 22 aaa
2015/3/1 31 31 aaa
df_all:
2015/1/1 31 31 bbb
Currently the index of both is by date
.
I want to append df_aaa to df_all, and have them with a composite index of both symbol and date.
Basically the following are all one question: How do I set a multi-index and use it when appending. Can I do it with different column order? Do I need to refresh? Etc.:
I'm not sure if a multi-index
is an index that has multiple 'columns' (or rows), or is it the ability to have more than one index (and any of them could be for multiple columns or rows). Or are both correct?
Must I first set the index of both dataframes to a multi-index, so the append will work? (otherwise I'll have duplicates for different symbols
Do I have to "drop" the existing index before creating the new one?
Is there such a thing as a dataframe with data but no index?
Must a (single) index be of unique values?
When do I use which of the following dataframe methods: set_index()
, reindex()
, reset_index()
, set_level
, reset_level
?
Do I have to add anything (like axis=1
) when setting the index?
How do I set the index to be the data in a column. (And why does sometimes using ['symbol', 'date'] as a parameter, give me a new column with those two values, instead of setting the index on the existing values of the columns with those two names?)
After I append and assuming the old index is correct do I need to 'update' the index (perhaps using reindex?) or since I told the dataframe that the index is in a certain column, is my data correctly indexed?
And since my dataframes (will) have indices on the same column name, can I do an append of df_aaa on df_all even if df_all was defined to have the columns originally in a different order. (say: ['symbol', 'date', 'data', 'otherdata'] with symbol the first column)?
Upvotes: 1
Views: 4682
Reputation: 4063
Here is what I gather from the answers and dragging through the docs:
There is a "default index" which is a "row-number" for each row, and which is not part of any of the columns.
When merging with that index, there (seems to be) no need to re-index.
But if I want to change the index after it was made "non-standard" I have to "reset_index()" and turn it back to the default, and then from there I can create the new multi index (as explained in the revisioned answer below)
A multi-index is one that has more than one key (i.e. if indexing the rows, then more than one column will be used).
I'm still not sure if you have to re-index a column after a merge, but according to this it seems you get an automatically generated new "default index" and have to save the old one, remove the index before merge (reset_index) and set it again when done.
The other question about the index replacing a column - I'll check and get back here.
This is a follow-up.
Upvotes: 1
Reputation: 4747
Just append df's and reset_index()
to be able to set_index()
with keys
argument. Here's oneliner:
df_all = df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
And here is full working sample.
In [1]: import pandas as pd
...: from io import StringIO
...:
In [2]: df_aaa = pd.read_csv(StringIO("""date data otherdata symbol
...: 2015/1/1 11 12 aaa
...: 2015/2/1 21 22 aaa
...: 2015/3/1 31 31 aaa
...: """), sep="\s+", index_col='date')
...:
In [3]: df_all = pd.read_csv(StringIO("""date data otherdata symbol
...: 2015/1/1 31 31 bbb"""), sep="\s+", index_col='date')
...:
In [4]: df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
Out[4]:
data otherdata
symbol date
bbb 2015/1/1 31 31
aaa 2015/1/1 11 12
2015/2/1 21 22
2015/3/1 31 31
Upvotes: 3
Reputation: 403130
You can just concatenate them and then set the index.
df_aaa = df_aaa.reset_index()
df_all = df_all.reset_index()
df = df_aaa.append(df_all).set_index(['symbol', 'date'])
Note that this would work only if your dataframes have the same column.s
If you must perform multiple appends in the future, the best thing to do would be to get one of them in the shape of the other, perform the concatenation, and reset index as needed.
I'll answer all your questions one by one.
I'm not sure if a multi-index is an index that has multiple 'columns' (or rows), or is it the ability to have more than one index (and any of them could be for multiple columns or rows). Or are both correct?
It depends on what axis you're referring to. Along the row (0th axis), you have 2 or more columns forming a MultiIndex. Similarly for along the columns (1st axis).
Must I first set the index of both dataframes to a multi-index, so the append will work? (otherwise I'll have duplicates for different symbols
No need. Although you could, not doing so would be simpler in this case.
Do I have to "drop" the existing index before creating the new one? No, just that the columns must align (column name and number of columns should be the same).
Is there such a thing as a dataframe with data but no index?
No. All rows are indexed. Even if there is no column as the index, the index is a monotonically increasing number. The model followed here is similar to that in RDBMs.
Must a (single) index be of unique values?
In general, the must, so rows can be uniquely identified. If you have a MultiIndex
, each combination of values that make up the index must be unique.
When do I use which of the following dataframe methods: set_index(), reindex(), reset_index(), set_level, reset_level?
This is a broad question. It depends, when do you want to operate on the index and if so, what do you want to do with it? Look at the documentation for each one carefully.
Upvotes: 4