Reputation: 168
I have a time series of stellar flux that is periodic in nature. For this data I created a DataFrame with a time column and flux column.
import pandas as pd
from pathlib import Path
file: Path = localdir / "file_path.csv"
est_period: float = ...  # number_I_estimated
df = pd.read_csv(file, names=['t', 'f'])
df['stack_t'] = df['t'] % est_period
stacked = df[['stack_t', 'f']].sort_values(by='stack_t')
I create a new column by applying the modulus operator % to the time series 't' with the estimated period, and stack the series on top of itself by calling df.sort_values(by='stack_t').
I noticed that DataFrame.sort_values(inplace=True) does not seem to reindex the data set. If you sort the data set and then find the minimum of f over mask = stacked['stack_t'] > somenumber, it turns out that argmin(stacked[mask]['f']) returns the index from df, not from stacked.
To get this to work it turns out that I have to manually reindex the array:
stacked = df[['stack_t', 'f']].sort_values(by='stack_t').reindex(range(0, len(df)))
Is this expected behavior? .sort_values already returns a copy of df. Why is the copy not reindexed?
Upvotes: 0
Views: 41
Reputation: 1752
That's perfectly normal and desirable behavior of DataFrame.sort_values(), not a bug.
Indexes are usually thought of as identifiers for data points, not as indicators of position (you can think of them as dictionary keys). So, when you sort the data or select a subset of it, you want to keep the corresponding ID for each point, even if this breaks the original sequential order (if there was one).
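A minimal sketch of that idea, using made-up values rather than your data: after sorting, each row keeps its original label, so label-based lookups (.loc, idxmin) and position-based lookups (.iloc, NumPy's argmin on the raw values) answer different questions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the question).
df = pd.DataFrame({"stack_t": [0.3, 0.1, 0.2], "f": [5.0, 7.0, 4.0]})

stacked = df.sort_values(by="stack_t")

# Labels travel with their rows, so the index is now out of order.
labels_after_sort = list(stacked.index)

# idxmin returns the *label* of the smallest f; it is valid in df and stacked.
min_label = stacked["f"].idxmin()

# argmin on the underlying array returns the *position* within stacked.
min_pos = int(np.argmin(stacked["f"].to_numpy()))

# Both routes reach the same row, one by label and one by position.
same_row = stacked.loc[min_label].equals(stacked.iloc[min_pos])
```

If you only need the row itself, using the label with .loc avoids having to renumber anything.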
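IMO, for the cases where you don't want to keep the original association, the right tool is .reset_index(drop=True), which discards the old labels and numbers the rows 0 to n-1 in their current order. Be careful with .reindex(range(0, len(df))): it looks rows up by their existing labels, so on a sorted copy it puts the rows back in their original order.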
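Here is a small sketch on toy data (the values are made up) comparing the two calls on a sorted copy. It assumes df has the default 0..n-1 index, as in the question.

```python
import pandas as pd

# Toy data (hypothetical values), with the default 0..n-1 index.
df = pd.DataFrame({"stack_t": [0.3, 0.1, 0.2], "f": [5.0, 7.0, 4.0]})
stacked = df.sort_values(by="stack_t")

# reindex selects rows by label, so this restores the original row order.
reindexed = stacked.reindex(range(len(df)))

# reset_index(drop=True) keeps the sorted order and relabels rows 0..n-1.
renumbered = stacked.reset_index(drop=True)
```

In pandas 1.0 and later, df.sort_values(by='stack_t', ignore_index=True) gives the renumbered result in a single call.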
Upvotes: 1