Reputation: 1496
I'm using pandas to analyse financial records.
I have a DataFrame
that comes from a csv file that looks like this:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 800 entries, 2010-10-27 00:00:00 to 2011-07-12 00:00:00
Data columns:
debit 800 non-null values
transaction_type 799 non-null values
transaction_date_raw 800 non-null values
credit 800 non-null values
transaction_description 800 non-null values
account_number 800 non-null values
sort_code 800 non-null values
balance 800 non-null values
dtypes: float64(3), int64(1), object(4)
I am selecting a subset based on transaction amount:
c1 = df['credit'].map(lambda x: x > 1000)
milestones = df[c1].sort()
and want to create slices of the original df based on the dates between the milestones:
delta = dt.timedelta(days=1)
for i in range(len(milestones.index)-1):
start = milestones.index[i].date()
end = milestones.index[i+1].date() - delta
rng = date_range(start, end)
this generates a new series with the dates between my milestones.
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-11-29 00:00:00, ..., 2010-12-30 00:00:00]
Length: 32, Freq: D, Timezone: None
I have followed several approaches to slice my df using these new series (rng) but have failed:
df.ix[start:end] or
df.ix[rng]
this raises: IndexError: invalid slice
df.reindex(rng) or df.reindex(index=rng)
raises: Exception: Reindexing only valid with uniquely valued Index objects
x = [v for v in rng if v in df.index]
df[x]
df.ix[x]
df.index[x]
this also raises invalid slice, and so does:
df.truncate(start, end)
I'm new to pandas, I'm following the early release of the book from Oreilly, and really enjoying it. Any pointers would be appreciated.
Upvotes: 6
Views: 7184
Reputation: 1496
I've managed to circumvent some of the issues highlighted above, here is a "solution" until the bugs mentioned by Chang She are resolved.
I start with my original TimeSeries indexed DataFrame as before. I sort the df, this sorts records by date (using the TimeSeries index).
df = df.sort()
once sorted I replace the df.index with a numerical index
df.index = range(len(df))
I subsequently extract my milestones as before with the difference now these will have an index which is a positive int, and create a list of that index:
milestones_list = milestones_df.index.tolist()
and extract the data between my milestones from the original df using the numeric index like so:
datasets = {}
for milestone in milestones_list:
milestone_index = milestones_list.index(milestone)
print "milestone {0} index {1}".format(milestone, milestone_index)
if milestone_index < len(milestones_list) -1:
x = df[milestones_df.index[milestone_index]:milestones_df.index[milestone_index+1]]
else:
x = df[milestones_df.index[milestone_index]:df.index.max()]
n = str(int(x.index.min())) +'-'+ str(int(x.index.max()))
datasets[n] = x
this creates a dict with DataFrames for each milestone time interval named as the index intervals they represent.
print datasets.keys()
['592-650', '448-527', '382-447', '264-318', '319-381', '118-198', '528-591', '728-798', '54-117', '199-263', '651-727']
Although admittedly not the ideal solution I hope it helps someone with similar issues.
Upvotes: 1
Reputation: 16960
It looks like you've hit a couple of known bugs in non-unique index handling:
https://github.com/pydata/pandas/issues/1201/
https://github.com/pydata/pandas/issues/1587/
A bug fix release is coming out very soon so please check the pandas website or PyPI in a week or so.
Thanks
Upvotes: 1