Reputation: 3
I have a pretty large CSV file that contains data from 2009-2015. I am wondering if there is an easy way to split this file into smaller files on a per-month basis. I could split the data into evenly sized chunks, but I would prefer to group the data by month.
DateTime Price Bid Ask Size
2009-09-28 09:30:17 35.5250 35.49 35.56 100
2009-09-28 09:30:18 35.5600 35.49 35.56 100
2009-09-28 09:30:18 35.5600 35.50 35.57 100
2009-09-28 09:30:20 35.5000 35.42 35.56 100
2009-09-28 09:30:20 35.5000 35.42 35.56 100
2009-09-28 09:30:30 35.4600 35.46 35.52 100
2009-09-28 09:30:30 35.4600 35.46 35.52 100
2009-09-28 09:30:30 35.5000 35.46 35.52 100
2009-09-28 09:30:33 35.5100 35.47 35.51 100
2009-09-28 09:30:40 35.5100 35.48 35.51 200
2009-09-28 09:30:41 35.5100 35.48 35.51 100
2009-09-28 09:30:42 35.4803 35.48 35.51 100
2009-09-28 09:30:42 35.4800 35.48 35.51 1044
... ... ... ... ...
2015-04-07 15:59:59 94.1200 94.10 94.12 100
2015-04-07 16:00:00 94.2000 94.09 94.60 300
2015-04-07 16:00:00 94.2100 94.09 94.60 100
2015-04-07 16:00:00 94.1800 94.09 94.60 217
2015-04-07 16:00:05 94.1100 94.09 94.59 600
2015-04-07 16:00:09 94.1100 94.09 94.59 350
2015-04-07 16:00:32 94.1100 94.09 94.59 2804
2015-04-07 16:00:32 94.1100 94.09 94.59 1582
2015-04-07 16:00:32 94.1100 94.09 94.59 100
2015-04-07 16:00:33 94.1100 94.09 94.59 600
2015-04-07 16:00:35 94.1100 94.09 94.59 16702
[29195283 rows x 5 columns]
Upvotes: 0
Views: 566
Reputation: 16144
In [1599]: y.head()
Out[1599]:
Price Bid Ask Size
DateTime
2009-09-28 09:30:17 35.5250 35.49 35.56 100
2009-09-28 09:30:18 35.5600 35.49 35.56 100
2009-09-28 09:30:18 35.5600 35.50 35.57 100
2009-09-28 09:30:20 35.5000 35.42 35.56 100
2009-09-28 09:30:20 35.5000 35.42 35.56 100
If you want to group by month or year, you could do it with:
y.groupby(y.index.year)
By month:
In [1597]: y.groupby(y.index.month).count()
Out[1597]:
Price Bid Ask Size
4 11 11 11 11
5 1 1 1 0
9 13 13 13 13
By year:
In [1598]: y.groupby(y.index.year).count()
Out[1598]:
Price Bid Ask Size
2009 13 13 13 13
2015 12 12 12 11
In [1604]: y.groupby(pd.Grouper(freq='M')).count().head()
Out[1604]:
Price Bid Ask Size
DateTime
2009-09-30 13 13 13 13
2009-10-31 0 0 0 0
2009-11-30 0 0 0 0
2009-12-31 0 0 0 0
2010-01-31 0 0 0 0
In [1605]: y.groupby(pd.Grouper(freq='D')).count().head()
Out[1605]:
Price Bid Ask Size
DateTime
2009-09-28 13 13 13 13
2009-09-29 0 0 0 0
2009-09-30 0 0 0 0
2009-10-01 0 0 0 0
2009-10-02 0 0 0 0
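To actually split the data into one file per month, you can iterate over the groups and write each non-empty chunk out with to_csv. A minimal sketch, assuming the full data is loaded as y with a DatetimeIndex as above (the ticks filename prefix is just an example):

import pandas as pd

for month_end, chunk in y.groupby(pd.Grouper(freq='M')):
    if chunk.empty:
        continue  # skip the empty months in between
    chunk.to_csv('ticks_{:%Y-%m}.csv'.format(month_end))

Each group key here is the month-end Timestamp, so the files come out named like ticks_2009-09.csv.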
Upvotes: 1
Reputation: 15433
With a file this large, you might not want to load the whole thing into a database or a list first; you can stream it line by line instead.
This example uses a very simple regular expression to pull the year and month off the front of each line. There are more robust ways to parse the date, but this should work for data like yours.
import re

fileroot = 'blah'  # prefix for the output files
with open(yourfile, 'r') as infile:
    for line in infile:
        # Match the YYYY-MM prefix of the line, so rows are binned per month
        datestr = re.match(r'\d{4}-\d\d', line)
        if datestr:
            # Append the line to that month's file, e.g. blah_2009-09.txt
            with open('{0}_{1}.txt'.format(fileroot, datestr.group(0)), 'a') as fil:
                fil.write(line)
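One caveat: this reopens the month's output file in append mode for every matching line, which is simple but slow over ~29 million rows. A sketch of the same idea that keeps one file handle open per month in a dict (same yourfile and fileroot assumptions as above):

import re

fileroot = 'blah'
outfiles = {}  # maps 'YYYY-MM' -> open file handle
with open(yourfile, 'r') as infile:
    for line in infile:
        m = re.match(r'\d{4}-\d\d', line)
        if m:
            month = m.group(0)
            if month not in outfiles:
                outfiles[month] = open('{0}_{1}.txt'.format(fileroot, month), 'w')
            outfiles[month].write(line)
for f in outfiles.values():
    f.close()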
Upvotes: 0
Reputation: 6606
Try df.groupby([df.index.year, df.index.month]) (with the DateTime column parsed and set as the index). This groups by year-month pairs, rather than lumping together, say, every September from every year.
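For example, a minimal sketch of splitting on those pairs, where each group key is a (year, month) tuple (the ticks prefix is just an example name):

import pandas as pd

# Assuming df has a DatetimeIndex like the data above
for (year, month), chunk in df.groupby([df.index.year, df.index.month]):
    chunk.to_csv('ticks_{0}-{1:02d}.csv'.format(year, month))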
Upvotes: 0