Reputation: 3
I have a pretty large CSV file that contains data from 2009-2015. I am wondering if there is an easy way to split this file into smaller files on a per-month basis. I could split the data into evenly sized chunks, but I would prefer to group the data by month.
DateTime Price Bid Ask Size
2009-09-28 09:30:17 35.5250 35.49 35.56 100
2009-09-28 09:30:18 35.5600 35.49 35.56 100
2009-09-28 09:30:18 35.5600 35.50 35.57 100
2009-09-28 09:30:20 35.5000 35.42 35.56 100
2009-09-28 09:30:20 35.5000 35.42 35.56 100
2009-09-28 09:30:30 35.4600 35.46 35.52 100
2009-09-28 09:30:30 35.4600 35.46 35.52 100
2009-09-28 09:30:30 35.5000 35.46 35.52 100
2009-09-28 09:30:33 35.5100 35.47 35.51 100
2009-09-28 09:30:40 35.5100 35.48 35.51 200
2009-09-28 09:30:41 35.5100 35.48 35.51 100
2009-09-28 09:30:42 35.4803 35.48 35.51 100
2009-09-28 09:30:42 35.4800 35.48 35.51 1044
... ... ... ... ...
2015-04-07 15:59:59 94.1200 94.10 94.12 100
2015-04-07 16:00:00 94.2000 94.09 94.60 300
2015-04-07 16:00:00 94.2100 94.09 94.60 100
2015-04-07 16:00:00 94.1800 94.09 94.60 217
2015-04-07 16:00:05 94.1100 94.09 94.59 600
2015-04-07 16:00:09 94.1100 94.09 94.59 350
2015-04-07 16:00:32 94.1100 94.09 94.59 2804
2015-04-07 16:00:32 94.1100 94.09 94.59 1582
2015-04-07 16:00:32 94.1100 94.09 94.59 100
2015-04-07 16:00:33 94.1100 94.09 94.59 600
2015-04-07 16:00:35 94.1100 94.09 94.59 16702
[29195283 rows x 5 columns]
Upvotes: 0
Views: 566
Reputation: 16144
In [1599]: y.head()
Out[1599]:
Price Bid Ask Size
DateTime
2009-09-28 09:30:17 35.5250 35.49 35.56 100
2009-09-28 09:30:18 35.5600 35.49 35.56 100
2009-09-28 09:30:18 35.5600 35.50 35.57 100
2009-09-28 09:30:20 35.5000 35.42 35.56 100
2009-09-28 09:30:20 35.5000 35.42 35.56 100
If you want to group by month or year, you could do it with:
y.groupby(y.index.year)
By month:
In [1597]: y.groupby(y.index.month).count()
Out[1597]:
Price Bid Ask Size
4 11 11 11 11
5 1 1 1 0
9 13 13 13 13
By year:
In [1598]: y.groupby(y.index.year).count()
Out[1598]:
Price Bid Ask Size
2009 13 13 13 13
2015 12 12 12 11
In [1604]: y.groupby(pd.Grouper(freq='M')).count().head()
Out[1604]:
Price Bid Ask Size
DateTime
2009-09-30 13 13 13 13
2009-10-31 0 0 0 0
2009-11-30 0 0 0 0
2009-12-31 0 0 0 0
2010-01-31 0 0 0 0
In [1605]: y.groupby(pd.Grouper(freq='D')).count().head()
Out[1605]:
Price Bid Ask Size
DateTime
2009-09-28 13 13 13 13
2009-09-29 0 0 0 0
2009-09-30 0 0 0 0
2009-10-01 0 0 0 0
2009-10-02 0 0 0 0
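To actually split the data into one file per month, you can iterate over the groups and write each non-empty chunk out with to_csv. A minimal sketch, assuming the full data is loaded as y with a DatetimeIndex as above (the ticks filename prefix is just an example):

import pandas as pd

for month_end, chunk in y.groupby(pd.Grouper(freq='M')):
    if chunk.empty:
        continue  # skip the empty months in between
    chunk.to_csv('ticks_{:%Y-%m}.csv'.format(month_end))

Each group key here is the month-end Timestamp, so the files come out named like ticks_2009-09.csv.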
Upvotes: 1
Reputation: 15433
With a file this large, you might not want to load the whole thing into a database or a list first; you can stream it line by line instead.
This example uses a very simple regular expression to pull the year and month off the front of each line. There are more robust ways to parse the date, but this should work for data like yours.
import re

fileroot = 'blah'  # prefix for the output files
with open(yourfile, 'r') as infile:
    for line in infile:
        # Match the YYYY-MM prefix of the line, so rows are binned per month
        datestr = re.match(r'\d{4}-\d\d', line)
        if datestr:
            # Append the line to that month's file, e.g. blah_2009-09.txt
            with open('{0}_{1}.txt'.format(fileroot, datestr.group(0)), 'a') as fil:
                fil.write(line)
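One caveat: this reopens the month's output file in append mode for every matching line, which is simple but slow over ~29 million rows. A sketch of the same idea that keeps one file handle open per month in a dict (same yourfile and fileroot assumptions as above):

import re

fileroot = 'blah'
outfiles = {}  # maps 'YYYY-MM' -> open file handle
with open(yourfile, 'r') as infile:
    for line in infile:
        m = re.match(r'\d{4}-\d\d', line)
        if m:
            month = m.group(0)
            if month not in outfiles:
                outfiles[month] = open('{0}_{1}.txt'.format(fileroot, month), 'w')
            outfiles[month].write(line)
for f in outfiles.values():
    f.close()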
Upvotes: 0
Reputation: 6606
Try df.groupby([df.index.year, df.index.month]) (with the DateTime column parsed and set as the index). This groups by year-month pairs, rather than lumping together, say, every September from every year.
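For example, a minimal sketch of splitting on those pairs, where each group key is a (year, month) tuple (the ticks prefix is just an example name):

import pandas as pd

# Assuming df has a DatetimeIndex like the data above
for (year, month), chunk in df.groupby([df.index.year, df.index.month]):
    chunk.to_csv('ticks_{0}-{1:02d}.csv'.format(year, month))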
Upvotes: 0