Reputation: 159
I am working with more than 6MM rows of ticker symbol data. I would like to grab all of the data for a symbol, do the processing I need, and output the results.
I have written code that tells me what line each ticker starts on (see the code below). I am thinking it would be more efficient if I knew what position a new symbol starts at (instead of the line number) so I could use seek(#) to easily jump to a ticker's starting position. I am also curious as to how to expand this logic to read an entire block of data (start_position to end_position) for a ticker.
import csv

data_line = 0  # holds the file line number for the symbol
ticker_start = 0
ticker_end = 0
cur_sec_ticker = ""
ticker_dl = []  # array for holding the line number in the source file for the start of each ticker

reader = csv.reader(open('C:\\temp\\sample_data.csv', 'rb'), delimiter=',')
for row in reader:
    if cur_sec_ticker != row[1]:  # only process a new ticker
        ticker_fr = str(data_line) + ',' + row[1]  # prep line for inserting into array
        # desired line for inserting into array, ticker_end would be the last
        # of the current ticker data block, which is the start of the next ticker
        # block (ticker_start - 1)
        #ticker_fr = str(ticker_start) + str(ticker_end) + str(data_line) + ',' + row[1]
        print ticker_fr
        ticker_dl.append(ticker_fr)
        cur_sec_ticker = row[1]
    data_line += 1
print ticker_dl
Below is a small sample of the data file:
seq,Symbol,Date,Open,High,Low,Close,Volume,MA200Close,MA50Close,PrimaryLast,filter_$
1,A,1/1/2008,36.74,36.74,36.74,36.74,0, , ,1,1
2,A,1/2/2008,36.67,36.8,36.12,36.3,1858900, , ,1,1
3,A,1/3/2008,36.3,36.35,35.87,35.94,1980100, , ,1,1
1003,AA,1/1/2008,36.55,36.55,36.55,36.55,0, , ,1,1
1004,AA,1/2/2008,36.46,36.78,36,36.13,7801600, , ,1,1
1005,AA,1/3/2008,36.18,36.67,35.74,36.19,7169000, , ,1,1
2005,AAN,4/20/2009,20,20.7,18.2067,18.68,808700, , ,1,1
2006,AAN,4/21/2009,18.7,19.06,18.6533,18.9933,530200, , ,1,1
2007,AAN,4/22/2009,19.2867,19.6267,18.54,19.1333,801100, , ,1,1
2668,AAP,1/1/2008,37.99,37.99,37.99,37.99,0, , ,1,1
2669,AAP,1/2/2008,37.99,38.15,37.17,37.59,1789200, , ,1,1
2670,AAP,1/3/2008,37.58,38.16,37.35,37.95,1584700, , ,1,1
3670,AAR,1/1/2008,22.94,22.94,22.94,22.94,0, , ,1,1
3671,AAR,1/2/2008,23.1,23.38,22.86,23.15,17100, , ,1,1
3672,AAR,1/3/2008,23,23,22,22.16,45600, , ,1,1
6886,ABB,1/1/2008,28.8,28.8,28.8,28.8,0, , ,1,1
6887,ABB,1/2/2008,29,29.11,28.23,28.64,4697700, , ,1,1
6888,ABB,1/3/2008,27.92,28.35,27.79,28.08,5240100, , ,1,1
Upvotes: 0
Views: 1685
Reputation: 104722
In general, you can get the current position of a file object with its tell method. However, it may be difficult to get that to work with your current code, which delegates the file reading to the csv module. It's hard even when reading line by line, since the underlying file object will probably be read in larger chunks than a single line (the readline and readlines methods do some caching in the background to hide this from you).
While I'd skip the whole idea of reading specific bytes, if it's really worthwhile for your program you'll probably need to take charge of the file reading yourself so that you can keep track of exactly where you are in the file at all times. If you do that, tell probably isn't necessary.
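For comparison, here is a minimal sketch of the tell-based approach, assuming Python 3 rather than the Python 2 used in the question. When a file is opened in binary mode and read with readline, Python 3's tell reports the logical byte offset of the next line. The names index_symbols and symbol_col are hypothetical, and the sketch assumes a single header row, rows already grouped by symbol, and no quoted commas or embedded newlines.

```python
def index_symbols(path, symbol_col=1):
    """Map each symbol to the (start, end) byte offsets of its block of rows."""
    index = {}
    current = None   # symbol of the block being scanned
    start = 0        # byte offset where the current block begins
    with open(path, "rb") as f:   # binary mode, so tell() is a plain byte offset
        f.readline()              # assumption: skip a single header row
        while True:
            pos = f.tell()        # offset of the line about to be read
            line = f.readline()
            if not line:          # end of file: close the last block
                if current is not None:
                    index[current] = (start, pos)
                break
            if not line.strip():  # ignore blank lines
                continue
            symbol = line.split(b",")[symbol_col].decode("ascii")
            if symbol != current:
                if current is not None:
                    index[current] = (start, pos)  # previous block ends here
                current, start = symbol, pos
    return index
```

Each end offset is the start of the next symbol's block, so a block's size in bytes is simply end minus start.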
Something like this might work to read a chunk of data and then split it into lines and values while keeping track of how many bytes have been read so far:
def generate_values(f):
    buf = ""  # a buffer of data read from the file
    pos = 0   # the position of our buffer within the file
    while True:  # loop until we return at the end of the file
        new_data = f.read(4096)  # read up to 4k bytes at a time
        if not new_data:  # quit if we got nothing
            if buf:
                yield pos, buf.split(",")  # handle any data after last newline
            return
        buf += new_data
        line_start = 0  # index into buf
        try:
            while True:  # loop until an exception is raised at end of buf
                line_end = buf.index("\n", line_start)  # find end of line
                line = buf[line_start:line_end]  # excludes the newline
                if line:  # skips blank lines
                    yield pos+line_start, line.split(",")  # yield pos,data tuple
                line_start = line_end+1
        except ValueError:  # raised by `index()`
            pass
        # use line_start rather than line_end here: line_end is undefined (or stale)
        # when a chunk contains no newline at all
        pos += line_start
        buf = buf[line_start:]  # keep left over data from end of the buffer
This might need a little tweaking if your file has line endings other than \n, but it shouldn't be too hard.
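Once the start and end byte offsets of a ticker's block are known (for example, from the positions yielded by generate_values), reading the whole block comes down to one seek and one read. The following is a rough sketch, assuming Python 3 and an ASCII file; read_block is a hypothetical helper name, not something from the question.

```python
import csv
import io

def read_block(path, start, end):
    """Return the parsed CSV rows stored between byte offsets start and end."""
    with open(path, "rb") as f:     # binary mode so offsets are exact bytes
        f.seek(start)               # jump straight to the first row of the block
        raw = f.read(end - start)   # read the whole block in one call
    text = raw.decode("ascii")      # assumption: the file is plain ASCII
    # newline="" lets the csv module handle \n and \r\n line endings itself
    return list(csv.reader(io.StringIO(text, newline="")))
```

Here end is exclusive: it is the offset where the next ticker's first row begins (or the file size for the last ticker).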
Upvotes: 1