Reputation: 14614
I wish to do the following as fast as possible with Python: read rows i to j of a csv file.
My first code was a loop (over i to j) of the following:
import csv, itertools

def get_tags(row_number):
    with open('Train.csv', 'rt') as f:
        # skip ahead to the requested row and parse only that one
        row = next(itertools.islice(csv.reader(f), row_number, row_number + 1))
        tags = row[3].decode('utf8')
        return tags
but my code above reads the csv one row at a time, re-opening and re-parsing the file from the start for every requested row, and is slow.
How can I read all the rows I need in one pass and concatenate the tags fast?
Edit for additional information:
the csv file size is 7 GB and I have only 4 GB of RAM, on Windows XP; but I don't need to read all columns (only 1% of the 7 GB would be good, I think).
Upvotes: 1
Views: 8223
Reputation: 9801
sed is designed for the task 'read rows i to j of a csv file'.
If the solution does not have to be pure Python, I think preprocessing the csv file with sed -n 'i,jp', and then parsing the output with Python, would be simple and quick.
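A rough sketch of how that could look from Python, assuming sed is available on the PATH (on Windows you would need a sed port); the range 1000,2000 is just a placeholder for i and j:

import csv
import subprocess

# Let sed print only the wanted lines, then parse that slice with the csv module.
proc = subprocess.run(["sed", "-n", "1000,2000p", "Train.csv"],
                      capture_output=True, text=True, check=True)
tags = [row[3] for row in csv.reader(proc.stdout.splitlines())]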
Upvotes: 1
Reputation: 21461
Since I know which data you are interested in, I can speak from experience:
import csv

with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row[0]  # ID
        row[1]  # title
        row[2]  # body
        row[3]  # tags
For each row you can of course select anything you want and store it as you like.
By using an iterator variable, you can decide which rows to collect:
import csv

with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    linenum = 0
    tags = []  # you can preallocate memory to this list if you want, though
    for row in reader:
        if linenum > 1000 and linenum < 2000:
            tags.append(row[3])  # tags
        if linenum == 2000:
            break  # so it won't read the next 3 million rows
        linenum += 1
The other good thing about this is that it uses very little memory, since the file is read in line by line.
As mentioned, if you want the later rows, the reader still has to parse all the data up to that point (this is inevitable, since there are newlines inside the text fields, so you can't skip ahead to a certain row). Personally, I just roughly used Linux's split to split the file into chunks, and then edited them, making sure each chunk starts at an ID (and ends with a tag).
Then I used:
import pandas
train = pandas.read_csv(file, quotechar="\"")
to quickly read in the split files.
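For completeness, a hedged sketch of that last step; the chunk file name 'train_chunk_aa' and the column names are my assumptions, not the real header of Train.csv:

import pandas

# 'train_chunk_aa' stands for one piece produced by split, already trimmed so
# it starts at an ID and ends with a tag.
chunk = pandas.read_csv('train_chunk_aa', quotechar='"',
                        names=['Id', 'Title', 'Body', 'Tags'])
tags = chunk['Tags'].tolist()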
Upvotes: 2
Reputation: 15722
Your question does not contain enough information, probably because you don't see some existing complexity: most CSV files contain one record per line. In that case it's simple to skip the rows you're not interested in. But in CSV, records can span multiple lines, so a general solution (like the CSV reader from the standard library) has to parse every record just to skip lines. It's up to you to decide which optimization is acceptable in your use case.
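A small made-up example of why line counting is not enough once fields contain newlines:

import csv
import io

# Three physical lines, but only two CSV records, because the quoted field of
# the second record contains an embedded newline.
data = 'id,body\n1,"first line\nsecond line"\n'
rows = list(csv.reader(io.StringIO(data)))
print(len(data.splitlines()), len(rows))  # 3 physical lines, 2 records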
The next problem is that you don't know which part of the code you posted is too slow. Measure it. Your code will never run faster than the time needed to read the file from disk. Have you checked that? Or have you just guessed which part is slow?
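A rough sketch of the kind of measurement meant here: time the raw disk read by itself, then the read plus CSV parsing, to see where the time actually goes (file name and block size are placeholders):

import csv
import time

start = time.time()
with open('Train.csv', 'rb') as f:
    while f.read(1 << 20):   # read raw bytes in 1 MiB blocks, no parsing
        pass
print('raw read:', time.time() - start, 'seconds')

start = time.time()
with open('Train.csv', 'rt') as f:
    for row in csv.reader(f):  # same file, but now parsed as CSV
        pass
print('read + parse:', time.time() - start, 'seconds')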
If you want to do fast transformations of CSV data that fits into memory, I would propose using/learning Pandas. So it would probably be a good idea to split your code into two steps: first cut the file down to the rows and columns you actually need, then do the real transformations on that reduced data with Pandas.
Upvotes: 1
Reputation: 114579
If the file is not HUGE (hundreds of megabytes) and you actually need to read a lot of rows, then probably just
tags = " ".join(x.split("\t")[3]
for x in open("Train.csv").readlines()[from_row:to_row+1])
is going to be the fastest way.
If the file is instead very big, the only thing you can do is iterate over all lines, because CSV unfortunately uses (in general) variable-sized records.
If by chance the specific CSV uses a fixed-size record format (not uncommon for large files) then directly seeking into the file may be an option.
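If that lucky case applies, a hypothetical sketch (RECORD_SIZE is an assumed fixed record length in bytes, including the line terminator; real CSV files rarely guarantee this):

RECORD_SIZE = 256  # assumed: every record is exactly this many bytes

def read_record(f, n):
    # jump straight to the n-th record instead of scanning from the start
    f.seek(n * RECORD_SIZE)
    return f.read(RECORD_SIZE)

with open('Train.csv', 'rb') as f:
    print(read_record(f, 1000))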
If the file uses variable-sized records and the search must be done several times with different ranges, then creating a simple external index just once (e.g. line -> file offset for all line numbers that are a multiple of 1000) can be a good idea.
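A minimal sketch of such an index, sampled every 1000 physical lines (so it ignores the multi-line-record issue discussed above):

# Build the index once: byte offset of every 1000th physical line.
index = {}
with open('Train.csv', 'rb') as f:
    lineno = 0
    while True:
        offset = f.tell()
        if not f.readline():
            break
        if lineno % 1000 == 0:
            index[lineno] = offset
        lineno += 1

# Later, to reach line n: seek to the nearest indexed line at or below n
# and read forward only the remaining (at most 999) lines.
with open('Train.csv', 'rb') as f:
    n = 123456
    f.seek(index[n // 1000 * 1000])
    for _ in range(n % 1000):
        f.readline()
    print(f.readline())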
Upvotes: 1