Reputation: 1302
I have to replace values in a large CSV file and decided to use Python for the job.
The value I need to change is the first one on each line of my comma-separated file:
ToReplace, a1, a2, ..., aN
1, ab, cd, ..., xy
80, ka, kl, ..., df
It's always a number, though the number of digits isn't fixed.
I've got two ideas at the moment: Process the data line by line and ...
As I'm very new to Python, some questions came to mind:
Upvotes: 0
Views: 817
Reputation: 3923
You can pass a second argument to Python's split method so that it splits only at the first separator, replace the first piece with whatever you want, then join the pieces back into a single string, like this:
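import logging

# Binary mode keeps line endings untouched; the separator and the
# replacement are bytes literals so this also runs on Python 3.
with open('example.csv', 'rb') as infile, \
        open('result.csv', 'wb') as outfile:
    for line in infile:
        try:
            # Split only at the first comma: [first field, rest of line]
            number, rest = line.split(b',', 1)
            number = b'blob'
            outfile.write(b','.join([number, rest]))
        except ValueError:
            logging.error('The following line had no separator: %s', line)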
For 10 million rows, on a machine with 2 cores at 2.4 GHz and 8 GB of RAM, I get the following timings:
$ time python example.py
real 0m20.771s
user 0m20.336s
sys 0m0.369s
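To see what the second argument to split does, here is a small sketch on in-memory strings. The helper name replace_first_field and the sample lines are made up for illustration. With the split limit set to 1, split returns at most two pieces, so a line without a comma yields a single piece and the two-name unpacking raises ValueError, which is what the except branch above catches.

```python
def replace_first_field(line, new_value):
    """Replace everything before the first comma in `line` with `new_value`.

    Hypothetical helper for illustration; raises ValueError if the line
    contains no comma.
    """
    first, rest = line.split(',', 1)  # split limit of 1: at most two pieces
    return ','.join([new_value, rest])


print('80, ka, kl, df\n'.split(',', 1))  # ['80', ' ka, kl, df\n']
print(repr(replace_first_field('80, ka, kl, df\n', 'blob')))  # 'blob, ka, kl, df\n'

try:
    replace_first_field('no separator here\n', 'blob')
except ValueError as exc:
    print('ValueError:', exc)
```

Everything after the first comma, including the trailing newline, stays in the second piece, so the rest of the line is written back byte for byte.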
Upvotes: 0
Reputation: 414149
If you want to replace the first column, which always contains a number, you could use a string method instead of the more general csv module, to avoid parsing the whole line:
#!/usr/bin/env python
def main():
    with open('50gb_file', 'rb') as file, open('output', 'wb') as output_file:
        for line in file:
            # Split at the first comma only; sep is b'' if there is none
            number, sep, rest = line.partition(b',')
            try:
                number = int(number) * 2  # XXX replace number here
            except ValueError:
                pass  # don't replace the number
            else:
                # bytes(number) would give `number` zero bytes on Python 3
                line = str(number).encode('ascii') + sep + rest
            output_file.write(line)

main()
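A quick way to check this loop without a 50 GB file is to run the same logic on in-memory byte streams. The sketch below is only an illustration: the function name double_first_column and the use of io.BytesIO are assumptions of mine, not part of the answer. It shows that the header line, whose first field is not a number, is copied through unchanged, while numeric first fields are doubled.

```python
import io


def double_first_column(src, dst):
    """Copy lines from src to dst, doubling a leading integer field if present."""
    for line in src:
        number, sep, rest = line.partition(b',')
        try:
            number = int(number) * 2
        except ValueError:
            pass  # header or non-numeric first field: copy the line unchanged
        else:
            line = str(number).encode('ascii') + sep + rest
        dst.write(line)


# Sample data in the layout shown in the question (made up for this sketch).
src = io.BytesIO(b'ToReplace, a1, a2\n1, ab, cd\n80, ka, kl\n')
dst = io.BytesIO()
double_first_column(src, dst)
print(dst.getvalue().decode('ascii'))
```

One edge case worth knowing: a line that holds only a number and no comma would lose its trailing newline here, because the newline ends up inside number and int() silently strips it.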
Upvotes: 2