Reputation: 3232
I have a large A.csv file (~5 Gb) with several columns. One of the columns is Model.
There is another large B.csv file (~15 Gb) with Vendor, Name and Model columns.
Two questions:
1) How can I create a result file that combines all columns from A.csv with the corresponding Vendor and Name from B.csv (join on Model)? The trick is: how do I do it when my RAM is only 4 Gb, and I'm using Python?
2) How can I create a sample (say, 1 Gb) result file that combines a random subsample from A.csv (all columns) joined with Vendor and Name from B.csv? The trick is, again, the 4 Gb of RAM.
I know how to do it in pandas, but 4 Gb is a limiting factor I can't overcome.
Upvotes: 0
Views: 969
Reputation: 28606
Here's an idea:
Step 1: Sort the two files by Model. Mergesort would be good for this. Split each file into smaller files small enough to sort in RAM, then after sorting each of them, merge them into one large sorted file. See my answer to an earlier question for a good way to merge multiple already sorted files. Update: See end of my answer for example/code.
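The splitting-and-sorting half of Step 1 might look roughly like this; the chunk size, the `chunk` filename prefix, and the column positions are placeholders to adapt to your files:

```python
import csv

def write_chunk(rows, key, header, prefix, n):
    # Sort one in-memory chunk by the key column and write it out.
    rows.sort(key=lambda r: r[key])
    name = '{}_{}.csv'.format(prefix, n)
    with open(name, 'w', newline='') as out:
        w = csv.writer(out)
        w.writerow(header)
        w.writerows(rows)
    return name

def sort_into_chunks(path, key_col, chunk_size=1_000_000, prefix='chunk'):
    """Split a large CSV into sorted chunk files small enough for RAM."""
    chunk_names = []
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        key = header.index(key_col)
        chunk, n = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                chunk_names.append(write_chunk(chunk, key, header, prefix, n))
                chunk, n = [], n + 1
        if chunk:  # write the final, partially filled chunk
            chunk_names.append(write_chunk(chunk, key, header, prefix, n))
    return chunk_names
```

The chunk files it produces can then be fed to the `heapq.merge`-based merging code at the end of this answer.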
Step 2: Join the two files by Model. Again similar to the merging step of Mergesort, traverse the two sorted files "in parallel", advancing each file appropriately, and join by matching Model values.
Pseudocode for step 2:
open the two sorted files A and B
blockA = read block of same-model rows from A
blockB = read block of same-model rows from B
while True:
    while model of blockA differs from model of blockB:
        if model of blockA is smaller:
            blockA = read block of same-model rows from A
            quit if there isn't any (i.e. end of file reached)
        else:
            blockB = read block of same-model rows from B
            quit if there isn't any (i.e. end of file reached)
    output the cross product of blockA and blockB
    blockA = read block of same-model rows from A
    quit if there isn't any (i.e. end of file reached)
    blockB = read block of same-model rows from B
    quit if there isn't any (i.e. end of file reached)
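Here's a runnable sketch of that join, assuming both inputs are already sorted by Model; the column positions passed to `merge_join` (which column holds Model in each file) are assumptions you'd adjust to your own headers:

```python
import csv
from itertools import groupby, product

def merge_join(a_path, b_path, out_path, a_model_col, b_model_col):
    """Join two CSVs that are sorted by model, streaming one model-block at a time."""
    with open(a_path, newline='') as fa, \
         open(b_path, newline='') as fb, \
         open(out_path, 'w', newline='') as fo:
        ra, rb = csv.reader(fa), csv.reader(fb)
        header_a, header_b = next(ra), next(rb)
        writer = csv.writer(fo)
        # Output A's columns plus B's columns (minus B's duplicate model column).
        writer.writerow(header_a +
                        [h for i, h in enumerate(header_b) if i != b_model_col])
        # groupby yields one (model, rows) block at a time, so only the
        # current block of each file is held in memory.
        blocks_a = groupby(ra, key=lambda r: r[a_model_col])
        blocks_b = groupby(rb, key=lambda r: r[b_model_col])
        try:
            ma, ga = next(blocks_a)
            mb, gb = next(blocks_b)
            while True:
                if ma < mb:
                    ma, ga = next(blocks_a)
                elif mb < ma:
                    mb, gb = next(blocks_b)
                else:
                    # Matching models: emit the cross product of the two blocks.
                    for rowa, rowb in product(list(ga), list(gb)):
                        writer.writerow(rowa +
                                        [c for i, c in enumerate(rowb) if i != b_model_col])
                    ma, ga = next(blocks_a)
                    mb, gb = next(blocks_b)
        except StopIteration:
            pass  # one of the files is exhausted; the join is complete
```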
Another idea:
If there are relatively few Models, then it might be better to separate the rows into files by Model. For example, store rows in files A_Model1.csv, A_Model2.csv, etc. and B_Model1.csv, B_Model2.csv, etc. Then take the cross product of A_Model1.csv and B_Model1.csv, of A_Model2.csv and B_Model2.csv, etc.
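A sketch of that partitioning step, assuming the model value sits in column `model_col`; one file handle is kept open per model, which only works when the number of distinct models is modest:

```python
import csv

def partition_by_model(path, model_col, prefix):
    """Write each row to a per-model CSV file named <prefix>_<model>.csv."""
    outputs = {}  # model -> (file, writer)
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            model = row[model_col]
            if model not in outputs:
                out = open('{}_{}.csv'.format(prefix, model), 'w', newline='')
                writer = csv.writer(out)
                writer.writerow(header)  # repeat the header in each partition
                outputs[model] = (out, writer)
            outputs[model][1].writerow(row)
    for out, _ in outputs.values():
        out.close()
    return sorted(outputs)
```

After partitioning both A.csv and B.csv this way, each pair of per-model files fits the cross-product step described above.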
For question 2, I'd just count the number of rows, use random.sample to pick row numbers, and then fetch those rows.
>>> import random
>>> number_of_rows = 100
>>> number_of_sample_rows = 10
>>> sorted(random.sample(range(number_of_rows), number_of_sample_rows))
[6, 18, 23, 32, 41, 44, 58, 59, 91, 96]
(then go through the file and fetch those rows)
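Counting the rows and then fetching the sampled ones can be done in two streaming passes, for example (the seed parameter is just for reproducibility):

```python
import csv
import random

def sample_rows(path, out_path, number_of_sample_rows, seed=None):
    """Copy a random sample of data rows from path to out_path in two passes."""
    # Pass 1: count the data rows (excluding the header).
    with open(path, newline='') as f:
        number_of_rows = sum(1 for _ in f) - 1
    rng = random.Random(seed)
    wanted = set(rng.sample(range(number_of_rows), number_of_sample_rows))
    # Pass 2: stream through again, keeping only the sampled row numbers.
    with open(path, newline='') as f, open(out_path, 'w', newline='') as out:
        reader = csv.reader(f)
        writer = csv.writer(out)
        writer.writerow(next(reader))  # copy the header
        for i, row in enumerate(reader):
            if i in wanted:
                writer.writerow(row)
```

Only the set of sampled row numbers is held in memory, so the 4 Gb limit isn't a problem.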
Update: Here's code/demo for Step 2 from above. I made three files B1.csv, B2.csv and B3.csv:
Vendor,Name,Model
vfoo,nhi,m1
vbar,nho,m4
vbaz,nhe,m7
Vendor,Name,Model
vZ,nX,m2
vY,nZ,m6
vX,nY,m8
Vendor,Name,Model
v,n3,m3
v,na,m5
v,n_,m9
Here's the merging result file Bmerged.csv:
Vendor,Name,Model
vfoo,nhi,m1
vZ,nX,m2
v,n3,m3
vbar,nho,m4
v,na,m5
vY,nZ,m6
vbaz,nhe,m7
vX,nY,m8
v,n_,m9
And here's the code:
import csv, heapq

filenames = ('B1.csv', 'B2.csv', 'B3.csv')

# Prepare the input streams
files = list(map(open, filenames))
readers = [iter(csv.reader(file)) for file in files]
headers = list(map(next, readers))

def model_and_row(row):
    return row[2], row

model_and_row_streams = [map(model_and_row, reader) for reader in readers]

# Merge them into the output file
with open('Bmerged.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(headers[0])
    for _, row in heapq.merge(*model_and_row_streams):
        writer.writerow(row)

# Close the input files
for file in files:
    file.close()
Note that I'm using Python 3. In Python 2, you'd need to use itertools.imap(model_and_row, reader)
in order to not read the whole files into memory at once.
Upvotes: 2
Reputation: 3550
Read the file line by line in Python. This is a really quick and easy way, for example:
output = open("outputfile.csv", "a")
lines = []
for line in open("file.csv", "r"):
    lines.append(line)
    if len(lines) == 1000000:
        output.writelines(lines)
        del lines[:]
if lines:
    output.writelines(lines)
output.close()
Adjust the buffer length (the 1000000 in the if statement) according to the available RAM.
Upvotes: -1
Reputation: 1060
As @Marc B said, reading one row at a time is the solution. For the join I would do the following (pseudocode: I don't know Python):
1) Read all rows of file A, find the Model field, and collect its distinct values in a list/array/map.
2) Same operation as 1, but for file B, using another list/array/map.
3) Compare the two lists/arrays/maps, keeping only the matching models (they will be part of the join).
4) Reading the rows of file A which match a model, read all the rows of file B which match the same model and write a file C with the join result. Do this for all models.
Note: it's not particularly optimized.
For point 2, just choose a subset of matching models and/or read a part of the rows of file A and/or B with matching models.
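Steps 1-3 above (collecting and intersecting the distinct models) can be sketched like this; the column positions in the commented usage are assumptions about where Model sits in each file:

```python
import csv

def distinct_models(path, model_col):
    """Collect the set of distinct model values in one streaming pass."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return {row[model_col] for row in reader}

# models_a = distinct_models('A.csv', model_col=1)  # Model position in A: assumption
# models_b = distinct_models('B.csv', model_col=2)  # Model position in B: assumption
# matching = models_a & models_b  # only these models appear in the join
```

Only one row plus the set of distinct models is in memory at a time, so this stays within the 4 Gb limit as long as the number of distinct models is reasonable.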
Upvotes: 0