night_bat

Reputation: 3232

Concatenate large files in an SQL-like way with limited RAM

I have a large A.csv file (~5 GB) with several columns. One of the columns is Model. There is another large B.csv file (~15 GB) with Vendor, Name and Model columns.

Two questions:

1) How can I create a result file that combines all columns from A.csv with the corresponding Vendor and Name from B.csv (join on Model)? The trick is how to do it when I have only 4 GB of RAM, and I'm using Python.

2) How can I create a sample (say, 1 GB) result file that combines a random subsample from A.csv (all columns) joined with Vendor and Name from B.csv? The trick is, again, the 4 GB of RAM.

I know how to do it in pandas, but 4 GB is a limiting factor I can't overcome.

Upvotes: 0

Views: 969

Answers (3)

Stefan Pochmann

Reputation: 28606

Here's an idea:

Step 1: Sort the two files by Model. Mergesort is a good fit: split each file into smaller files that are small enough to sort in RAM, sort each of them, then merge them into one large sorted file. See my answer to an earlier question for a good way to merge multiple already sorted files. Update: See the end of my answer for example code.
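A minimal sketch of the splitting-and-sorting half of Step 1 (the merging half is the heapq.merge code at the end of this answer). The Model column index and the chunk size are assumptions to adapt to your files and your RAM:

import csv

def split_into_sorted_chunks(filename, key_index, chunk_size=1000000):
    """Split a big CSV into chunk files, each sorted by the key column."""
    chunk_names = []
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            # Read at most chunk_size rows into memory
            chunk = [row for _, row in zip(range(chunk_size), reader)]
            if not chunk:
                break
            chunk.sort(key=lambda row: row[key_index])
            name = '{}.sorted{}'.format(filename, len(chunk_names))
            with open(name, 'w', newline='') as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(chunk)
            chunk_names.append(name)
    return chunk_names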

Step 2: Join the two files by Model. Again similar to the merging step of Mergesort, traverse the two sorted files "in parallel", advancing each file appropriately, and join by matching Model values.

Pseudocode for step 2:

open the two sorted files A and B
blockA = read block of same-model rows from A
blockB = read block of same-model rows from B
while True:
    while model of blockA differs from model of blockB:
        if model of blockA is smaller:
            blockA = read block of same-model rows from A
            quit if there isn't any (i.e. end of file reached)
        else:
            blockB = read block of same-model rows from B
            quit if there isn't any (i.e. end of file reached)
    output the cross product of blockA and blockB
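In Python, the "read block of same-model rows" parts map nicely onto itertools.groupby, which yields exactly those runs of consecutive equal-Model rows. Here's a sketch under a few assumptions: both files are already sorted by Model, the Model column indices are passed in (I only know B's layout, Vendor,Name,Model), and each cross product of two matching blocks fits in RAM:

import csv
from itertools import groupby, product
from operator import itemgetter

def merge_join(a_name, b_name, out_name, a_key, b_key=2):
    """Join two CSV files that are each already sorted by Model."""
    with open(a_name, newline='') as fa, \
         open(b_name, newline='') as fb, \
         open(out_name, 'w', newline='') as fout:
        a_reader, b_reader = csv.reader(fa), csv.reader(fb)
        a_header, b_header = next(a_reader), next(b_reader)
        writer = csv.writer(fout)
        # All columns of A, plus Vendor and Name from B
        writer.writerow(a_header + b_header[:2])
        # groupby yields "blocks" of consecutive rows with the same Model
        a_blocks = groupby(a_reader, key=itemgetter(a_key))
        b_blocks = groupby(b_reader, key=itemgetter(b_key))
        try:
            a_model, a_block = next(a_blocks)
            b_model, b_block = next(b_blocks)
            while True:
                if a_model < b_model:
                    a_model, a_block = next(a_blocks)
                elif b_model < a_model:
                    b_model, b_block = next(b_blocks)
                else:
                    # Matching Model: output the cross product of the blocks
                    # (product buffers both blocks, hence the RAM assumption)
                    for a_row, b_row in product(a_block, b_block):
                        writer.writerow(a_row + b_row[:2])
                    a_model, a_block = next(a_blocks)
                    b_model, b_block = next(b_blocks)
        except StopIteration:
            pass  # one file is exhausted, no more matches possible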

Another idea:

If there are relatively few Models, then it might be better to separate the rows into files by Model. For example, store rows in files A_Model1.csv, A_Model2.csv, etc and B_Model1.csv, B_Model2.csv, etc. Then take the cross product of A_Model1.csv and B_Model1.csv, of A_Model2.csv and B_Model2.csv, etc.
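A sketch of that separation, done in one pass with one open output file per Model. It assumes there are few enough Models to stay under the OS open-file limit and that the Model values are safe to use in file names:

import csv

def partition_by_model(filename, key_index, prefix):
    """Split a CSV into one file per Model value, in a single pass."""
    outputs = {}  # model -> (file object, csv writer)
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            model = row[key_index]
            if model not in outputs:
                out = open('{}_{}.csv'.format(prefix, model), 'w', newline='')
                writer = csv.writer(out)
                writer.writerow(header)
                outputs[model] = (out, writer)
            outputs[model][1].writerow(row)
    for out, _ in outputs.values():
        out.close()

For example, partition_by_model('B.csv', 2, 'B') would write B_m1.csv, B_m2.csv and so on.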


For question 2, I'd just count the number of rows, use random.sample to pick row numbers, and then fetch those rows.

>>> import random
>>> number_of_rows = 100
>>> number_of_sample_rows = 10
>>> sorted(random.sample(range(number_of_rows), number_of_sample_rows))
[6, 18, 23, 32, 41, 44, 58, 59, 91, 96]

(then go through the file and fetch those rows)
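For example, in two passes, counting with csv.reader so that quoted embedded newlines don't skew the row numbering:

import csv, random

def sample_rows(filename, sample_size):
    """Pick sample_size random data rows in two passes over the file."""
    # Pass 1: count the data rows (excluding the header)
    with open(filename, newline='') as f:
        number_of_rows = sum(1 for _ in csv.reader(f)) - 1
    wanted = set(random.sample(range(number_of_rows), sample_size))
    # Pass 2: stream the file again, keeping only the sampled rows
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, [row for i, row in enumerate(reader) if i in wanted]

Only the sampled rows are held in memory; for a sample as large as 1 GB you'd write each sampled row out immediately instead of collecting them in a list.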

Update: Here's code/demo for the merging part of Step 1 above. I made three files B1.csv, B2.csv and B3.csv:

Vendor,Name,Model
vfoo,nhi,m1
vbar,nho,m4
vbaz,nhe,m7

Vendor,Name,Model
vZ,nX,m2
vY,nZ,m6
vX,nY,m8

Vendor,Name,Model
v,n3,m3
v,na,m5
v,n_,m9

Here's the merged result file Bmerged.csv:

Vendor,Name,Model
vfoo,nhi,m1
vZ,nX,m2
v,n3,m3
vbar,nho,m4
v,na,m5
vY,nZ,m6
vbaz,nhe,m7
vX,nY,m8
v,n_,m9

And here's the code:

import csv, heapq

filenames = ('B1.csv', 'B2.csv', 'B3.csv')

# Prepare the input streams (newline='' is recommended when reading with csv)
files = [open(name, newline='') for name in filenames]
readers = [iter(csv.reader(file)) for file in files]
headers = list(map(next, readers))

def model_and_row(row):
    # Pair each row with its Model value (column 2) so that
    # heapq.merge orders the rows by Model
    return row[2], row

model_and_row_streams = [map(model_and_row, reader) for reader in readers]

# Merge them into the output file
with open('Bmerged.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(headers[0])
    for _, row in heapq.merge(*model_and_row_streams):
        writer.writerow(row)

# Close the input files
for file in files:
    file.close()

Note that I'm using Python 3. In Python 2, you'd need itertools.imap(model_and_row, reader) so that the whole files aren't read into memory at once.

Upvotes: 2

rassa45

Reputation: 3550

Read the file line by line in Python. Buffering the lines and writing them out in chunks is a quick and easy way, for example:

# Copy file.csv to outputfile.csv in chunks of one million lines,
# so at most one chunk is held in RAM at a time
with open("file.csv") as source, open("outputfile.csv", "a") as output:
    lines = []
    for line in source:
        lines.append(line)
        if len(lines) == 1000000:
            output.writelines(lines)
            del lines[:]
    if lines:  # write whatever is left in the buffer
        output.writelines(lines)

Adjust the length of the buffer list in the if statement according to the available RAM.

Upvotes: -1

Fil

Reputation: 1060

As @Marc B said, reading one row at a time is the solution. For the join I would do the following (pseudocode, as I don't know Python).

  1. "Select distinct Model from A" on first file A.csv

Read all rows, search for Model field and collect distinct values in a list/array/map

  1. "Select distinct Model from B" on second file B.csv

Same operation as 1, but using another list/array/map

  1. Find matching models

Compare the two lists/arrays/maps finding only matching models (they will be part of the join)

  1. Do the join

Reading rows of file A which match model, read all the rows of file B which match same model and write a file C with join result. To this for all models.

Note: it's not particularly optimized.
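A rough Python translation of those steps, with the same caveat. It assumes Model is the first column of A.csv and the third of B.csv (as in the other answer), that the set of distinct models fits in RAM, and that one model's block of B rows fits in RAM:

import csv

def distinct_models(filename, key_index):
    """Steps 1 and 2: collect the distinct Model values of a file."""
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header
        return {row[key_index] for row in reader}

def rows_for_model(filename, key_index, model):
    """Stream only the rows whose Model column equals model."""
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header
        for row in reader:
            if row[key_index] == model:
                yield row

# Step 3: only models present in both files take part in the join
matching = distinct_models('A.csv', 0) & distinct_models('B.csv', 2)

# Step 4: for each matching model, write the join result to C.csv
with open('C.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for model in matching:
        b_rows = list(rows_for_model('B.csv', 2, model))  # one block in RAM
        for a_row in rows_for_model('A.csv', 0, model):
            for b_row in b_rows:
                writer.writerow(a_row + b_row[:2])  # A columns + Vendor, Name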

For point 2, just choose a subset of the matching models and/or read only a part of the rows of file A and/or B with matching models.

Upvotes: 0
