illuminato

Reputation: 1247

How to read and process a file in Python that is too big for memory?

I have a CSV file that looks like this:

1,2,0.2
1,3,0.4
2,1,0.5
2,3,0.8
3,1,0.1
3,2,0.6

The first column corresponds to user_a, the second to user_b, and the third to score. For every user_a, I want to find the user_b value that maximizes the score. For this example, the output should look like this (a dictionary is preferable but not required):

1 3 0.4
2 3 0.8
3 2 0.6

The problem is that the file is very big (millions of rows) and I'm trying to find a way to do this without running out of memory. Because of the environment setup I cannot use Pandas, Dask, or other dataframe packages.

I used a generator (yield) to limit the memory needed while reading the file, but I still get an out-of-memory error. Any advice on how to reduce memory consumption would be highly appreciated.
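
A rough sketch of the kind of generator I'm using to read the file (simplified; not my exact code):

import csv

def read_rows(path):
    # generator: yields one parsed row at a time instead of loading the whole file into memory
    with open(path, newline="") as f:
        for user_a, user_b, score in csv.reader(f):
            yield user_a, user_b, float(score)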

Upvotes: 1

Views: 662

Answers (2)

westandskif

Reputation: 982

Give convtools a try (I must confess: I'm the author).

from convtools import conversion as c
from convtools.contrib.tables import Table

# creates an iterable of rows as dicts
# consider wrapping with tqdm to see the progress
rows = (
    Table.from_csv("input.csv", header=["user_a", "user_b", "score"])
    .update(score=c.col("score").as_type(float))
    .into_iter_rows(dict)
)

# gen_converter writes & compiles ad-hoc python function
converter = (
    c.group_by(c.item("user_a"))
    .aggregate(
        {
            "user_a": c.item("user_a"),
            "user_b": c.ReduceFuncs.MaxRow(c.item("score")).item("user_b"),
            "score": c.ReduceFuncs.MaxRow(c.item("score")).item("score"),
        }
    )
    .gen_converter()
)
Table.from_rows(converter(rows)).into_csv("output.csv", include_header=False)

On my M1 Mac (Python 3.9), with tqdm and itertools.cycle overhead, it processes 5,894,860 rows/s. The result is:

1,3,0.4
2,3,0.8
3,2,0.6

However, since dictionary output is preferable (and this time let's work with tuples instead of dicts, just as an example):

rows = (
    Table.from_csv("input.csv", header=["user_a", "user_b", "score"])
    .update(score=c.col("score").as_type(float))
    .into_iter_rows(tuple)
)
converter = (
    c.group_by(c.item(0))
    .aggregate(
        (
            c.item(0),
            {
                "user_b": c.ReduceFuncs.MaxRow(c.item(2)).item(1),
                "score": c.ReduceFuncs.MaxRow(c.item(2)).item(2),
            },
        )
    )
    .as_type(dict)
    .gen_converter()
)
result = converter(rows)

This operates at 6,050,825 rows/s and the result is:

In [32]: result
Out[32]:
{'1': {'user_b': '3', 'score': 0.4},
 '2': {'user_b': '3', 'score': 0.8},
 '3': {'user_b': '2', 'score': 0.6}}

If you still run out of memory -- GNU to the rescue:

The command below assumes there are no commas inside the column values.

  • sort: reverse numeric sort by the 1st column, then by the 3rd column, with , as the field separator
  • awk: keep only the first (i.e. highest-scoring) row seen for each value of the 1st column

sort -t ',' -k1,1nr -k3,3nr input.csv | awk -F ',' -vOFS=, '!($1 in seen) {print $0; seen[$1]}'

3,2,0.6
2,3,0.8
1,3,0.4

Upvotes: 1

Timus

Reputation: 11321

In a comment you say, "Yes, correct, the real data is ordered pairs and sorted." So why not just do the following:

import csv
from itertools import groupby
from operator import itemgetter

def max_key(row): return float(row[2])

def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    with open(file_name, "r") as file:
        reader = csv.reader(file, delimiter=sep)
        # groupby works here because the rows are already sorted by user_a
        for user_a, rows in groupby(reader, key=itemgetter(0)):
            # keep the row with the highest score for this user
            _, user_b, score = max(rows, key=max_key)
            result["User_A"].append(user_a)
            result["User_B"].append(user_b)
            result["Score"].append(score)
    return result
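
Called on the sample data from the question (assuming it is saved as input.csv), this returns the scores as strings:

result = find_max_score("input.csv")
# {'User_A': ['1', '2', '3'],
#  'User_B': ['3', '3', '2'],
#  'Score': ['0.4', '0.8', '0.6']}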

Upvotes: 1
