Reputation: 1247
I have a CSV file that looks like this:
1,2,0.2
1,3,0.4
2,1,0.5
2,3,0.8
3,1,0.1
3,2,0.6
The first column corresponds to user_a, the second to user_b, and the third to score. For every user_a, I want to find the user_b value that maximizes score. For this example the output should look like the following (output in the form of a dictionary is preferable but not required):
1 3 0.4
2 3 0.8
3 2 0.6
The problem is that the file is very big (millions of rows), and I am trying to find a way to do this without running into an out-of-memory error. Because of my environment setup I cannot use Pandas, Dask, or other dataframe packages.
I used a generator (yield) to limit the memory needed for the computation, but I still get an out-of-memory error. Any advice on how to reduce memory consumption would be highly appreciated.
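To show what I mean by the yield approach, the reading part looks roughly like this (a simplified sketch with placeholder names, not my exact code):

import csv

def iter_rows(path):
    # lazily yield one parsed row at a time instead of loading the whole file
    with open(path, newline="") as f:
        for user_a, user_b, score in csv.reader(f):
            yield user_a, user_b, float(score)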
Upvotes: 1
Views: 662
Reputation: 982
Give convtools a try (I must confess - I'm the author)
from convtools import conversion as c
from convtools.contrib.tables import Table
# creates an iterable of rows as dicts
# consider wrapping with tqdm to see the progress
rows = (
    Table.from_csv("input.csv", header=["user_a", "user_b", "score"])
    .update(score=c.col("score").as_type(float))
    .into_iter_rows(dict)
)
# gen_converter writes & compiles ad-hoc python function
converter = (
    c.group_by(c.item("user_a"))
    .aggregate(
        {
            "user_a": c.item("user_a"),
            "user_b": c.ReduceFuncs.MaxRow(c.item("score")).item("user_b"),
            "score": c.ReduceFuncs.MaxRow(c.item("score")).item("score"),
        }
    )
    .gen_converter()
)
Table.from_rows(converter(rows)).into_csv("output.csv", include_header=False)
On my M1 Mac (Python 3.9), with tqdm and itertools.cycle overhead, it processes 5894860 rows/s. The result is:
1,3,0.4
2,3,0.8
3,2,0.6
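To run a benchmark like this yourself, you can cycle the sample rows to simulate a large input; the following is just a sketch (N is an arbitrary row count, and the exact benchmark code isn't shown here):

import time
from itertools import cycle, islice

# rows is the iterable built above (rebuild it if it has already been consumed)
N = 10_000_000  # arbitrary number of rows to simulate
big_input = islice(cycle(list(rows)), N)

start = time.monotonic()
result = converter(big_input)
print(N / (time.monotonic() - start), "rows/s")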
However, since dictionary output is preferable, here is that variant (and, just as an example, let's work with tuples instead of dicts this time):
rows = (
    Table.from_csv("input.csv", header=["user_a", "user_b", "score"])
    .update(score=c.col("score").as_type(float))
    .into_iter_rows(tuple)
)
converter = (
    c.group_by(c.item(0))
    .aggregate(
        (
            c.item(0),
            {
                "user_b": c.ReduceFuncs.MaxRow(c.item(2)).item(1),
                "score": c.ReduceFuncs.MaxRow(c.item(2)).item(2),
            },
        )
    )
    .as_type(dict)
    .gen_converter()
)
result = converter(rows)
This operates at 6050825 rows/s and the result is:
In [32]: result
Out[32]:
{'1': {'user_b': '3', 'score': 0.4},
'2': {'user_b': '3', 'score': 0.8},
'3': {'user_b': '2', 'score': 0.6}}
If you still run out of memory -- GNU tools to the rescue. The below assumes there are no commas inside the columns.
sort: numeric reverse sort by the 1st and 3rd columns, with , as the separator
awk: keep only the first row seen for each value of the first column

sort -t ',' -k 1,3 -n -r input.csv | awk -F ',' -vOFS=, '!($1 in seen) {print $0; seen[$1]}'
3,2,0.6
2,3,0.8
1,3,0.4
Upvotes: 1
Reputation: 11321
In a comment you said "Yes, correct, the real data is ordered pairs and sorted". So why can't you just do the following?
import csv
from itertools import groupby
from operator import itemgetter
def max_key(row):
    return float(row[2])


def find_max_score(file_name, sep=","):
    result = {"User_A": [], "User_B": [], "Score": []}
    with open(file_name, "r") as file:
        reader = csv.reader(file, delimiter=sep)
        # the file is already sorted by user_a, so groupby sees each user's rows together
        for user_a, rows in groupby(reader, key=itemgetter(0)):
            _, user_b, score = max(rows, key=max_key)
            result["User_A"].append(user_a)
            result["User_B"].append(user_b)
            result["Score"].append(score)
    return result
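For the sample file above (already sorted by the first column), calling it gives, assuming the data is saved as input.csv:

result = find_max_score("input.csv")
# {'User_A': ['1', '2', '3'],
#  'User_B': ['3', '3', '2'],
#  'Score': ['0.4', '0.8', '0.6']}

Note that user_b and score stay as strings, since the rows are only converted to float for the comparison.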
Upvotes: 1