Schach21

Reputation: 450

Processing data with pandas in a memory-efficient manner in Python

I have to read multiple CSV files and group them by "event_name". I might also have some duplicates, so I need to drop them. paths contains the paths of all the CSV files, and my code is as follows:

import pandas as pd

# read every CSV file and collect the DataFrames
data = []
for path in paths:
    csv_file = pd.read_csv(path)
    data.append(csv_file)

# combine everything and drop duplicate rows
events = pd.concat(data)
events = events.drop_duplicates()

event_names = events.groupby('event_name')

# collect [event_name, number of rows] for each group
ev2 = []

for name, group in event_names:
    a, b = group.shape  # a = number of rows in the group
    ev2.append([name, a])

This code tells me how many unique event_name values there are, and how many entries there are per event_name. It works wonderfully, except that the CSV files are too large and I am having memory problems. Is there a way to do the same using less memory?

I read about using dir() and globals() to delete variables, which I could certainly use, because once I have event_names, I don't need the DataFrame events any longer. However, I am still having those memory issues. My question, more specifically, is: can I read the CSV files in a more memory-efficient way? Or is there something additional I can do to reduce memory usage? I don't mind sacrificing performance, as long as I can read all the CSV files at once instead of going chunk by chunk.

Upvotes: 0

Views: 124

Answers (1)

Bill Huang

Reputation: 4648

Just keep a hash value of each row to reduce the data size: the hash stands in for the full row contents, so duplicates can be dropped and counted using only two small columns.

# inside your loop over paths
csv_file = pd.read_csv(path)

# compute hash (gives a `uint64` value per row)
csv_file["hash"] = pd.util.hash_pandas_object(csv_file)

# keep only the 2 columns relevant to counting
data.append(csv_file[["event_name", "hash"]])

If you cannot risk a hash collision (which would be astronomically unlikely), just repeat the run with another hash key and check that the final counts are identical. The way to change the hash key is as follows.

# compute hash using a different hash key (pandas expects a 16-character key here)
csv_file["hash2"] = pd.util.hash_pandas_object(csv_file, hash_key='stackoverflow123')

Reference: pandas official docs page

Upvotes: 1
