Reputation: 450
I have to read multiple csv files and group them by "event_name". I also might have some duplicates, so I need to drop them. paths contains all the paths of the csv files, and my code is as follows:
import pandas as pd

data = []
for path in paths:
    csv_file = pd.read_csv(path)
    data.append(csv_file)

# one big DataFrame, then drop full-row duplicates
events = pd.concat(data)
events = events.drop_duplicates()

event_names = events.groupby('event_name')
ev2 = []
for name, group in event_names:
    a, b = group.shape  # a = number of rows for this event_name
    ev2.append([name, a])
This code is going to tell me how many unique event_name values there are, and how many entries there are per event_name. It works wonderfully, except that the csv files are too large and I am having memory problems. Is there a way to do the same using less memory?
I read about using dir() and globals() to delete variables, which I could certainly use, because once I have event_names, I don't need the DataFrame events any longer. However, I am still having those memory issues. My question, more specifically, is: can I read the csv files in a more memory-efficient way? Or is there something additional I can do to reduce memory usage? I don't mind sacrificing performance, as long as I can read all the csv files at once instead of going chunk by chunk.
Upvotes: 0
Views: 124
Reputation: 4648
Just keep a hash value of each row to reduce the data size.
csv_file = pd.read_csv(path)
# compute hash (gives a `uint64` value per row)
csv_file["hash"] = pd.util.hash_pandas_object(csv_file)
# keep only the 2 columns relevant to counting
data.append(csv_file[["event_name", "hash"]])
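For context, here is a rough sketch of how this could plug into the loop from the question (assuming `paths` is the same list of file paths as above; passing `index=False` is an assumption added here so that identical rows hash to the same value regardless of their row position in each file):

import pandas as pd

data = []
for path in paths:
    csv_file = pd.read_csv(path)
    # one uint64 per row, computed over the row values only (index excluded)
    csv_file["hash"] = pd.util.hash_pandas_object(csv_file, index=False)
    # keep only the two columns needed for deduplication and counting
    data.append(csv_file[["event_name", "hash"]])

events = pd.concat(data, ignore_index=True)
events = events.drop_duplicates()
# entries per event_name after removing duplicate rows
counts = events.groupby("event_name").size()

Since only the "event_name" and "hash" columns are kept per file, the concatenated frame stays small even if the original csv files are wide.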
If you cannot risk a hash collision (which would be astronomically unlikely), just compute a second hash with another hash key and check that the final counting results are identical. The way to change the hash key is as follows.
# compute hash using a different hash key
# (the key must encode to exactly 16 bytes when string columns are hashed)
csv_file["hash2"] = pd.util.hash_pandas_object(csv_file, hash_key='stackoverflow123')
Reference: pandas official docs page
Upvotes: 1