Reputation: 5629
I have a big pandas dataframe (7 GiB) that I read from a CSV. I need to merge this dataframe with another, much smaller one; let's say its size is negligible.
I'm aware that a merge operation in pandas keeps the two input dataframes in memory, plus the merged result. Since I have only 16 GiB of RAM, the merge fails with a memory error when I run it on Linux (the system itself already uses around 3-4 GiB).
I also tried to run the merge on a Mac, also with 16 GiB of RAM. The system uses about 3 GiB by default. The merge completed on the Mac, with memory usage never going above 10 GiB.
How is this possible? The version of pandas is the same, the dataframe is the same. What is happening here?
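For reference, a minimal sketch of how the peak memory use of the process itself could be checked from inside Python, right after the merge (note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS):

import resource
import sys

# Peak resident set size of this process so far; check it right after the merge.
# ru_maxrss is in kilobytes on Linux, but in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
to_bytes = 1 if sys.platform == "darwin" else 1024
print(f"peak RSS: {peak * to_bytes / 2**30:.1f} GiB")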
Edit:
Here is the code I use to read/merge my files:
import os

import pandas as pd

# POLLUTANTS and col_to_rename are defined elsewhere in the script

# Read the data for the stations, stored in a separate file
stations = pd.read_csv("stations_with_id.csv", index_col=0)
stations = stations.set_index("id_station")

list_data = list()
data = pd.DataFrame()

# Merge all pollutant data into one dataframe
# Probably not the most optimized approach ever...
for pollutant in POLLUTANTS:
    path_merged_data_per_pollutant = os.path.join("raw_data", f"{pollutant}_merged")
    print(f"Pollutant: {pollutant}")
    for f in os.listdir(path_merged_data_per_pollutant):
        if ".csv" not in f:
            print(f"passing {f}")
            continue
        print(f"loading {f}")
        df = pd.read_csv(
            os.path.join(path_merged_data_per_pollutant, f),
            sep=";",
            na_values="mq",
            dtype={"concentration": "float64"},
        )
        # Drop useless columns and translate useful ones to English
        # Do that here to limit memory usage
        df = df.rename(index=str, columns=col_to_rename)
        df = df[list(col_to_rename.values())]
        # Date formatted as YYYY-MM
        df["date"] = df["date"].str[:7]
        df = df.set_index("id_station")
        df = pd.merge(df, stations, left_index=True, right_index=True)
        # Filter entries to metropolitan France only, based on GPS coordinates
        df = df[(df.longitude > -5) & (df.longitude < 12)]
        list_data.append(df)
    print("\n")

data = pd.concat(list_data)
The only column that is not a string is concentration, and I specify its type when I read the CSV.
The stations dataframe is < 1 MiB.
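In case it helps, a minimal sketch with a small helper (frame_size_gib, just for illustration) to check the in-memory size of the two frames; memory_usage(deep=True) also counts the Python string objects, which the default call does not:

def frame_size_gib(frame):
    # Illustrative helper: total in-memory size, including object (string) columns
    return frame.memory_usage(deep=True).sum() / 2**30

print(f"data:     {frame_size_gib(data):.2f} GiB")
print(f"stations: {frame_size_gib(stations):.4f} GiB")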
Upvotes: 1
Views: 390
Reputation: 11651
macOS has compressed memory since Mavericks (OS X 10.9). If your dataframe isn't essentially random data, it won't occupy the full 7 GiB of physical RAM.
There are ways to get compressed memory on Linux as well (e.g. zram or zswap), but they aren't necessarily enabled; it depends on your distro and configuration.
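If you want to check whether compressed memory (zswap or zram) is active on a given Linux box, here is a minimal sketch: zswap exposes an enable flag under /sys, and a zram device shows up in /proc/swaps.

from pathlib import Path

# zswap: compressed cache in front of the regular swap device
zswap_flag = Path("/sys/module/zswap/parameters/enabled")
if zswap_flag.exists():
    print("zswap enabled:", zswap_flag.read_text().strip())  # "Y" or "N"
else:
    print("zswap not available on this kernel")

# zram: compressed RAM block device, typically used as swap
swaps = Path("/proc/swaps").read_text()
print("zram swap device present:", "zram" in swaps)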
Upvotes: 2