JPFrancoia

Reputation: 5629

Memory error: happening on Linux but not macOS

I have a big pandas dataframe (7 GiB) that I read from a CSV. I need to merge this dataframe with another one that is much, much smaller; let's say its size is negligible.

I'm aware that a merge operation in pandas keeps both input dataframes in memory, plus the merged result. Since I have only 16 GiB of RAM, the merge fails with a memory error when I run it on Linux (my system itself consumes around 3-4 GiB).
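
For illustration, here is a minimal sketch (toy dataframes, all names and sizes made up) showing that the two inputs and the result all live in memory at once; memory_usage(deep=True) reports the real in-RAM size:

import numpy as np
import pandas as pd

# Toy stand-ins for the real dataframes (names and sizes are made up)
big = pd.DataFrame({
    "id_station": np.arange(1_000_000) % 100,
    "concentration": np.random.rand(1_000_000),
})
small = pd.DataFrame({"id_station": np.arange(100), "name": "some station"})

merged = big.merge(small, on="id_station")

# While merge() runs, both inputs and the result coexist in memory
for name, frame in [("big", big), ("small", small), ("merged", merged)]:
    print(name, frame.memory_usage(deep=True).sum() / 2**20, "MiB")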

I also tried to run the merge on a Mac, also with 16 GiB of RAM. There, the system consumes about 3 GiB by default. The merge completed on the Mac, with memory usage going no higher than 10 GiB.

How is this possible? The version of pandas is the same, the dataframe is the same. What is happening here?

Edit:

Here is the code I use to read/merge my files:

import os

import pandas as pd

# POLLUTANTS (list of pollutant names) and col_to_rename (column translation
# mapping) are defined elsewhere in the script (not shown)

# Read the data for the stations, stored in a separate file
stations = pd.read_csv("stations_with_id.csv", index_col=0)
stations = stations.set_index("id_station")

list_data = []

# Merge all pollutants data in one dataframe
# Probably not the most optimized approach ever...
for pollutant in POLLUTANTS:
    path_merged_data_per_pollutant = os.path.join("raw_data", f"{pollutant}_merged")

    print(f"Pollutant: {pollutant}")

    for f in os.listdir(path_merged_data_per_pollutant):

        if ".csv" not in f:
            print(f"passing {f}")
            continue

        print(f"loading {f}")

        df = pd.read_csv(
            os.path.join(path_merged_data_per_pollutant, f),
            sep=";",
            na_values="mq",
            dtype={"concentration": "float64"},
        )

        # Drop useless columns and translate useful ones to English
        # Do that here to limit memory usage
        df = df.rename(index=str, columns=col_to_rename)
        df = df[list(col_to_rename.values())]

        # Date formatted as YYYY-MM
        df["date"] = df["date"].str[:7]

        df = pd.merge(df, stations, left_on="id_station", right_index=True)

        # Filter entries to France only (only the metropolitan area) based on GPS coordinates
        df = df[(df.longitude > -5) & (df.longitude < 12)]

        list_data.append(df)

    print("\n")

data = pd.concat(list_data)

The only column that is not a string is concentration, and I specify its type when reading the CSV. The stations dataframe is < 1 MiB.
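
Note that object (string) columns can take far more RAM than the CSV size on disk suggests. A quick sketch to check the real in-memory footprint (data is the concatenated dataframe from the snippet above):

# True in-RAM size, including the Python string objects behind "object" columns
data.info(memory_usage="deep")

# Per-column breakdown, in MiB
print(data.memory_usage(deep=True) / 2**20)

# Columns with many repeated strings (YYYY-MM dates, station ids) often
# shrink considerably when converted to the "category" dtype
data["date"] = data["date"].astype("category")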

Upvotes: 1

Views: 390

Answers (1)

gilch

Reputation: 11651

macOS has compressed memory since OS X 10.9 (Mavericks). If your dataframe is not literally random data, it won't take up the full 7 GiB in RAM.
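
A rough illustration of why that helps (a sketch using zlib rather than the kernel's actual compressor, with made-up sample bytes): repetitive data, like a dataframe full of repeated station ids and dates, compresses very well, while random data does not.

import zlib

import numpy as np

# Made-up bytes resembling repetitive CSV-like data vs. incompressible noise
repetitive = b"2018-03;FR04012;42.0\n" * 1_000_000
noise = np.random.bytes(len(repetitive))

for label, payload in [("repetitive", repetitive), ("random", noise)]:
    ratio = len(zlib.compress(payload)) / len(payload)
    print(f"{label}: compresses to {ratio:.1%} of its original size")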

There are ways to get compressed memory on Linux as well (zram or zswap, for example), but it isn't necessarily enabled; that depends on your distro and configuration.
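
If you want to check whether your machine uses zram, one quick way (a sketch that assumes a standard procfs layout; zswap works differently and won't show up here) is to look at /proc/swaps:

# zram swap devices show up as /dev/zram0, /dev/zram1, ... in /proc/swaps
with open("/proc/swaps") as f:
    swaps = f.read()

print(swaps)
print("zram is active" if "zram" in swaps else "no zram swap device found")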

Upvotes: 2
