Max

Reputation: 51

pd.to_csv saves, but apparently the wrong data (according to print function)

This is my first post on Stack Overflow. Please go easy on me if I don't follow the usual style guide correctly.

I am working on the Kaggle challenge "predict house_prices". My first step is to preprocess the dataset. Some cells in the data are empty ("NaN"). With df["Headline"].fillna("NA") I replace them with "NA", which in this challenge is defined as "not further described".

The print function shows that this approach works. At the end, I want to save my modified DataFrame to a .csv file (you can see the path and filename in the code). However, while the .csv file is indeed saved, the data in it is apparently wrong. So I guess I must have made a mistake with the syntax of to_csv.

First, here's my code. Below it you will find what the console prints for the modified dataframe "maindf" and for the dataframe read back from my .csv file, "csvdf". Sorry for the poor formatting of the console output, by the way.

import os
import pandas as pd
import numpy as np

#Variables
PRICE = []
CRIT = []

#Directories
DATADIR = r"C:\Users\Hp\Desktop\Project_Arcus\house_price\data"
DATA = "train.csv"
path = os.path.join(DATADIR, DATA)
MODFILE = "train_modified.csv"
mod_path = os.path.join(DATADIR, MODFILE)

print(f"Training Data is {path}")
print(f"Modified Training Data is {mod_path}")

# Goal: Open the document of the chosen path. Extract data (f. e. the headline)
df = pd.read_csv(path)
maindf = df # this step is unnecessary, but it helped me to better understand.

# Goal: Check for empty cells. Replace them with a fitting value, so the neural network can
# treat them accordingly. Save the .csv under a new name.
maindf["PoolQC"] = df["PoolQC"].fillna("NA")
maindf["MiscFeature"] = df["MiscFeature"].fillna("NA")
maindf["Alley"] = df["Alley"].fillna("NA")
maindf["Fence"] = df["Fence"].fillna("NA")
maindf["FireplaceQu"] = df["FireplaceQu"].fillna("NA")
maindf.to_csv(mod_path,index=True) # index=True writes the row index as the first column; index=False would omit it.

# Next Goal: Save the dataframe df into a csv document "train_modified.csv"  WORKS
# Check if the new file is correct.                                     Not correct! NaN included...!

#print(df.isnull().sum())
csvdf = pd.read_csv(mod_path)
#print(csvdf.isnull().sum())
print(maindf["PoolQC"].head(10))
print(csvdf["PoolQC"].head(10))

Training Data is C:\Users\Hp\Desktop\Project_Arcus\house_price\data\train.csv
Modified Training Data is C:\Users\Hp\Desktop\Project_Arcus\house_price\data\train_modified.csv
0    NA
1    NA
2    NA
3    NA
4    NA
5    NA
6    NA
7    NA
8    NA
9    NA
Name: PoolQC, dtype: object
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
Name: PoolQC, dtype: object

Upvotes: 3

Views: 474

Answers (1)

jpp

Reputation: 164843

The issue isn't with to_csv but with read_csv, whose documentation states:

na_values : scalar, str, list-like, or dict, default None

By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

To keep "NA" as a plain string, set the keep_default_na and na_values arguments when you call read_csv:

csvdf = pd.read_csv(mod_path, keep_default_na=False, na_values='')
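To see the effect without touching any files, here is a minimal sketch of the round trip. It assumes a tiny made-up frame in place of the Kaggle data and uses an in-memory buffer instead of mod_path; the names csv_text, default_df and kept_df are only for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the Kaggle training data.
df = pd.DataFrame({"Id": [1, 2], "PoolQC": [None, "Gd"]})
df["PoolQC"] = df["PoolQC"].fillna("NA")

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)

# Default parsing: the literal string "NA" is turned back into NaN.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["PoolQC"].isna().tolist())

# Only empty fields count as missing here, so "NA" survives as a string.
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values="")
print(kept_df["PoolQC"].tolist())
```

The first print should report the "NA" row as missing, while the second should show it as an ordinary string, which matches what you observed with maindf and csvdf.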

You may wish to supply a list of values for na_values: if used with keep_default_na=False, Pandas will consider only those values as NaN.
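As a rough illustration of passing a list, the sketch below assumes a made-up CSV in which "NA" is a real category while empty fields and the word "missing" mark absent data. The column values and the "missing" marker are assumptions for the example, not part of the Kaggle file.

```python
import io

import pandas as pd

# Hypothetical CSV text: "NA" is a valid category, "missing" and "" are not.
csv_text = "Id,PoolQC,Fence\n1,NA,missing\n2,Gd,\n3,NA,MnPrv\n"

df_custom = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,       # drop pandas' built-in list of NaN strings
    na_values=["", "missing"],   # only these two markers become NaN
)

print(df_custom["PoolQC"].tolist())
print(df_custom["Fence"].isna().tolist())
```

Here "NA" stays a string in PoolQC, and only the two listed markers in Fence are read as NaN.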

A better idea is to use a less ambiguous string than 'NA' to represent data you don't want to be read as NaN.
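A minimal sketch of that last suggestion, assuming "Missing" as the placeholder (any string that is not in the default NaN list above would do; the name SENTINEL and the sample frame are made up for the example):

```python
import io

import pandas as pd

# Hypothetical placeholder; it is not one of the strings read_csv treats as NaN.
SENTINEL = "Missing"

# Small made-up frame standing in for the Kaggle training data.
df = pd.DataFrame({"Id": [1, 2, 3], "PoolQC": [None, "Gd", None]})
df["PoolQC"] = df["PoolQC"].fillna(SENTINEL)

# Round trip through CSV text using read_csv's default settings.
roundtrip = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(roundtrip["PoolQC"].tolist())
print(int(roundtrip["PoolQC"].isna().sum()))
```

With a placeholder like this, no extra arguments to read_csv are needed, at the cost of deviating from the "NA" label used in the challenge's data description.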

Upvotes: 1
