sats

Reputation: 159

To check missing values in csv file using Pandas

I have a large csv file, but for simplicity I have removed many rows and columns. It looks like this:

col1        col2  col3
?           27    13000
?           27    13000
validvalue  30
#           26    14000
validvalue  25

I want to detect missing values in this csv file. For example, missing values in col1 are indicated by ? and #, and in col3 by empty cells. Things would have been easier if the data set had empty cells for all missing values; in that case I could have used the isnull function of the pandas DataFrame. But the question is how to identify missing values when a column marks them with something other than an empty cell.
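For instance, the empty cells in col3 are easy, because read_csv parses them as NaN and isnull() picks them up directly:

import pandas as pd

# Empty cells become NaN on read, so isnull() finds them directly
df = pd.read_csv('test.csv')
print(df.isnull().sum())   # per-column count of truly empty cells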

Approach if the csv has a low number of records:

import pandas as pd

df = pd.read_csv('test.csv')
for e in df.columns:
    print(df[e].unique())

This gives me all the unique values in each column, but I don't find it efficient.

Is there any other way to detect missing values that are denoted by special characters (such as ?, #, *, etc.) in the csv file?

Upvotes: 2

Views: 3031

Answers (1)

Daniel Wlazło

Reputation: 1155

As you already stated

there is no way to find the garbage values other than using the unique function.

But if the number of possible values is large, you can help yourself by using .isalnum() to limit the check to non-alphanumeric strings. For example:

import pandas as pd

df = pd.DataFrame({"col1": ['?', '?', 'validvalue', '$', 'validvalue'],
                   "col2": [27, 27, 30, 26, 25],
                   "col3": [13000, 13000, None, 14000, None]})

# Keep only the rows whose col1 value is not purely alphanumeric, then count them
df[~df['col1'].str.isalnum()]['col1'].value_counts()

#Output:
#?    2
#$    1
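The same check can be run over every text column at once. A small sketch (it assumes the candidate columns hold strings, and casts to str so genuine NaN cells don't break .isalnum()):

for col in df.select_dtypes(include='object').columns:
    # Rows whose value in this column is not purely alphanumeric
    suspicious = df.loc[~df[col].astype(str).str.isalnum(), col]
    if not suspicious.empty:
        print(col)
        print(suspicious.value_counts())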

Once you have found all possible NA values, you can use mask on each column (if the missing-value markers differ from column to column) or on the whole dataset, for example:

na_values = ('?', '#')
df.mask(df.isin(na_values))
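After the mask, those markers become NaN, so the usual isnull() check now covers every kind of missing value:

# With '?' and '#' masked to NaN, isnull() reports all missing values together
cleaned = df.mask(df.isin(na_values))
print(cleaned.isnull().sum())

And if the markers are known up front, you can get the same result while reading the file, since read_csv accepts an na_values argument, e.g. pd.read_csv('test.csv', na_values=['?', '#']).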

Upvotes: 3
