Reputation: 159
I have a large CSV file, but for simplicity I have removed many rows and columns. It looks like this:
col1 | col2 | col3 |
---|---|---|
? | 27 | 13000 |
? | 27 | 13000 |
validvalue | 30 | |
# | 26 | 14000 |
validvalue | 25 | |
I want to detect the missing values in this CSV file. For example, missing values in col1 are indicated by ? and #, and in col3 by empty cells. Things would have been easier if the data set had empty cells for all missing values; in that case I could have gone for the isnull function of the pandas DataFrame. But the question is how to identify columns that use something other than an empty cell as a missing value.
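For reference, if all missing values really were empty cells, a minimal sketch like the following would already be enough (assuming the file is called test.csv, as below):
import pandas as pd

# Empty cells are read in as NaN, so isnull() catches them directly.
df = pd.read_csv('test.csv')
print(df.isnull().sum())   # missing-value count per column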
Approach if the CSV has a low number of records:
import pandas as pd

df = pd.read_csv('test.csv')
for e in df.columns:
    # inspect every distinct value per column and spot the odd markers manually
    print(df[e].unique())
This gives all unique values in each column, but I don't find it efficient.
Is there any other way to detect missing values that are denoted by special characters such as ?, #, *, etc. in the CSV file?
Upvotes: 2
Views: 3031
Reputation: 1155
As you already stated,
there is no way to find the garbage values other than using the "unique" function.
But if the number of possible values is big, you can help yourself with .isalnum()
to limit the candidates to non-alphanumeric strings. For example:
import pandas as pd

df = pd.DataFrame({"col1": ['?', '?', 'validvalue', '$', 'validvalue'],
                   "col2": [27, 27, 30, 26, 25],
                   "col3": [13000, 13000, None, 14000, None]})
# keep only rows whose col1 value is not alphanumeric, then count them
df[~df['col1'].str.isalnum()]['col1'].value_counts()
# Output:
# ?    2
# $    1
Once you have found all possible NA values, you can use mask
on each column (if the missing-value markers differ from column to column) or on the whole dataset, for example:
na_values = ('?', '#')          # every marker that should count as missing
df.mask(df.isin(na_values))     # replaces those markers with NaN
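After that, the usual NaN tooling works again; a minimal sketch reusing the frame defined above:
cleaned = df.mask(df.isin(na_values))

# The former '?' / '#' entries are NaN now, so isnull() finds them
# together with the cells that were already missing (None in col3).
print(cleaned.isnull().sum())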
Upvotes: 3