mel
mel

Reputation: 2790

Count the number of occurence per column with Pandas

I want to count the number of occurrences per column of two different values the first one is the number of null and the second one is the number of \N in my dataframe. Example I've got:

   A   B   C   D   E   D
1  \N  1   \N  12  1    
2  4       \N  3   0   \N
3      4   M       \N  1

I'm expected the following result:

A  2
B  1
C  2
D  1
E  1
F  2

I already succeed to count the number of missing value with the following code:

df = pd.read_csv("mypath/myFile", sep=',')
null_value = df.isnull().sum()

But the following code doesn't work:

break_line = df[df == '\N'].count()
return break_line + null_value

I get the following error

TypeError: Could not compare ['\N'] with block values

Upvotes: 1

Views: 403

Answers (4)

Mykola Zotko
Mykola Zotko

Reputation: 17882

A solution which uses only vectorized calculations:

df.isna().sum() + (df == '\\N').sum()

Output:

A    2
B    1
C    2
D    1
E    1
F    2

Upvotes: 0

mkln
mkln

Reputation: 14963

one liner:

ns = df.applymap(lambda x: x == '\N').sum(axis = 0)

 null_value + ns

A    2
B    1
C    2
D    1
E    1
F    2

Upvotes: 2

Alexander
Alexander

Reputation: 109686

I assume you only want to count values where the string ends with '\N'. If not, you can use str.contains instead.

I use a dictionary comprehension to loop through the columns of the dataframe and a vectorized str function to count the number of rows with \N at the end.

df = pd.DataFrame({'A': ['\N', 4, None], 
                   'B': [1, None, 4], 
                   'C': ['\N', '\N', 'M'], 
                   'D': [12, 3, None], 
                   'E': [1, 0, '\N'], 
                   'F': [None, '\N', 1]})

>>> df
      A   B   C   D   E     F
0    \N   1  \N  12   1  None
1     4 NaN  \N   3   0    \N
2  None   4   M NaN  \N     1    

>>> pd.Series({col: df[col].str.endswith('\N').sum() 
               if df[col].dtype == 'object' else 0 
               for col in df}) + df.isnull().sum()
A    2
B    1
C    2
D    1
E    1
F    2
dtype: int64

Upvotes: 1

Cleb
Cleb

Reputation: 26027

You can simply do the following using applymap:

df.applymap(lambda x: x == '\N').sum() + df.isnull().sum()

which gives you the desired output:

A      2
B      1
C      2
D      1
E      1
F      2
dtype: int64

Note: You use D twice; I now replaced that by F.

Upvotes: 1

Related Questions