Reputation: 27
I have a CSV file and I want to check for each row if it has one or more values in different columns which I specified in a list. If there is no value in any column it should add up to a counter so I know how many rows are empty. But if it has one value in one column from the list it shouldn't do anything.
The CSV file is like this:
I made the code below but it is returning 0 which is not correct.
import pandas as pd
testfile = 'test1.csv'
df = pd.read_csv(testfile)
column_names = ['Uniprot_acc',
'Uniprot_id',
'Interpro_domain',
'Ensembl_geneid',
'Ensembl_transcriptid',
'SIFT_score',
'SIFT_pred']
counter = 0
for row in df:
for column_name in column_names:
if column_name in row:
if column_name == None:
counter =+ 1
print(counter)
What I want to know is how many rows don't contain anything. It should check per row for every column in the list if there is no value. And if indeed there is nothing in the row it should count. So in this example it should be 3.
Upvotes: 2
Views: 4824
Reputation: 862731
Use:
counter = df[column_names].isnull().all(axis=1).sum()
print (counter)
Sample:
df = pd.DataFrame({
'A':list('abcdef'),
'Uniprot_acc':[np.nan,5,4,5,np.nan,4],
'Uniprot_id':[np.nan,8,9,4,np.nan,np.nan],
'Interpro_domain':[np.nan,3,np.nan,7,np.nan,0],
'E':[5,3,np.nan,9,np.nan,4],
})
column_names = ['Uniprot_acc',
'Uniprot_id',
'Interpro_domain']
print (df)
A Uniprot_acc Uniprot_id Interpro_domain E
0 a NaN NaN NaN 5.0
1 b 5.0 8.0 3.0 3.0
2 c 4.0 9.0 NaN NaN
3 d 5.0 4.0 7.0 9.0
4 e NaN NaN NaN NaN
5 f 4.0 NaN 0.0 4.0
counter = df[column_names].isnull().all(axis=1).sum()
print (counter)
2
Explanation:
First filter columns by list:
print (df[column_names])
Uniprot_acc Uniprot_id Interpro_domain
0 NaN NaN NaN
1 5.0 8.0 3.0
2 4.0 9.0 NaN
3 5.0 4.0 7.0
4 NaN NaN NaN
5 4.0 NaN 0.0
Then check missing values None
and NaN
s:
print (df[column_names].isnull())
Uniprot_acc Uniprot_id Interpro_domain
0 True True True
1 False False False
2 False False True
3 False False False
4 True True True
5 False True False
Check if all Trues per rows by DataFrame.all
:
print (df[column_names].isnull().all(axis=1))
0 True
1 False
2 False
3 False
4 True
5 False
dtype: bool
And last count only True
s by sum
.
Upvotes: 3