Reputation: 281
I want to determine how full each column in a .csv file is, and add these values to a list ordered by how full each column is. The fullness should be expressed as a percentage.
The .csv file is really large, so it would be useful to know which columns contain little data and which contain the most; the columns with more data will be more useful to me.
What I've gotten so far:
import pandas as pd
ranked_list = []
csv_filepath = r"some_path_here"
data = pd.read_csv(csv_filepath)
for column in data:
    way_to_calculate_percentage
    ranked_list.append(way_to_calculate_percentage)
print(sorted(ranked_list))
I would like to know if there is some way to determine this "way_to_calculate_percentage"
Cheers!
Upvotes: 3
Views: 4089
Reputation: 862761
Check non-missing values with DataFrame.notna and take the mean, if you need the fraction of non-missing values:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,np.nan,4,np.nan,np.nan,4],
    'C':[7,8,9,4,2,3],
    'D':[1,3,np.nan,7,1,0],
})
s1 = data.notna().mean()
print (s1)
A 1.000000
B 0.500000
C 1.000000
D 0.833333
dtype: float64
If you need the fraction of missing values, use DataFrame.isna with mean:
s2 = data.isna().mean().sort_values()
print (s2)
A 0.000000
C 0.000000
D 0.166667
B 0.500000
dtype: float64
Then it is possible to analyze the values with Series.nlargest, Series.nsmallest and, if necessary, Series.sort_values:
s3 = s2.nlargest(2)
print (s3)
B 0.500000
D 0.166667
dtype: float64
s4 = s2.nsmallest(2)
print (s4)
A 0.0
C 0.0
dtype: float64
s5 = s2.sort_values()
print (s5)
A 0.000000
C 0.000000
D 0.166667
B 0.500000
dtype: float64
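Tying this back to the original question, the same statistic can be scaled to a percentage and sorted with the fullest columns first; a minimal sketch using the sample frame above:

```python
import numpy as np
import pandas as pd

# Same sample frame as above
data = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, 4, np.nan, np.nan, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, np.nan, 7, 1, 0],
})

# Percentage of non-missing values per column, fullest first
fullness = (data.notna().mean() * 100).sort_values(ascending=False)
ranked_list = list(fullness.items())  # [(column, percent), ...] pairs
print(ranked_list)
```

This replaces the `way_to_calculate_percentage` placeholder in the question: the whole loop collapses into one vectorized expression.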
Upvotes: 5
Reputation: 298
My solution uses the memory footprint, which reports how much space each column occupies.
import pandas as pd
import os
dir_path = 'M:/Desktop/Python-Test/'
test_file = os.path.join(dir_path, 'test_file.csv')
pd1 = pd.read_csv(test_file)
print(pd1.memory_usage(index=False, deep=True))
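To rank columns by that footprint, the per-column byte counts can be sorted; a sketch with a small hypothetical frame standing in for the file:

```python
import pandas as pd

# Hypothetical frame; deep=True measures object (string) contents too
df = pd.DataFrame({'x': range(1000), 'y': ['text'] * 1000})

# Bytes per column, largest first
sizes = df.memory_usage(index=False, deep=True).sort_values(ascending=False)
print(sizes)
```

Note this ranks by storage size, not by how many values are filled in, so it answers a slightly different question than fullness.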
Upvotes: 0
Reputation: 609
Does this help?
df
Out[13]:
ColumnA ColumnB ColumnC ColumnD
0 TypeA A a x
1 TypeA B NaN x
2 TypeA C b x
3 TypeA D NaN x
4 TypeA E NaN x
5 TypeB F NaN x
6 TypeB A g x
7 TypeC B NaN x
8 TypeC Z NaN NaN
9 TypeC C NaN NaN
10 TypeD A h NaN
df.notna().sum()/len(df)*100
Out[14]:
ColumnA 100.000000
ColumnB 100.000000
ColumnC 36.363636
ColumnD 72.727273
dtype: float64
Upvotes: 1
Reputation: 93171
Assuming you have the following dataframe:
a b
0 NaN NaN
1 1.0 NaN
2 2.0 NaN
3 3.0 4.0
You can calculate the fraction of missing values in each column like this:
null_percent = df.isnull().sum() / df.shape[0]
Result:
a 0.25
b 0.75
dtype: float64
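Since the question mentions the .csv is really large, the same counts can also be accumulated chunk by chunk so the whole file never sits in memory; a sketch assuming pandas' `chunksize` option, with an in-memory buffer standing in for the file:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
csv_data = io.StringIO("a,b\n1,\n2,\n3,4\n,\n")

non_null = None
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):  # 2 rows per chunk
    counts = chunk.notna().sum()                  # non-missing per column
    non_null = counts if non_null is None else non_null + counts
    total += len(chunk)

# Percentage of non-missing values per column, fullest first
fullness = (non_null / total * 100).sort_values(ascending=False)
print(fullness)
```

With a real file, `io.StringIO(...)` would be replaced by the path, and `chunksize` raised to something like 100_000 rows.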
Upvotes: 1