Reputation: 107
I'm using pandas profiling to make HTML reports of some dataframes with 150+ attributes. I'd like to extract some of the information and arrange it in a simple table. Specifically, I need the number of missing data in each variable 'n_missing', disctinct values and their corresponding percentages 'p_missing'.
Something I've done, but I'm struggling to get there is this:
df1 = pd.read_excel('df.xlsx')
profile = df1.profile_report(title="Dataset Profiling Report")
profile.to_file('dataset_report.html') #HTML report
profset = profile.description_set #Extracting the info from the profile object
print(profset.keys())
OUT[]: dict_keys(['analysis', 'table', 'variables', 'scatter', 'correlations', 'missing', 'messages', 'package', 'sample', 'duplicates'])
attributes = profset["variables"]
print(attributes.keys())
OUT: dict_keys(['Attribute 1', 'Attribute 2', 'Attribute 3', 'Attribute 4'...]) #All my columns or attributes.
I thought the "missing" key of the profile.description_set keys would be the one, but i get this:
missing = profset["missing"]
print(missing.keys())
dict_keys(['bar', 'matrix', 'heatmap', 'dendrogram'])
And none of these seem to be the right one.
I've been going through pandas profiling documentation and searching online, and only found a single possible example which led me to the previous code I've wrote.
The end goal is a table to export to excel that looks like this (ignore the actual values):
Attributes n_Missing p_missing n_disctinct p_disctinct
Attribute X 23 0.23 2 0.5
Attribute Y 50 0.50 50 1.0
Attribute Z 0 0.00 100 1.0
...
Upvotes: 2
Views: 2769
Reputation: 107
Ok, so after some grinding I answered my own question, and since there seems to be little action around pandas profiling, I'll write share it here and write the code that worked for me.
Its very self explanatory, let me know if you have any doubts.
Full code with dummy dataframe for full reproduction is shown.
Create dataframe with some data in it
dfa = pd.DataFrame(np.random.rand(20,3), columns=['attribute X', 'attribute Y', 'attribute Z'])
dfa.loc[dfa['attribute X']<0.5,'attribute X'] = np.nan
dfa.loc[dfa['attribute X']<0.75,'attribute X'] = 0.6
dfa.loc[dfa['attribute X']>0.75,'attribute X'] = "value X"
dfa
Out[346]:
attribute X attribute Y attribute Z
0 NaN 0.929178 0.439837
1 NaN 0.620489 0.146956
2 0.6 0.971534 0.048539
3 value X 0.113160 0.344989
4 value X 0.766421 0.105712
5 value X 0.706522 0.800705
6 value X 0.451648 0.763452
7 NaN 0.110620 0.976297
8 NaN 0.032442 0.650167
9 0.6 0.532029 0.028842
10 0.6 0.316751 0.010858
11 value X 0.096364 0.381514
12 0.6 0.897551 0.426949
13 value X 0.178309 0.974560
14 NaN 0.461955 0.573013
15 value X 0.446666 0.919223
16 0.6 0.526870 0.751822
17 value X 0.709627 0.334106
18 0.6 0.780467 0.940426
19 value X 0.056930 0.846433
Create the profile report and html report
profile_a = dfa.profile_report()
profile_a.to_file('dfa.html')
Access the keys of the data contained in the report
profset = profile_a.description_set
print(profset.keys())
Access the attributes/variables/columns data analyzed in the report
attributes = profset["variables"]
print(attributes.keys())
Create a list of the attributes analyzed to iterate over
attributes_keys = attributes.keys()
We can consult the available data for each attribute by accessing the keys of the attribute, then select whatever we want
attributes['attribute X'].keys()
Out[]: dict_keys(['n_distinct', 'p_distinct', 'is_unique', 'n_unique', 'p_unique', 'type', 'hashable', 'value_counts_without_nan', 'n_missing', 'n', 'p_missing', 'count', 'memory_size'])
To make the table of the data I decided to extract. I chose whatever needed from the previuos list, then iterated. The following for loop is not efficient, but very clear for anyone.
list_attribute = []
list_missing = []
list_p_missing = [] #'p_missing'
list_disctinct = [] #'n_distinct'
list_p_disctinct = [] #'p_distinct'
for i in attributes_keys:
list_attribute.append(i)
x = attributes[i]
list_missing.append(x['n_missing'])
list_p_missing.append(x['p_missing'])
list_disctinct.append(x['n_distinct'])
list_p_disctinct.append(x['p_distinct'])
df_missing = pd.DataFrame(columns=('Attribute','Missing'))
df_missing['Attribute'] = list_attribute
df_missing['Missing'] = list_missing
df_missing['Percentage Missing'] = list_p_missing
df_missing['Disctinct values'] = list_disctinct
df_missing['Percentage Disctinct'] = list_p_disctinct
df_missing['Percentage Disctinct'] = df_missing['Percentage Disctinct']*100
df_missing['Percentage Missing'] = df_missing['Percentage Missing']*100
Ready
print(df_missing)
Out[345]:
Attribute Missing ... Disctinct values Percentage Disctinct
0 attribute X 0 ... 2 13.333333
1 attribute Y 20 ... 20 100.000000
2 attribute Z 20 ... 20 100.000000
Upvotes: 3