Extract specific values from pandas profiling to a table

Question

I'm using pandas profiling to make HTML reports of some dataframes with 150+ attributes. I'd like to extract some of the information and arrange it in a simple table. Specifically, I need the number of missing data in each variable 'n_missing', disctinct values and their corresponding percentages 'p_missing'.

Something I've done, but I'm struggling to get there is this:

df1 = pd.read_excel('df.xlsx')
profile = df1.profile_report(title="Dataset Profiling Report")
profile.to_file('dataset_report.html')                            #HTML report


profset = profile.description_set          #Extracting the info from the profile object
print(profset.keys())

OUT[]: dict_keys(['analysis', 'table', 'variables', 'scatter', 'correlations', 'missing', 'messages', 'package', 'sample', 'duplicates'])

attributes = profset["variables"]
print(attributes.keys())
OUT: dict_keys(['Attribute 1', 'Attribute 2', 'Attribute 3', 'Attribute 4'...]) #All my columns or attributes.

I thought the "missing" key of the profile.description_set keys would be the one, but i get this:

missing = profset["missing"]
print(missing.keys())
dict_keys(['bar', 'matrix', 'heatmap', 'dendrogram'])

And none of these seem to be the right one.

I've been going through pandas profiling documentation and searching online, and only found a single possible example which led me to the previous code I've wrote.

The end goal is a table to export to excel that looks like this (ignore the actual values):

Attributes    n_Missing    p_missing    n_disctinct   p_disctinct
Attribute X          23         0.23              2           0.5
Attribute Y          50         0.50             50           1.0
Attribute Z           0         0.00            100           1.0
...

Extract specific values from pandas profiling to a table

Answers (1)

Related Questions