Reputation: 1584
Data is not perfectly clean, but is used without issue with pandas. The pandas library provides many extremely useful functions for EDA.
But when I use pandas-profiling on large data, i.e. 100 million records with 10 columns read from a database table, it does not complete and my laptop runs out of memory. The data is around 6 GB as CSV, my RAM is 14 GB, and my idle usage is approximately 3-4 GB.
df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="myoutput.html")
I have also tried the check_recoded = False option, but the profiling still does not run to completion.
Is there any way to read the data in chunks and still generate a summary report for the dataset as a whole? Or is there another way to use this function with a large dataset?
Upvotes: 13
Views: 25690
Reputation: 326
Another option is to reduce the data.
One way to do this is with sample:
df.sample(number)
More details are in the pandas documentation.
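Since the full table does not fit in memory in the question, calling sample on an already loaded DataFrame may not be possible. A minimal sketch of one workaround, assuming the data can be read with the chunksize parameter of pd.read_sql_query: read the query result in chunks and keep a random fraction of each chunk. The helper name sample_in_chunks, the demo table, and the in-memory SQLite connection are hypothetical and only there to make the example self-contained.

```python
import sqlite3

import pandas as pd


def sample_in_chunks(query, conn, frac=0.01, chunksize=100_000, seed=0):
    """Read a query result chunk by chunk and keep a random fraction of each chunk."""
    parts = []
    # With chunksize set, read_sql_query returns an iterator of DataFrames
    # instead of building one large DataFrame up front.
    for i, chunk in enumerate(pd.read_sql_query(query, conn, chunksize=chunksize)):
        # Use a different but reproducible seed for every chunk.
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)


# Hypothetical demo table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({"id": range(10_000), "value": range(10_000)})
demo.to_sql("demo", conn, index=False)

sampled = sample_in_chunks("select * from demo", conn, frac=0.1, chunksize=1_000)
print(len(sampled), "rows kept out of", len(demo))
```

The resulting sampled DataFrame can then be passed to the profiling report in place of the full table. Note that the report then describes the sample, not the full dataset, so rare values may be missing from it.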
Upvotes: 0
Reputation: 39950
v2.4 introduced the minimal mode, which disables expensive computations (such as correlations and dynamic binning):
from pandas_profiling import ProfileReport
profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")
Upvotes: 15
Reputation: 230
The syntax to disable the calculation of correlations (thereby heavily reducing calculations) has changed a lot between pandas-profiling=1.4 and the current (beta-)version pandas-profiling=2.0 to the following:
profile = df.profile_report(
    correlations={
        "pearson": False,
        "spearman": False,
        "kendall": False,
        "phi_k": False,
        "cramers": False,
        "recoded": False,
    }
)
Additionally, you can reduce the work performed by disabling the calculation of bins for the histogram plots.
profile = df.profile_report(plot={'histogram': {'bins': None}})
Upvotes: 5
Reputation: 11
The ability to disable the correlation check was added with the implementation of issue #43, which is not part of the latest version of pandas-profiling (1.4) available on PyPI. It was implemented afterwards and will, I guess, be available in the next version. In the meantime, if you really need it, you can download the current version from GitHub and use it, for example by adding it to your PYTHONPATH.
PROF_DIR="$HOME/Git/pandas-profiling/"
export PYTHONPATH="$PYTHONPATH:$PROF_DIR"
jupyter notebook
Upvotes: -2
Reputation: 321
Did you try the option below? Running correlation analysis on large free-text fields with pandas-profiling might cause this issue.
df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df, check_correlation=False)
Please refer to the GitHub issue below for more details: https://github.com/pandas-profiling/pandas-profiling/issues/84
Upvotes: 1