Viv

Reputation: 1584

How to profile large datasets with Pandas profiling?

The data is not perfectly clean, but pandas handles it without issue. The pandas library provides many extremely useful functions for EDA.

But when I profile a large dataset, i.e. 100 million records with 10 columns read from a database table, the profiling does not complete and my laptop runs out of memory. The data is around 6 GB as CSV, my RAM is 14 GB, and my idle usage is approximately 3-4 GB.

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="myoutput.html")

I have also tried the check_recoded = False option, but it does not fully solve the problem. Is there any way to read the data in chunks and still generate a single summary report for the whole dataset? Or is there any other way to use this function with a large dataset?

Upvotes: 13

Views: 25690

Answers (5)

Another option is to reduce the data before profiling.

One way to do this is to profile a random sample:

df.sample(n=number)

See the pandas documentation for DataFrame.sample for more details.
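
Since the full table may not fit in memory, the sample can also be drawn while reading in chunks. The following is a minimal sketch, not a tested recipe for the asker's database: it relies on the `chunksize` parameter of `pd.read_sql_query`, and the helper name `sample_in_chunks`, the in-memory SQLite table, and the 10% fraction are all illustrative assumptions.

```python
import sqlite3

import pandas as pd


def sample_in_chunks(query, conn, frac=0.01, chunksize=100_000, seed=0):
    """Read a query result chunk by chunk and keep a random fraction of each chunk."""
    parts = []
    for i, chunk in enumerate(pd.read_sql_query(query, conn, chunksize=chunksize)):
        # Vary the seed per chunk so chunks do not all pick the same row positions.
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)


# Tiny in-memory SQLite table standing in for the real database (hypothetical data).
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({
    "id": list(range(10_000)),
    "value": [i % 7 for i in range(10_000)],
})
demo.to_sql("demo", conn, index=False)

# Only one chunk plus the accumulated sample is held in memory at a time.
sampled = sample_in_chunks("select * from demo", conn, frac=0.1, chunksize=1_000)
print(len(sampled))
```

The resulting `sampled` frame could then be passed to `ProfileReport` as in the question. The report describes the sample rather than the full table, so rare values may be missed.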

Upvotes: 0

Giorgos Myrianthous

Reputation: 39950

pandas-profiling v2.4 introduced a minimal mode that disables expensive computations (such as correlations and dynamic binning):

from pandas_profiling import ProfileReport


profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")
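
If even minimal mode does not fit in memory, the cheapest statistics can be accumulated chunk by chunk with plain pandas. This is only a sketch of that idea and not a feature of pandas-profiling: the helper `streaming_numeric_summary` and the in-memory SQLite table are hypothetical, and it assumes every chunk has the same numeric columns.

```python
import sqlite3

import pandas as pd


def streaming_numeric_summary(chunks):
    """Combine count, sum, min and max per numeric column across DataFrame chunks."""
    totals = None
    for chunk in chunks:
        numeric = chunk.select_dtypes(include="number")
        part = pd.DataFrame({
            "count": numeric.count(),
            "sum": numeric.sum(),
            "min": numeric.min(),
            "max": numeric.max(),
        })
        if totals is None:
            totals = part
            continue
        totals = pd.DataFrame({
            "count": totals["count"] + part["count"],
            "sum": totals["sum"] + part["sum"],
            # Row-wise min/max over the two candidates; NaN is skipped by default.
            "min": pd.concat([totals["min"], part["min"]], axis=1).min(axis=1),
            "max": pd.concat([totals["max"], part["max"]], axis=1).max(axis=1),
        })
    if totals is None:
        raise ValueError("no chunks were provided")
    totals["mean"] = totals["sum"] / totals["count"]
    return totals


# Tiny in-memory SQLite table standing in for the real database (hypothetical data).
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({
    "id": list(range(10_000)),
    "value": [i % 7 for i in range(10_000)],
})
demo.to_sql("demo", conn, index=False)

summary = streaming_numeric_summary(
    pd.read_sql_query("select * from demo", conn, chunksize=1_000)
)
print(summary)
```

Counts, sums, minima and maxima combine exactly across chunks. Statistics such as medians, distinct counts or correlations do not, which is why this covers far less than a full profile report.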

Upvotes: 15

cptnJ

Reputation: 230

The syntax for disabling the correlation calculations (which heavily reduces the amount of computation) changed considerably between pandas-profiling 1.4 and the current (beta) version, pandas-profiling 2.0. It is now:

profile = df.profile_report(correlations={
    "pearson": False,
    "spearman": False,
    "kendall": False,
    "phi_k": False,
    "cramers": False,
    "recoded": False,
})

Additionally, you can reduce the calculations performed by disabling the computation of bins for the histogram plots:

profile = df.profile_report(plot={'histogram': {'bins': None}})

Upvotes: 5

Amitesh Verma

Reputation: 11

The ability to disable the correlation check was added with the implementation of issue #43, which is not part of the latest version of pandas-profiling (1.4) available on PyPI. It was implemented afterwards and will, I guess, be available in the next version. In the meantime, if you really need it, you can download the current version from GitHub and use it, for example by adding it to your PYTHONPATH:

#!/bin/sh

PROF_DIR="$HOME/Git/pandas-profiling/"

export PYTHONPATH="$PYTHONPATH:$PROF_DIR"

jupyter notebook

Upvotes: -2

Ashutosh Kumar

Reputation: 321

Did you try the option below? Running the correlation analysis on large free-text fields with pandas profiling might cause this issue.

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df, check_correlation=False)

Please refer to the GitHub issue below for more details: https://github.com/pandas-profiling/pandas-profiling/issues/84
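
If free-text columns do turn out to be the culprit, another workaround is to drop them before profiling. The sketch below is an assumption on my part and is not taken from the linked issue: the helper `drop_free_text_columns` is hypothetical, and both thresholds are arbitrary values that would need tuning for real data.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def drop_free_text_columns(df, max_unique_ratio=0.5, min_avg_len=30):
    """Return df without text columns that look like free text.

    A column is treated as free text when most of its values are distinct
    and their average length is large. Both thresholds are assumptions.
    """
    to_drop = []
    for col in df.columns:
        if not (is_object_dtype(df[col]) or is_string_dtype(df[col])):
            continue  # numeric, datetime, etc. are left alone
        values = df[col].dropna().astype(str)
        if values.empty:
            continue
        unique_ratio = values.nunique() / len(values)
        avg_len = values.str.len().mean()
        if unique_ratio > max_unique_ratio and avg_len >= min_avg_len:
            to_drop.append(col)
    return df.drop(columns=to_drop)


# Small illustrative frame (hypothetical data).
demo = pd.DataFrame({
    "category": ["a", "b", "a", "b"],
    "comment": [f"long free text comment number {i} " * 3 for i in range(4)],
    "amount": [1.0, 2.0, 3.0, 4.0],
})
slim = drop_free_text_columns(demo)
print(list(slim.columns))
```

Here the short, repetitive `category` column and the numeric `amount` column are kept, while the long, all-distinct `comment` column is dropped before the frame is handed to the profiler.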

Upvotes: 1
