Vatsan28

Reputation: 41

Working on 50 million rows in pandas (python)

I am working on a DataFrame of 50 million rows in pandas. I need to run through a column and extract specific parts of the text. The column has string values that follow 4 or 5 patterns. I need to extract the text and replace the original string. I am using the apply function with regex for this, and it takes close to a day to execute. I feel this is inefficient. Or is this normal? Is there an approach I am missing to make it faster?

Upvotes: 2

Views: 4704

Answers (1)

Back2Basics

Reputation: 7806

Here are the docs:

http://pandas.pydata.org/pandas-docs/stable/indexing.html

http://pandas.pydata.org/pandas-docs/stable/text.html#extracting-substrings

Replacing text is easy, and no, a day isn't normal. Get rid of all the lists you had in an earlier version of this post; you don't need them. Add columns to the DataFrame if you need more space for data. Learn the pandas data types to make the data smaller.

import pandas as pd

df = pd.DataFrame()  # import your data at this step
# str.extract() needs a regex with at least one capture group
df['column'].str.extract(regex_thingy_here)

I'd write more but you took the code down.
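Since the original code is gone, here is a minimal sketch of the vectorized approach the answer describes. The column name, sample values, and regex are made up for illustration; the point is that `str.extract` makes one pass over the whole Series instead of calling a Python function per row via `apply`:

```python
import pandas as pd

# Hypothetical data: strings that follow a few fixed patterns, e.g. "ID:1234-A"
df = pd.DataFrame({"column": ["ID:1234-A", "ID:5678-B", "ID:9012-A"]})

# Vectorized extraction with a capture group; expand=False returns a Series,
# which can overwrite the original column in place.
df["column"] = df["column"].str.extract(r"ID:(\d+)", expand=False)

print(df["column"].tolist())  # ['1234', '5678', '9012']
```

If the extracted values repeat a lot, converting the result to the `category` dtype (`df["column"].astype("category")`) can also cut memory use substantially on 50 million rows.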

Upvotes: 2
