Táwros

Reputation: 144

Quickest Pandas Value Updating Method?

I'm going through over 1 million patent applications and have to fix the dates, in addition to other things that I will work on later. I'm reading the file into a Pandas data frame, then running the following function:

def date_change():
    global apps  # the function rebinds the module-level frame below
    new_dates = {'m/y': []}
    for i, row in apps.iterrows():
        try:
            d = row['date'].rsplit('/')
            new_dates['m/y'].append('{}/19{}'.format(d[0], d[2]))
        except Exception as e:
            print('{}   {}\n{}\n{}'.format(i, e, row, d))
            new_dates['m/y'].append(np.nan)
    # join and drop return new DataFrames, so the results must be assigned;
    # drop also needs columns= (or axis=1) to remove a column rather than a row
    apps = apps.join(pd.DataFrame(new_dates))
    apps = apps.drop(columns='date')

Is there a quicker way of executing this? Is Pandas even the correct library to be using with a dataset this large? I've been told PySpark is good for big data, but how much will it improve the speed?

Upvotes: 0

Views: 155

Answers (1)

Quickbeam2k1

Reputation: 5437

It seems you are using strings to represent dates instead of datetime objects. I'd suggest doing something like

df['date'] = pd.to_datetime(df['date'])

This way you don't need to iterate at all, since that function operates on the whole column. You might then want to check the following answer, which uses dt.strftime to format your column appropriately.
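For instance, assuming the column holds strings like '6/12/85' with a two-digit year (the column name 'date' and the sample values here are taken from / modeled on the question, not confirmed), a vectorized sketch could look like:

```python
import pandas as pd

# Hypothetical sample data shaped like the question's two-digit-year dates
apps = pd.DataFrame({'date': ['6/12/85', '1/3/77', None]})

# Parse the whole column at once; unparseable values become NaT instead
# of raising, mirroring the try/except in the original loop
parsed = pd.to_datetime(apps['date'], format='%m/%d/%y', errors='coerce')

# Reformat to the month/year string the loop was building
# (note: %m zero-pads the month, unlike the original string concatenation)
apps['m/y'] = parsed.dt.strftime('%m/%Y')
```

Note that %y follows Python's strptime convention (69–99 map to 1900s), which matches the hard-coded '19' prefix in the question only if all years are in that range.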

If you could show input and expected output, I could add the full solution here.

Besides, 1 million rows should typically be manageable for pandas (depending on the number of columns, of course).

Upvotes: 1
