Lisa Siurina
Lisa Siurina

Reputation: 33

How to convert a column in a pandas dataframe to datatime?

I have a csv document, an example below:

oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no

There are 2 columns, representing dates, I want to change column type from string to datetime format.

The column creation (string) is the date of creation and can be expressed in three forms in addition to be unspecified:

  1. "yyyy-mm-dd" (e.g. "2019-09-20")
  2. "yyyy-mm" (e.g. "2019-09")
  3. "yyyy" (e.g. "2019")

The column timespan (string): is represented according to the form PnYnMnD, where P is a literal value that starts the expression, nY is the number of years followed by a literalY, nM is the number of months followed by a literal M, nD is the number of days followed by a literal D, where any of these numbers and corresponding designators may be absent if they are equal to 0. A minus sign may appear before the P to indicate negative duration.

I am trying to change column type from string to datetime format using pd.to_datetime() function:

def do_process(f_path):
    global my_ocan

    my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'], parse_dates = ['creation', 'timespan'])
    my_ocan['timespan'] = pd.to_datetime(my_ocan['timespan'], format='%Y%m%d', errors='ignore', yearfirst=True)
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format='%Y%m%d', errors='ignore', yearfirst=True)
    #print(my_ocan['citing'])
    print(my_ocan.info())

    return my_ocan

When executing print(my_ocan.info()) I get '214 non-null object' instead of datetime. What am I missing? What's the issue?

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 7 columns):
oci           214 non-null object
citing        214 non-null object
cited         214 non-null object
creation      214 non-null object
timespan      214 non-null object
journal_sc    214 non-null object
author_sc     214 non-null object
dtypes: object(7)

Thank you and have an awesome day y'all :)

Upvotes: 0

Views: 214

Answers (1)

chris
chris

Reputation: 114

There are a couple of issues to address in your code.

First, notice that in your .csv file the first column is:

oci,citing,cited,creation,timespan,journal_sc,author_sc

So when you're building a database with pd.read_csv the first row of your dataframe will be the first row of your .csv file. You end up with your dataframe being :

                                                 oci  ...  author_sc
0                                                oci  ...  author_sc
1  0200100000236252421370109080537010700020300040...  ...         no

instead of :

[2 rows x 7 columns]
                                                 oci  ... author_sc
1  0200100000236252421370109080537010700020300040...  ...        no

I don't think you want that.

You also suppress errors with pd.to_datetime which isn't great. Once you remove errors='ignore', you'll notice that the conversion to datetime fails because the input doesn't match the '%Y%m%d' format.

And of course it doesn't since your creation column is:

0    creation
1     1985-04

And the first row is "creation" which doesn't match '%Y%m%d'.

Second, the date format should be '%Y-%m-%d', since the dates are of a YYYY-MM-DD format and not YYYYMMDD. (Careful, because when your format has days in it for example and the input doesn't, a day is added to the date).

Third, the timespan values aren't dates, they're periods (a Java period if I'm not mistaken), so using pd.to_datetime on them doesn't work. I haven't found a python function that does the conversion for you so you might have to do the work yourself in a good old string parsing fashion, or dig deeper in python libs.

Fourth, as the documentation says, pd.to_datetime is deprecated so you'd do well to stay away and use for example datetime.strptime instead which does a very similar job.

Finally, here's a working version of your code:

 def do_process(f_path):
     global my_ocan

     my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'], parse_dates = ['creation', 'timespan'])
     my_ocan = my_ocan.iloc[1:]  # to remove the first row
     my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
     # Period parsing on my_ocan['timespan']
     print(my_ocan.info())

     return my_ocan

Which outputs:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 1 to 1
Data columns (total 7 columns):
oci           1 non-null object
citing        1 non-null object
cited         1 non-null object
creation      1 non-null datetime64[ns]
timespan      1 non-null object
journal_sc    1 non-null object
author_sc     1 non-null object
dtypes: datetime64[ns](1), object(6)

Notice creation is of type datetime64[ns] now.

Upvotes: 1

Related Questions