Reputation: 33
I have a csv document, an example below:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
There are 2 columns, representing dates, I want to change column type from string to datetime
format.
The column creation
(string) is the date of creation and can be expressed in three forms in addition to be unspecified:
The column timespan
(string): is represented according to the form PnYnMnD
, where P
is a literal value that starts the expression, nY
is the number of years followed by a literalY
, nM
is the number of months followed by a literal M
, nD
is the number of days followed by a literal D
, where any of these numbers and corresponding designators may be absent if they are equal to 0. A minus sign may appear before the P
to indicate negative duration.
I am trying to change column type from string to datetime
format using pd.to_datetime()
function:
def do_process(f_path):
global my_ocan
my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'], parse_dates = ['creation', 'timespan'])
my_ocan['timespan'] = pd.to_datetime(my_ocan['timespan'], format='%Y%m%d', errors='ignore', yearfirst=True)
my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format='%Y%m%d', errors='ignore', yearfirst=True)
#print(my_ocan['citing'])
print(my_ocan.info())
return my_ocan
When executing print(my_ocan.info())
I get '214 non-null object' instead of datetime
. What am I missing? What's the issue?
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 7 columns):
oci 214 non-null object
citing 214 non-null object
cited 214 non-null object
creation 214 non-null object
timespan 214 non-null object
journal_sc 214 non-null object
author_sc 214 non-null object
dtypes: object(7)
Thank you and have an awesome day y'all :)
Upvotes: 0
Views: 214
Reputation: 114
There are a couple of issues to address in your code.
First, notice that in your .csv
file the first column is:
oci,citing,cited,creation,timespan,journal_sc,author_sc
So when you're building a database with pd.read_csv
the first row of your dataframe will be the first row of your .csv
file. You end up with your dataframe being :
oci ... author_sc
0 oci ... author_sc
1 0200100000236252421370109080537010700020300040... ... no
instead of :
[2 rows x 7 columns]
oci ... author_sc
1 0200100000236252421370109080537010700020300040... ... no
I don't think you want that.
You also suppress errors with pd.to_datetime
which isn't great. Once you remove errors='ignore'
, you'll notice that the conversion to datetime
fails because the input doesn't match the '%Y%m%d'
format.
And of course it doesn't since your creation
column is:
0 creation
1 1985-04
And the first row is "creation"
which doesn't match '%Y%m%d'
.
Second, the date format should be '%Y-%m-%d'
, since the dates are of a YYYY-MM-DD format and not YYYYMMDD.
(Careful, because when your format has days in it for example and the input doesn't, a day is added to the date).
Third, the timespan
values aren't dates, they're periods (a Java period if I'm not mistaken), so using pd.to_datetime
on them doesn't work. I haven't found a python function that does the conversion for you so you might have to do the work yourself in a good old string parsing fashion, or dig deeper in python libs.
Fourth, as the documentation says, pd.to_datetime
is deprecated so you'd do well to stay away and use for example datetime.strptime
instead which does a very similar job.
Finally, here's a working version of your code:
def do_process(f_path):
global my_ocan
my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'], parse_dates = ['creation', 'timespan'])
my_ocan = my_ocan.iloc[1:] # to remove the first row
my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
# Period parsing on my_ocan['timespan']
print(my_ocan.info())
return my_ocan
Which outputs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 1 to 1
Data columns (total 7 columns):
oci 1 non-null object
citing 1 non-null object
cited 1 non-null object
creation 1 non-null datetime64[ns]
timespan 1 non-null object
journal_sc 1 non-null object
author_sc 1 non-null object
dtypes: datetime64[ns](1), object(6)
Notice creation
is of type datetime64[ns]
now.
Upvotes: 1