Tom
Tom

Reputation: 8790

How can I copy DataFrames with datetimes from Stack Overflow into Python?

I see frequent pandas examples on SO using time series that have spaces within the timestamps:

                     A
2020-01-01 09:20:00  0
2020-01-01 09:21:00  1
2020-01-01 09:22:00  2
2020-01-01 09:23:00  3
2020-01-01 09:24:00  4

Or this where the times aren't part of the index:

                dates    values cat
0 2020-01-01 09:20:00  0.758513   a
1 2020-01-01 09:21:00  0.337325   b
2 2020-01-01 09:22:00  0.618372   b
3 2020-01-01 09:23:00  0.878714   b
4 2020-01-01 09:24:00  0.311069   b

Is there a good way to copy these (or similar) data back into Python to work with? I have found posts like this and this which were lifesavers for getting many examples out of SO, yet I usually fail to find a copy/paste approach (using pd.read_clipboard() or pd.read_table()) that works for such data. This often then discourages me from attempting an answer1.

The examples above were created this way:

#one
import pandas as pd
import numpy

dr = pd.date_range('01-01-2020 9:20', '01-01-2020 9:24', freq='1T')
df1 = pd.DataFrame(index=dr, data=range(len(dr)), columns=['A'])

#two
df2 = pd.DataFrame({'dates':dr,
                    'values':np.random.rand(len(dr)),
                    'cat':np.random.choice(['a','b'],len(dr))})

1. And for the record, I reckon the onus should be on posters to post their data in a more copy-able format, else not get answered. For time series information, rather than pasting the string representation of the DataFrame I always try to post constructing code (using pd.date_range() or w/e). And I imagine using something like df.to_dict() would be better if there are specific (irregular interval) dates that need to be copied for the example.

Upvotes: 6

Views: 1389

Answers (2)

Parfait
Parfait

Reputation: 107567

What I normally do for Pandas SO questions is simply quote the date and time values (after copy-paste). If rows are numerous, I will use a text editor to do a Find/Replace to add quotes usually before year pattern (20XX) and after day pattern (-01). From there, using StringIO, specify parse_dates and as needed the index_col argument in read.csv (or read.table)

Datetime as Index

from io import StringIO
import pandas as pd

txt = '''             A
"2020-01-01 09:20:00" 0
"2020-01-01 09:21:00" 1
"2020-01-01 09:22:00" 2
"2020-01-01 09:23:00" 3
"2020-01-01 09:24:00" 4'''

df = pd.read_csv(StringIO(txt), sep="\s+", parse_dates=True)

df.index    
# DatetimeIndex(['2020-01-01 09:20:00', '2020-01-01 09:21:00',
#                '2020-01-01 09:22:00', '2020-01-01 09:23:00',
#                '2020-01-01 09:24:00'],
#               dtype='datetime64[ns]', freq=None)

Datetime as column

txt = '''         dates    value  cat
0 "2020-01-01 09:20:00"  0.758513   a
1 "2020-01-01 09:21:00"  0.337325   b
2 "2020-01-01 09:22:00"  0.618372   b
3 "2020-01-01 09:23:00"  0.878714   b
4 "2020-01-01 09:24:00"  0.311069   b'''

df = pd.read_csv(StringIO(txt), sep="\s+", index_col=[0], parse_dates=["dates"])

df.dtypes
# dates     datetime64[ns]
# values           float64
# cat               object
# dtype: object

Upvotes: 1

Ankur
Ankur

Reputation: 1110

I usually copy the entire string and then parse it. It's not perfect and you usually have to edit both the string and the dataframe to make it usable. Here is one example. This solution was already provided in this answer. I've only added the bit about parsing date/time.

import pandas as pd
from io import StringIO
from dateutil.parser import parse

# I added two more column names `date` and `time`.
# An advantage of having the string in your python code is that
# you can edit it in your text editor/jupyter notebook quickly and directly.
s = """date time A
2020-01-01 09:20:00  0
2020-01-01 09:21:00  1
2020-01-01 09:22:00  2
2020-01-01 09:23:00  3
2020-01-01 09:24:00  4"""

# Parse using whitespace separator. This will still not be perfect as we can
# see below.
df = pd.read_csv(StringIO(s), sep="\s+", index_col=False)
df
#          date      time  A
# 0  2020-01-01  09:20:00  0
# 1  2020-01-01  09:21:00  1
# 2  2020-01-01  09:22:00  2
# 3  2020-01-01  09:23:00  3
# 4  2020-01-01  09:24:00  4

# Combine date and time column together and drop the individual columns.
df['datetime'] = df['date'] + " " + df['time']
df = df.drop(['date', 'time'], axis=1)

# Use a somewhat universal parser in dateutil.parser.parse to parse the
# dates into proper dateime object.
df['datetime'] = df['datetime'].apply(parse)
df
#    A            datetime
# 0  0 2020-01-01 09:20:00
# 1  1 2020-01-01 09:21:00
# 2  2 2020-01-01 09:22:00
# 3  3 2020-01-01 09:23:00
# 4  4 2020-01-01 09:24:00

df.index
# RangeIndex(start=0, stop=5, step=1)

df.dtypes
# A                    int64
# datetime    datetime64[ns]
# dtype: object

df.columns
# Index(['A', 'datetime'], dtype='object')

One method to provide formatted and parseable dataframe on StackOverflow is by outputting a csv-formatted string.

# Continued from above
print(df.to_csv(index=False))
# A,datetime
# 0,2020-01-01 09:20:00
# 1,2020-01-01 09:21:00
# 2,2020-01-01 09:22:00
# 3,2020-01-01 09:23:00
# 4,2020-01-01 09:24:00

# We can indeed parse nicely from the csv-formatted string 
s_redux = df.to_csv(index=False)
pd.read_csv(StringIO(s_redux))
#    A             datetime
# 0  0  2020-01-01 09:20:00
# 1  1  2020-01-01 09:21:00
# 2  2  2020-01-01 09:22:00
# 3  3  2020-01-01 09:23:00
# 4  4  2020-01-01 09:24:00

Here is one attempt at parsing the second example dataframe. As before, we do need to make some "edits" to the dataframe to make it usable.

import pandas as pd
from io import StringIO
from dateutil.parser import parse

s="""                dates    values cat
0 2020-01-01 09:20:00  0.758513   a
1 2020-01-01 09:21:00  0.337325   b
2 2020-01-01 09:22:00  0.618372   b
3 2020-01-01 09:23:00  0.878714   b
4 2020-01-01 09:24:00  0.311069   b"""

df = pd.read_csv(StringIO(s), sep="\s+").reset_index()
df
#    level_0     level_1     dates    values cat
# 0        0  2020-01-01  09:20:00  0.758513   a
# 1        1  2020-01-01  09:21:00  0.337325   b
# 2        2  2020-01-01  09:22:00  0.618372   b
# 3        3  2020-01-01  09:23:00  0.878714   b
# 4        4  2020-01-01  09:24:00  0.311069   b

df['dates'] = df['level_1'] + " " + df['dates']
df = df.drop(['level_0', 'level_1'], axis=1)
df['dates'] = df['dates'].apply(parse)

df
#                 dates    values cat
# 0 2020-01-01 09:20:00  0.758513   a
# 1 2020-01-01 09:21:00  0.337325   b
# 2 2020-01-01 09:22:00  0.618372   b
# 3 2020-01-01 09:23:00  0.878714   b
# 4 2020-01-01 09:24:00  0.311069   b

df.dtypes
# dates     datetime64[ns]
# values           float64
# cat               object
# dtype: object

Upvotes: 3

Related Questions