Reputation: 149
I have a number of large CSV files (each at around two million rows), which have rows of timestamps looking like this:
16.01.2019 12:52:22
16.01.2019 12:52:23
16.01.2019 12:52:24
Given that there's an entry for each second (over the course of about a year), it should be understandable why there are so many rows. I want to be more flexible, which is why I want to split the timestamps into four columns: date, date+hour, date+hour+minute, date+hour+second, so that I'm able to group timestamps at will. This is how I'm doing it:
dates = []
hours = []
minutes = []
seconds = []
i = 0
# initial values
dates.append(str(get_date(i).date()))
hours.append(str(get_date(i).hour))
minutes.append(str(get_date(i).minute))
seconds.append(str(get_date(i).second))
for i in range(len(df)):
    if i < len(df) - 1:
        if str(get_date(i).date()) < str(get_date(i+1).date()):  # dates: YYYY-MM-DD
            dates.append(str(get_date(i+1).date()))
        else:
            dates.append(str(get_date(i).date()))
        if str(get_date(i).hour) < str(get_date(i+1).hour):  # dates+hours: YYYY-MM-DD HH
            hours.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour))
        else:
            hours.append(str(get_date(i).date()) + " " + str(get_date(i).hour))
        if str(get_date(i).minute) < str(get_date(i+1).minute):  # dates+hours+minutes: YYYY-MM-DD HH:mm
            minutes.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute))
        else:
            minutes.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute))
        if str(get_date(i).second) < str(get_date(i+1).second):  # dates+hours+minutes+seconds: YYYY-MM-DD HH:mm:ss
            seconds.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute) + ":" + str(get_date(i+1).second))
        else:
            seconds.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute) + ":" + str(get_date(i).second))
df["dates"] = dates
df["hours"] = hours
df["minutes"] = minutes
df["seconds"] = seconds
where get_date() is simply a function returning the parsed timestamp at the given index:
import datetime as dt

def get_date(i):
    return dt.datetime.strptime(df["timestamp"][i], '%d.%m.%Y %H:%M:%S')
I basically iterate through all entries, put each date/hour/minute/second into a list, and then insert each list as a column into my dataframe.
I guess this would put me at O(n²), which is obviously not ideal.
Now, doing this on one file (~60MB, 2 million rows) takes half an hour. I personally can't think of another way to do what I want to do, so I just wanted to see if there's anything I can do to reduce the complexity.
edit: Tweaking @Chris' answer for my needs:
times = bogie_df["timestamp"]

# got an error when passing the map object directly into pd.DataFrame,
# which is why I first converted it into a list
items = ['year', 'month', 'day', 'hour', 'minute', 'second']
df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(times))), columns=items)

# for my desired YYYY-MM-DD format (though attrgetter only returns "1" for January instead of "01")
df["date"] = df['year'].map(str) + "-" + df["month"].map(str) + "-" + df["day"].map(str)
Upvotes: 2
Views: 966
Reputation: 29742
Use operator.attrgetter with pd.to_datetime:
import pandas as pd
import operator
s = pd.Series(["16.01.2019 12:52:22",
               "16.01.2019 12:52:23",
               "16.01.2019 12:52:24"])
items = ['day', 'hour', 'minute', 'second']
df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(s))), columns=items)
Output:
   day  hour  minute  second
0   16    12      52      22
1   16    12      52      23
2   16    12      52      24
Benchmark:
large_s = pd.Series(pd.date_range('16.01.2019 12:52:22', periods=2000000, freq='1s').astype(str).tolist())
# Make 2M rows of timestamp in str
%%timeit
items = ['day', 'hour', 'minute', 'second']
df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(large_s))), columns=items)
# 6.77 s ± 54.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
takes about 7 seconds.
Updated:
Instead of manually creating YYYY-mm-dd-formatted strings, let pandas do it for you.
df = pd.DataFrame(pd.to_datetime(s), columns = ['date'])
items = ['year', 'month', 'day', 'hour', 'minute', 'second']
df[items] = pd.DataFrame(list(map(operator.attrgetter(*items), df['date'])), columns=items)
Then:
df['dates'] = df['date'].astype(str).str[:10]
Output (with the month zero-padded):
0 2019-01-16
1 2019-01-16
2 2019-01-16
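With the dates column in place, grouping works directly; a minimal usage sketch (per_day is an assumed name, counting rows per calendar day):
# count entries per calendar day via the zero-padded date string
per_day = df.groupby('dates').size()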
Upvotes: 4
Reputation: 149155
You do not need that, but should instead add a single column of Timestamp type:
df['ts'] = pd.to_datetime(df.timestamp, format='%d.%m.%Y %H:%M:%S')
Then you can directly use all the time goodies on that column:
- df.ts.dt.date: gives the date as a datetime.date
- df.ts.dt.strftime(format): gives the date as a string formatted with format. For example, df.ts.dt.strftime('%Y-%m-%d %H:%M') is your "minutes" column
- df.ts.dt.floor(freq='h'): a timestamp truncated at hour level, for example for grouping (a usage sketch follows)
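A minimal sketch of the floor-based grouping, assuming a numeric value column to aggregate (the column name is hypothetical):
import pandas as pd

df = pd.DataFrame({"timestamp": ["16.01.2019 12:52:22", "16.01.2019 13:05:10"],
                   "value": [1, 2]})
df['ts'] = pd.to_datetime(df.timestamp, format='%d.%m.%Y %H:%M:%S')

# hourly buckets without materializing extra string columns
hourly = df.groupby(df.ts.dt.floor('h'))['value'].sum()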
Upvotes: 0
Reputation: 4892
Since comments are limited in length, at least the following points:
- You don't need the check if i < len(df) - 1:. Replace your range with range(len(df) - 1).
- Cache the results of the get_date function (put together below). Before the loop:
next_time = get_date(0)
Inside the loop:
current_time = next_time
next_time = get_date(i+1)
This should save you some function calls, but pandas probably has a better way to do such stuff.
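Putting both points together, a minimal sketch of the tightened loop (names taken from this answer; the comparison logic from the question goes in the body):
next_time = get_date(0)
for i in range(len(df) - 1):
    current_time = next_time      # reuse the timestamp parsed on the previous pass
    next_time = get_date(i + 1)   # so each row is parsed exactly once
    # ... compare current_time with next_time as in the original code ...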
Upvotes: 0