Reputation: 41
New to Python, so I need a bit of help.
I have a CSV file that has an id, created_at date, first/last name columns.
id created_at first_name last_name
1 1309380645 Cecelia Holt
2 1237178109 Emma Allison
3 1303585711 Desiree King
4 1231175716 Sam Davidson
I want to filter the rows between two dates lets say 03-22-2016
and 04-15-2016
(dates don't really matter), and then order those rows in ascending order (by created_at)
I know this code will just show all or most of the data
import csv
from datetime import datetime
with open("sample_data.csv") as f:
reader = csv.reader(f)
for row in reader:
print(" ".join(row[]))
But I'm not sure how to do the rest, or how to filter using this timestamp 1309380645
would using pandas
be more beneficial for me, over using csv?
Any help is much appreciated or a guide/book to read for more understanding.
Upvotes: 0
Views: 1444
Reputation: 533
I recommend using pandas since it will help you filter and perform further analysis faster.
# import pandas and datetime
import pandas as pd
import datetime
# read csv file
df = pd.read_csv("sample_data.csv")
# convert created_at from unix time to datetime
df['created_at'] = pd.to_datetime(df['created_at'], unit='s')
# contents of df at this point
# id created_at first_name last_name
# 0 1 2011-06-29 20:50:45 Cecelia Holt
# 1 2 2009-03-16 04:35:09 Emma Allison
# 2 3 2011-04-23 19:08:31 Desiree King
# 3 4 2009-01-05 17:15:16 Sam Davidson
# filtering example
df_filtered = df[(df['created_at'] <= datetime.date(2011,3,22))]
# output of df_filtered
# id created_at first_name last_name
# 1 2 2009-03-16 04:35:09 Emma Allison
# 3 4 2009-01-05 17:15:16 Sam Davidson
# filter based on dates mentioned in the question
df_filtered = df[(df['created_at'] >= datetime.date(2016,3,22)) & (df['created_at'] <= datetime.date(2016,4,15))]
# output of df_filtered would be empty at this point since the
# dates are out of this range
# sort
df_sorted = df_filtered.sort_values(['created_at'])
First thing that you need to know is that using a comparison operator on a dataframe returns a dataframe with boolean values.
df['id'] > 2
Would return
False
False
True
True
Now, pandas supports logical indexing. So if you pass a dataframe with boolean values to pandas, if will return only the ones that correspond to True.
df[df['id'] > 2]
Returns
3 1303585711 Desiree King
4 1231175716 Sam Davidson
This is how you can filter easily in pandas
Upvotes: 2
Reputation: 123473
Downloading and installing (and learning) pandas
just to do this seems like overkill.
Here's how to do it using only Python's built-in modules:
import csv
from datetime import datetime, date
import sys
start_date = date(2011, 1, 1)
end_date = date(2011, 12, 31)
# Read csv data into memory filtering rows by the date in column 2 (row[1]).
csv_data = []
with open("sample_data.csv", newline='') as f:
reader = csv.reader(f, delimiter='\t')
header = next(reader)
csv_data.append(header)
for row in reader:
creation_date = date.fromtimestamp(int(row[1]))
if start_date <= creation_date <= end_date:
csv_data.append(row)
if csv_data: # Anything found?
# Print the results in ascending date order.
print(" ".join(csv_data[0]))
# Converting the timestamp to int may not be necessary (but doesn't hurt)
for row in sorted(csv_data[1:], key=lambda r: int(r[1])):
print(" ".join(row))
Upvotes: 1