Mr.Monsieur
Mr.Monsieur

Reputation: 41

CSV filtering and ascending order

New to Python, so I need a bit of help.

I have a CSV file that has an id, created_at date, first/last name columns.

id  created_at  first_name last_name
1   1309380645  Cecelia    Holt
2   1237178109  Emma       Allison
3   1303585711  Desiree    King
4   1231175716  Sam        Davidson

I want to filter the rows between two dates lets say 03-22-2016 and 04-15-2016(dates don't really matter), and then order those rows in ascending order (by created_at)

I know this code will just show all or most of the data

import csv
from datetime import datetime

with open("sample_data.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        print(" ".join(row[]))

But I'm not sure how to do the rest, or how to filter using this timestamp 1309380645 would using pandas be more beneficial for me, over using csv?

Any help is much appreciated or a guide/book to read for more understanding.

Upvotes: 0

Views: 1444

Answers (2)

shish023
shish023

Reputation: 533

I recommend using pandas since it will help you filter and perform further analysis faster.

# import pandas and datetime
import pandas as pd
import datetime

# read csv file
df = pd.read_csv("sample_data.csv")

# convert created_at from unix time to datetime
df['created_at'] = pd.to_datetime(df['created_at'], unit='s')

# contents of df at this point
#   id          created_at first_name last_name
# 0   1 2011-06-29 20:50:45    Cecelia      Holt
# 1   2 2009-03-16 04:35:09       Emma   Allison
# 2   3 2011-04-23 19:08:31    Desiree      King
# 3   4 2009-01-05 17:15:16        Sam  Davidson

# filtering example
df_filtered = df[(df['created_at'] <= datetime.date(2011,3,22))]

# output of df_filtered
#    id          created_at first_name last_name
# 1   2 2009-03-16 04:35:09       Emma   Allison
# 3   4 2009-01-05 17:15:16        Sam  Davidson

# filter based on dates mentioned in the question
df_filtered = df[(df['created_at'] >= datetime.date(2016,3,22)) & (df['created_at'] <= datetime.date(2016,4,15))]

# output of df_filtered would be empty at this point since the 
# dates are out of this range

# sort
df_sorted = df_filtered.sort_values(['created_at'])

Explanation of filtering in pandas:

First thing that you need to know is that using a comparison operator on a dataframe returns a dataframe with boolean values.

df['id'] > 2

Would return

False
False
 True
 True

Now, pandas supports logical indexing. So if you pass a dataframe with boolean values to pandas, if will return only the ones that correspond to True.

df[df['id'] > 2]

Returns

3   1303585711  Desiree    King
4   1231175716  Sam        Davidson

This is how you can filter easily in pandas

Upvotes: 2

martineau
martineau

Reputation: 123473

Downloading and installing (and learning) pandas just to do this seems like overkill.

Here's how to do it using only Python's built-in modules:

import csv
from datetime import datetime, date
import sys

start_date = date(2011, 1, 1)
end_date = date(2011, 12, 31)

# Read csv data into memory filtering rows by the date in column 2 (row[1]).
csv_data = []
with open("sample_data.csv", newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    csv_data.append(header)
    for row in reader:
        creation_date = date.fromtimestamp(int(row[1]))
        if start_date <= creation_date <= end_date:
            csv_data.append(row)

if csv_data:  # Anything found?
    # Print the results in ascending date order.
    print(" ".join(csv_data[0]))
    # Converting the timestamp to int may not be necessary (but doesn't hurt)
    for row in sorted(csv_data[1:], key=lambda r: int(r[1])): 
        print(" ".join(row))

Upvotes: 1

Related Questions