Reputation: 11
I need to read a CSV into a list of tuples, keeping only rows where a value is >= 0.75 and converting the columns to different types. Please note: I cannot use pandas, NO PANDAS.
I'm trying to figure out the FASTEST way to do it.
This is how I did it (but I think it is not efficient):
from csv import reader
from datetime import datetime
import timeit

def load_csv_to_list(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if float(row[2]) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float(row[2]))
            lst.append(row)
    return lst

start = timeit.default_timer()
load_csv_to_list(path)
end = timeit.default_timer()
print(end - start)
Output: 0.00013872199997422285
Upvotes: 1
Views: 148
Reputation: 5425
The original code performs the same float(row[2]) conversion twice. In my testing, assigning the converted value to a variable and reusing it gives a slight performance gain. Using the walrus operator :=, introduced in Python 3.8, gives a further improvement. Batch processing or memory-mapping the data file gives the best performance.
from csv import reader
from datetime import datetime

def load_variable(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        float_two = float(row[2])
        if float_two >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst
def load_walrus(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if (float_two := float(row[2])) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst
Timings to load a csv file with 1,000,000 rows:
Function Name | Fastest | Slowest | Average |
load_csv_to_list | 6.36s | 6.69s | 6.47s |
load_variable | 6.10s | 6.65s | 6.44s |
load_walrus | 5.95s | 6.57s | 6.29s |
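For reference, figures like these can be collected with timeit.repeat; a minimal harness along these lines (my sketch, assuming the loader functions are defined in the same module and that the generated sample_data.csv from the end of this answer exists) reports the fastest, slowest and average of several runs:

import timeit

# Timing harness sketch: assumes the loaders above are defined in this
# module and sample_data.csv exists (see the generator at the end).
for func in (load_csv_to_list, load_variable, load_walrus):
    times = timeit.repeat(lambda: func("sample_data.csv"), repeat=5, number=1)
    print(f"{func.__name__}: fastest={min(times):.2f}s, "
          f"slowest={max(times):.2f}s, average={sum(times)/len(times):.2f}s")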
As a further experiment I implemented a function to batch process the data.
def batch_walrus(path, batch_size=1000):
    lst = []
    with open(path) as csv_file:
        csv_reader = reader(csv_file)
        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list
        batch = []
        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                batch.append((date, int(row[1]), two))
            # If batch size is reached, move the batch into the result list
            if len(batch) == batch_size:
                lst.extend(batch)
                batch = []
        # Flush any rows left in the final, partially filled batch
        lst.extend(batch)
    return lst
Updated timing information:
Function Name | Fastest | Slowest | Average |
load_csv_to_list | 6.36s | 6.69s | 6.47s |
load_variable | 6.10s | 6.65s | 6.44s |
load_walrus | 5.95s | 6.57s | 6.29s |
batch_walrus | 5.69s | 5.89s | 5.79s |
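Another variation worth trying (not timed here, just a sketch) folds the filtering loop into a list comprehension, which avoids the repeated lst.append method lookups:

def load_comprehension(path):
    with open(path) as csv_file:
        csv_reader = reader(csv_file)
        header = next(csv_reader)
        # The name bound by the walrus in the filter clause is visible
        # in the element expression (Python 3.8+)
        return [header] + [
            (datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y"),
             int(row[1]), two)
            for row in csv_reader
            if (two := float(row[2])) >= 0.75
        ]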
Python's mmap module provides memory-mapped file I/O. It takes advantage of lower-level operating system functionality to read a file as if it were one large string/array. This version of the function decodes the mmapped_file content into a string with decode("utf-8") before creating the csv.reader.
from csv import reader
from datetime import datetime
import mmap

def load_mmap_walrus(path):
    lst = []
    with open(path, "rb") as csv_file:
        # Memory-map the file; size 0 means the entire file
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
            # Decode the bytes-like object to a string
            content = mmapped_file.read().decode("utf-8")
            # Create a CSV reader from the decoded string
            csv_reader = reader(content.splitlines())
            header = next(csv_reader)  # Read the header
            lst.append(header)  # Add the header to the result list
            for row in csv_reader:
                # Check the condition and convert the date
                if (two := float(row[2])) >= 0.75:
                    date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                    lst.append((date, int(row[1]), two))
            # No explicit close() needed: the with statement closes the mapping
    return lst
Updated timing information:
Function Name | Fastest | Slowest | Average |
load_csv_to_list | 6.36s | 6.69s | 6.47s |
load_variable | 6.10s | 6.65s | 6.44s |
load_walrus | 5.95s | 6.57s | 6.29s |
batch_walrus | 5.69s | 5.89s | 5.79s |
load_mmap_walrus | 5.49s | 5.68s | 5.57s |
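A further variation (untimed, just a sketch) avoids materializing the whole decoded string: csv.reader accepts any iterable of strings, so the mapped file can be decoded one line at a time through mmap's readline():

from csv import reader
from datetime import datetime
import mmap

def load_mmap_lines(path):
    lst = []
    with open(path, "rb") as csv_file:
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # iter() with a b"" sentinel keeps calling readline() until EOF
            lines = (line.decode("utf-8") for line in iter(mm.readline, b""))
            csv_reader = reader(lines)
            lst.append(next(csv_reader))  # header
            for row in csv_reader:
                if (two := float(row[2])) >= 0.75:
                    date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                    lst.append((date, int(row[1]), two))
    return lst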
Code used to generate 1,000,000 rows of csv data:
import csv
import random
from datetime import datetime, timedelta

# Function to generate a random date within a range
def random_date(start_date, end_date):
    delta = end_date - start_date
    random_days = random.randint(0, delta.days)
    return start_date + timedelta(days=random_days)

# Generate sample data
start_date = datetime(2000, 1, 1)
end_date = datetime(2023, 12, 31)

with open("sample_data.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Date", "Integer", "Float"])
    for _ in range(1_000_000):
        date = random_date(start_date, end_date).strftime("%d/%m/%Y")
        integer = random.randint(0, 100)
        float_num = round(random.uniform(0, 1), 2)
        writer.writerow([date, integer, float_num])
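The generated file then looks like this (illustrative rows; the values differ from run to run):

Date,Integer,Float
14/03/2011,42,0.87
02/09/2005,7,0.13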
Upvotes: 0