Shahaf Ben
Shahaf Ben

Reputation: 11

Fastest Way to read CSV to list of tuples with condition/filter and column type assignment? (Python)

I need to read a CSV into a list of tuples while conditioning the list on a value (>=0.75) and changing the columns to different typing. Please note you cannot!! use pandas, NO PANDAS

I'm trying to figure out how to do it the FASTEST method possible.

This is how I did it (put i think it is not efficient):

def load_csv_to_list(path):
  with open(path) as csv_file:
    table = list(reader(csv_file))
  lst = [table[0]]
  count = 0
  for row in table[1:]:
    if float(row[2]) >= 0.75:
      date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
      row = (date,int(row[1]),float(row[2]))
      lst.append(row)
  return (lst)

start = timeit.timeit()
load_csv_to_list(path)
end = timeit.timeit()
print(start - end)

answer : 0.00013872199997422285

Upvotes: 1

Views: 148

Answers (1)

Dan Nagle
Dan Nagle

Reputation: 5425

The original code performs the same float(row[2]) conversion twice. In my testing, assigning the converted value to a variable and reusing it later gives a slight performance gain. Utilising the walrus operator :=, introduced in Python 3.8 gives a further improvement. Using batch processing or memory-mapping the data file gives the best performance.

def load_variable(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        float_two = float(row[2])
        if float_two >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst

def load_walrus(path):
    with open(path) as csv_file:
        table = list(reader(csv_file))
    lst = [table[0]]
    for row in table[1:]:
        if (float_two := float(row[2])) >= 0.75:
            date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
            row = (date, int(row[1]), float_two)
            lst.append(row)
    return lst

Timings to load a csv file with 1,000,000 rows:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |

As a further experiment I implemented a function to batch process the data.

def batch_walrus(path, batch_size=1000):
    lst = []
    with open(path) as csv_file:
        csv_reader = reader(csv_file)
        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list
        batch = []
        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                batch.append((date, int(row[1]), two))
            # If batch size is reached or end of file, process the batch
            if len(batch) == batch_size or not row:
                lst.extend(batch)
                batch = []
    return lst

Updated timing information:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
batch_walrus     | 5.69s   | 5.89s   | 5.79s   |

Python's mmap module provides memory-mapped file I/O. It takes advantage of lower-level operating system functionality to read files as if they were one large string/array. This version of the function decodes the mmapped_file content into a string using decode("utf-8") before creating the csv.reader.

from csv import reader
from datetime import datetime
import mmap

def load_mmap_walrus(path):
    lst = []
    with open(path, "r") as csv_file:
        # Memory-map the file, size 0 means the entire file
        with mmap.mmap(csv_file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
            # Decode the bytes-like object to a string
            content = mmapped_file.read().decode("utf-8")

        # Create a CSV reader from the decoded string
        csv_reader = reader(content.splitlines())

        header = next(csv_reader)  # Read the header
        lst.append(header)  # Add the header to the result list

        for row in csv_reader:
            # Check the condition and convert the date
            if (two := float(row[2])) >= 0.75:
                date = datetime.strptime(row[0], "%d/%m/%Y").strftime("%d/%m/%Y")
                lst.append((date, int(row[1]), two))

        # Close the memory-mapped file
        mmapped_file.close()

    return lst

Updated timing information:

Function Name    | Fastest | Slowest | Average |
load_csv_to_list | 6.36s   | 6.69s   | 6.47s   |
load_variable    | 6.10s   | 6.65s   | 6.44s   |
load_walrus      | 5.95s   | 6.57s   | 6.29s   |
batch_walrus     | 5.69s   | 5.89s   | 5.79s   |
load_mmap_walrus | 5.49s   | 5.68s   | 5.57s   |

Code used to generate 1,000,000 rows of csv data:

import csv
import random
from datetime import datetime, timedelta

# Function to generate a random date within a range
def random_date(start_date, end_date):
    delta = end_date - start_date
    random_days = random.randint(0, delta.days)
    return start_date + timedelta(days=random_days)

# Generate sample data
start_date = datetime(2000, 1, 1)
end_date = datetime(2023, 12, 31)

with open("sample_data.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Date", "Integer", "Float"])
    for _ in range(1_000_000):
        date = random_date(start_date, end_date).strftime("%d/%m/%Y")
        integer = random.randint(0, 100)
        float_num = round(random.uniform(0, 1), 2)
        writer.writerow([date, integer, float_num])

Upvotes: 0

Related Questions