Happy Coder

Reputation: 4692

Read CSV file with limit and offset

I am using the following code to read a CSV file into a list of dictionaries.

import csv

file_name = path + '/' + file.filename
with open(file_name, newline='') as csv_file:
    csv_dict = [dict(row) for row in csv.DictReader(csv_file)]
    for item in csv_dict:
        call_api(item)

Now this reads the file and calls the function for each row. As the number of rows grows, the number of calls grows too. It is also not possible to load all the contents into memory and split and call the API from there, because the data is large. So I would like an approach where the file is read using a limit and an offset, as in SQL queries. But how can this be done in Python? I don't see any option in the csv documentation to specify the number of rows to read or to skip rows. If someone can suggest a better approach, that will be fine too.

Upvotes: 2

Views: 6678

Answers (2)

FredrikHedman

Reputation: 1253

A solution could be to use pandas to read the csv:

import pandas as pd

file_name = 'data.csv'
OFFSET = 10
LIMIT = 24
CHSIZE = 6
header = list('ABC')
reader = pd.read_csv(file_name, sep=',',
                     header=None, names=header,          # Header 'A', 'B', 'C'
                     usecols=[0, 1, 4],                  # Select some columns
                     skiprows=lambda idx: idx < OFFSET,  # Skip lines
                     chunksize=CHSIZE,                   # Chunk reading
                     nrows=LIMIT)

for df_chunk in reader:
    # Each df_chunk is a DataFrame, so
    # an adapted api may be needed to
    # call_api(item)
    for row in df_chunk.itertuples():
        print(row._asdict())
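The adaptation mentioned in the comment could look like this minimal sketch (assuming `call_api` accepts a plain dict, as in the question; `process_chunks` and its parameters are hypothetical names, and `skiprows` is given a range so the header row is preserved):

```python
import pandas as pd

def process_chunks(file_name, offset, limit, chunksize, call_api):
    # Hypothetical helper: stream `limit` data rows starting `offset` rows
    # after the header, handing each row to the API as a plain dict.
    reader = pd.read_csv(file_name,
                         skiprows=range(1, offset + 1),  # keep line 0 (the header)
                         nrows=limit,
                         chunksize=chunksize)
    for df_chunk in reader:
        # to_dict(orient='records') yields one dict per row
        for item in df_chunk.to_dict(orient='records'):
            call_api(item)
```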

Upvotes: 1

Patrick Artner

Reputation: 51683

You can call your API directly with just one row in memory:

with open(file_name, newline='') as csv_file:
    for row in csv.DictReader(csv_file): 
        call_api(row)        # call api with row-dictionary, don't persist all to memory 

You can skip rows by calling next() on the reader before the for loop (create the DictReader first, so the header line is not skipped by accident):

with open(file_name, newline='') as csv_file:
    reader = csv.DictReader(csv_file)   # consumes the header line
    for _ in range(10):                 # skip the first 10 data rows
        next(reader)
    for row in reader:
        call_api(row)

You can skip rows in between using continue:

with open(file_name, newline='') as csv_file:
    for i, row in enumerate(csv.DictReader(csv_file)):
        if i % 2 == 0:
            continue         # skip every other row
        call_api(row)

You can simply count parsed rows and break after n rows are done:

n = 0
with open(file_name, newline='') as csv_file:
    for row in csv.DictReader(csv_file):
        if n == 50:
            break            # stop after 50 rows
        call_api(row)
        n += 1

and you can combine these approaches to skip 100 rows and take 200, only taking every 2nd one; that mimics limit and offset, with a modulo hack on the row number thrown in.
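As a stdlib-only sketch of that combination (assuming a `call_api` function as in the question; `call_api_range` is a hypothetical name), `itertools.islice` gives you the limit and offset directly:

```python
import csv
from itertools import islice

def call_api_range(file_name, offset, limit, call_api, step=1):
    """Sketch: take a window of `limit` data rows starting `offset` rows
    after the header, and call the API for every `step`-th row in it."""
    with open(file_name, newline='') as csv_file:
        reader = csv.DictReader(csv_file)           # header is consumed here
        for row in islice(reader, offset, offset + limit, step):
            call_api(row)                           # one row in memory at a time
```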

Or you can use something that's great with CSV, like pandas (see FredrikHedman's answer).

Upvotes: 3
