Suleka_28

Reputation: 2919

Read and reverse data chunk by chunk from a csv file and copy to a new csv file

Assume I'm dealing with a very large CSV file, so I can only read the data into memory chunk by chunk. The expected flow of events is as follows:

1) Read a chunk (e.g. 10 rows) of data from the csv using pandas.

2) Reverse the order of the data.

3) Copy each row to a new csv file in reverse, so each chunk (10 rows) is written to the csv from the beginning in reversed order.

In the end, the csv file should be in reversed order, and this should be done without loading the entire file into memory (I'm on Windows).

I am doing time series forecasting, so I need the data to run from oldest to latest (1st row = oldest entry). I can't load the entire file into memory, so I'm looking for a way to do it one chunk at a time, if possible.

The dataset I tried it on is train.csv of the Rossmann dataset from Kaggle. You can get it from this GitHub repo.

My attempt does not copy the rows into the new csv file properly.

Shown below is my code:

import pandas as pd
import csv

def reverse():

    fields = ["Store","DayOfWeek","Date","Sales","Customers","Open","Promo","StateHoliday",
              "SchoolHoliday"]
    with open('processed_train.csv', mode='a') as stock_file:
        writer = csv.writer(stock_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(fields)

    for chunk in pd.read_csv("train.csv", chunksize=10):
        store_data = chunk.reindex(index=chunk.index[::-1])
        append_data_csv(store_data)

def append_data_csv(store_data):
    with open('processed_train.csv', mode='a') as store_file:
        writer = csv.writer(store_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        for index, row in store_data.iterrows():
            print(row)
            writer.writerow([row['Store'], row['DayOfWeek'], row['Date'], row['Sales'],
                             row['Customers'], row['Open'], row['Promo'],
                             row['StateHoliday'], row['SchoolHoliday']])

reverse()

Thank you in advance.

Upvotes: 15

Views: 3040

Answers (5)

gustavovelascoh

Reputation: 1228

Using bash, you can tail the whole file except the first line, then reverse it with tac and store it:

tail -n +2 train.csv  | tac > train_rev.csv

If you want to keep the header in the reversed file, write it first and then append the reversed content:

head -1 train.csv > train_rev.csv; tail -n +2 train.csv  | tac >> train_rev.csv
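For readers on Windows, where tail and tac aren't available by default, a rough Python sketch of the same keep-header-then-reverse pipeline might look like the following (the function name, filenames, and chunk size are placeholders, not part of this answer):

```python
# A rough Windows-friendly equivalent of the head/tail/tac pipeline:
# keep the header line, buffer up to chunk_lines lines at a time,
# write each buffer reversed to a temporary file, then concatenate
# the temporary files in reverse order.
import os
import tempfile

def reverse_csv(src, dst, chunk_lines=10000):
    tmp_names = []
    with open(src) as fin:
        header = fin.readline()          # equivalent of head -1
        buf = []
        for line in fin:                 # equivalent of tail -n +2
            buf.append(line)
            if len(buf) == chunk_lines:
                fd, name = tempfile.mkstemp(suffix=".csv")
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(reversed(buf))
                tmp_names.append(name)
                buf = []
        if buf:                          # leftover partial chunk
            fd, name = tempfile.mkstemp(suffix=".csv")
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(reversed(buf))
            tmp_names.append(name)
    with open(dst, "w") as fout:         # equivalent of tac >> output
        fout.write(header)
        for name in reversed(tmp_names):
            with open(name) as tmp:
                fout.write(tmp.read())
            os.remove(name)
```

Unlike tac in a pipe, this stays memory-bounded by spilling reversed chunks to temporary files before stitching them back together.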

Upvotes: 6

jpp

Reputation: 164773

If you have sufficient hard disk space, you can read in chunks, reverse and store. Then pick up the stored chunks in reverse order and write to a new csv file.

Below is an example with Pandas which also uses pickle (for performance efficiency) and gzip (for storage efficiency).

import pandas as pd, numpy as np

# create a dataframe for demonstration purposes
df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
df.to_csv('file.csv', index=False)

# number of rows we want to chunk by
n = 3

# iterate chunks, output to pickle files
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
    chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03}.pkl.gzip', compression='gzip')

# open file in append mode and write chunks in reverse
# idx stores the index of the last pickle file written
with open('out.csv', 'a') as fout:
    for i in range(idx, -1, -1):
        chunk_pkl = pd.read_pickle(f'file_pkl_{i:03}.pkl.gzip', compression='gzip')
        chunk_pkl.to_csv(fout, index=False, header=(i == idx))

# read new file to check results
df_new = pd.read_csv('out.csv')

print(df_new)

    0   1   2   3   4
0  40  41  42  43  44
1  35  36  37  38  39
2  30  31  32  33  34
3  25  26  27  28  29
4  20  21  22  23  24
5  15  16  17  18  19
6  10  11  12  13  14
7   5   6   7   8   9
8   0   1   2   3   4

Upvotes: 0

Mark Warburton

Reputation: 517

This does exactly what you are requesting, but without Pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing in the file system, using a series of chunk files that are aggregated at the end into the outtest.csv file. By changing maxLines, you can trade the number of chunk files produced against the RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header on the first line, set keepHeader to True; if set to False, it reverses the entire file, including the first line.

For kicks, I ran this on an old Raspberry Pi using a 128GB flash drive and a 6MB csv test file, and I thought something had gone wrong because it returned almost immediately, so it's fast even on slower hardware. It imports only one standard Python library function (remove), so it's very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines in the data; for that use case, pandas would be the better choice for reading the chunks.

from os import remove

def writechunk(fileCounter, reverseString):
    outFile = 'tmpfile' + str(fileCounter) + '.csv'
    with open(outFile, 'w') as outfp:
        outfp.write(reverseString)
    return

def main():
    inFile = 'intest.csv'
    outFile = 'outtest.csv'
    # This is our chunk expressed in lines
    maxLines = 10
    # Is there a header line we want to keep at the top of the output file?
    keepHeader = True

    fileCounter = 0
    lineCounter = 0
    with open(inFile) as infp:
        reverseString = ''
        line = infp.readline()
        if (line and keepHeader):
            headerLine = line
            line = infp.readline()
        while (line):
            lineCounter += 1
            reverseString = line + reverseString
            if (lineCounter == maxLines):
                fileCounter += 1
                lineCounter = 0
                writechunk(fileCounter, reverseString)
                reverseString = ''
            line = infp.readline()
    # Write any leftovers to a chunk file
    if (lineCounter != 0):
        fileCounter += 1
        writechunk(fileCounter,reverseString)
    # Read the chunk files backwards and append each to the outFile
    with open(outFile, 'w') as outfp:
        if (keepHeader):
            outfp.write(headerLine)
        while (fileCounter > 0):
            chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
            with open(chunkFile, 'r') as infp:
                outfp.write(infp.read())
            remove(chunkFile)
            fileCounter -= 1

if __name__ == '__main__':
    main()

Upvotes: 0

Abhishek Dujari

Reputation: 2453

You have repeated code blocks and you are not taking advantage of pandas at all.

What @sujay kumar pointed out is correct; I would read that more closely.

The file isn't big at all. I use OHLCV tick data that is in GBs without issues. If you use pandas.read_csv() you don't have to do chunked transfer. Sure, it will take time, but it will work fine, unless you are going into terabytes; I have not tested that.

When you read_csv(), you don't specify any index. If you did, you could call sort_index(), with or without ascending=False depending on the order you want.

Pandas can write CSV too, please use that instead. I am pasting some example code for you to put together.

df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date", usecols=["Date", "Adj Close"], na_values=["nan"])

Sorting a Series:

s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.sort_index()

Note: if you stick to Pandas and its functions, you will be running already-optimized code that doesn't require entire files to be loaded into memory. It's so easy that it's almost like cheating :)
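Put together, the whole-file approach this answer describes might look roughly like this (a sketch, not the answer's own code, assuming a parseable Date column as in the Rossmann data and assuming the file fits in memory, as this answer argues it will):

```python
# Sketch of the whole-file approach suggested above: read the CSV with
# a date index, sort it ascending (oldest entry first), and let pandas
# write the result back out.
import pandas as pd

def reverse_by_date(src, dst):
    df = pd.read_csv(src, parse_dates=True, index_col="Date")
    df = df.sort_index(ascending=True)   # oldest entry first
    df.to_csv(dst)
```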

Upvotes: -3

BernardL

Reputation: 5464

I would not recommend using pandas for parsing or streaming any files, as you would only be introducing additional overhead. The best way to do it is to read the file from the bottom up. A big part of this code actually comes from here: it takes in a file and returns its lines in reverse as a generator, which I believe is what you want.

I just tested it with your train.csv file from the provided link and wrote the results to a new file.

import os

def reverse_readline(filename, buf_size=8192):
    """a generator that returns the lines of a file in reverse order"""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # the first line of the buffer is probably not a complete line so
            # we'll save it and append it to the last line of the next buffer
            # we read
            if segment is not None:
                # if the previous chunk starts right from the beginning of a line,
                # do not concatenate the segment to the last line of the new chunk;
                # instead, yield the segment first
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment

reverse_gen = reverse_readline('train.csv')

with open('rev_train.csv','w') as f:
    for row in reverse_gen:
        f.write('{}\n'.format(row))

It basically reads the file in reverse until it finds a newline, then yields a line, working from the bottom to the top. A pretty interesting way of doing it.

Upvotes: 0
