user3495042

Reputation: 327

Read and concatenate 3,000 files into a pandas DataFrame, starting at a specific value (Python)

I have 3,000 .dat files that I am reading and concatenating into one pandas DataFrame. They all have the same format (4 columns, no header), except that some of them have a description at the beginning of the file while others don't. To concatenate the files, I need to get rid of those first rows. The skiprows option of pandas.read_csv() doesn't apply directly here, because the number of rows to skip varies from one file to another (btw, I use pandas.read_csv() and not pandas.read_table() because the files are comma-separated).

However, the first value after the rows I am trying to omit is identical for all 3,000 files. This value is "2004", which is the first data point of my dataset.

Is there an equivalent to skiprows where I could say something like "start reading the file at '2004' and skip everything before that" (for each of the 3,000 files)?

I am really out of luck at this point and would appreciate some help,

Thank you!

Upvotes: 2

Views: 535

Answers (3)

HYRY

Reputation: 97321

Use a skip_to() function like this one:

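import pandas as pd

def skip_to(f, text):
    # advance f to the start of the first line that begins with text
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            # end of file reached without finding text
            return False
        if line.startswith(text):
            # rewind so that the matching line is read as data
            f.seek(last_pos)
            return True


with open("tmp.txt") as f:
    if skip_to(f, "2004"):
        df = pd.read_csv(f, header=None)
        print(df)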
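
To try this without touching the disk, here is a minimal sketch that runs skip_to() against an in-memory file. The description lines and the numbers in the sample are made up for illustration; only the leading "2004" comes from the question.

```python
import io

import pandas as pd


def skip_to(f, text):
    """Advance f to the start of the first line that begins with text."""
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            return False  # end of file, no matching line
        if line.startswith(text):
            f.seek(last_pos)  # rewind so the matching line is read as data
            return True


# Hypothetical file contents: two description lines, then the data rows.
sample = io.StringIO(
    "Station description\n"
    "units: arbitrary\n"
    "2004,1,2,3\n"
    "2005,4,5,6\n"
)

df = None
if skip_to(sample, "2004"):
    # read_csv starts reading from the handle's current position
    df = pd.read_csv(sample, header=None)
```

Because read_csv() is handed the already-positioned file object rather than a filename, the description lines are never seen by the parser.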

Upvotes: 1

DSM

Reputation: 353229

Probably not worth trying to be clever here; if you have a handy criterion, you might as well use it to figure out what skiprows is, i.e. something like

import pandas as pd
import csv

def find_skip(filename):
    with open(filename, newline="") as fp:
        # (use open(filename, "rb") in Python 2)
        reader = csv.reader(fp)
        for i, row in enumerate(reader):
            # blank lines come back as empty lists, so check row first
            if row and row[0] == "2004":
                return i
    # no row starting with "2004" was found
    return None

# filenames is assumed to be a list of the paths to the .dat files
for filename in filenames:
    skiprows = find_skip(filename)
    if skiprows is None:
        raise ValueError("something went wrong in determining skiprows!")
    this_df = pd.read_csv(filename, skiprows=skiprows, header=None)
    # do something here, e.g. append this_df to a list and concatenate it after the loop
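
The "append to a list and concatenate after the loop" step could look roughly like the sketch below. The load_all() helper, the marker parameter, and the two throwaway files are assumptions made for illustration; they stand in for the real 3,000 .dat files.

```python
import csv
import glob
import os
import tempfile

import pandas as pd


def find_skip(filename, marker="2004"):
    """Return the index of the first row whose first field equals marker."""
    with open(filename, newline="") as fp:
        for i, row in enumerate(csv.reader(fp)):
            if row and row[0] == marker:
                return i
    return None


def load_all(filenames):
    """Read each file from its marker row onward and stack the results."""
    frames = []
    for filename in filenames:
        skiprows = find_skip(filename)
        if skiprows is None:
            raise ValueError("no row starting with 2004 in %s" % filename)
        frames.append(pd.read_csv(filename, skiprows=skiprows, header=None))
    # a single concat at the end avoids growing a DataFrame inside the loop
    return pd.concat(frames, ignore_index=True)


# Two throwaway files: one with a description, one without.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "a.dat"), "w") as fp:
    fp.write("some description\nmore text\n2004,1,2,3\n2005,4,5,6\n")
with open(os.path.join(tmpdir, "b.dat"), "w") as fp:
    fp.write("2004,7,8,9\n")

combined = load_all(sorted(glob.glob(os.path.join(tmpdir, "*.dat"))))
```

Collecting the frames in a list and calling pd.concat() once is generally cheaper than concatenating inside the loop, which matters with a few thousand files.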

Upvotes: 2

bsoist

Reputation: 785

You could just loop through them and skip every line that doesn't start with 2004.

Something like ...

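import io
import pandas

with open("tmp.txt") as f:  # "tmp.txt" stands in for one of the .dat files
    # keep only the lines that start with 2004
    lines = [line for line in f if line.startswith("2004")]
df = pandas.read_csv(io.StringIO("".join(lines)), header=None)
# whatever else you need here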

Upvotes: 2
