Reputation: 348
I would like to read a csv file every month from the government census website here, more specifically the one named VIP-mf.zip. To save your eyes from a cumbersome df, you can download the zip file using this (it's 1.5MB).
I only want to read the csv after the row that says 'DATA', which for this specific file is row 309. I can do that easily using:
import pandas as pd
df = pd.read_csv('VIP-mf.csv', skiprows=310)
The problem is that next month, when the new csv is updated on the website, that skiprows parameter will have to be 311, or else it will be read incorrectly. I would like a dynamic skiprows parameter that captures this change every month, so I can automatically download the file and read it correctly.
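To make the idea concrete, this is roughly what I'm picturing (a sketch only; the offset from the 'DATA' marker line to the header row may need adjusting for the real file):

import pandas as pd

def find_marker_row(path, marker='DATA'):
    # Return the 0-based line number of the first line that starts with the marker
    with open(path) as f:
        for i, line in enumerate(f):
            if line.startswith(marker):
                return i
    raise ValueError(f"{marker!r} not found in {path}")

marker_row = find_marker_row('VIP-mf.csv')
# Skip everything up to and including the marker so the header row is read next;
# the exact offset may differ if there are blank lines after 'DATA'
df = pd.read_csv('VIP-mf.csv', skiprows=marker_row + 1)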
I tried implementing a solution from this answer, using this article, by creating a function for the skiprows parameter:
def fetch_skip(index):
    if index == 'DATA':
        return True
    return False

df = pd.read_csv('VIP-mf.csv', skiprows=lambda x: fetch_skip(x))
but I get this error:
ParserError: Error tokenizing data. C error: Expected 4 fields in line 311, saw 7
I'm assuming this is because the csv has "mini-tables" within the single file, even though I only need the final "table", which has the column names:
['per_idx', 'cat_idx', 'dt_idx', 'et_idx', 'geo_idx', 'is_adj', 'val']
Thank you for your help.
P.S. If there is another way to do this besides fiddling with the skiprows parameter, that also works.
Upvotes: 0
Views: 842
Reputation: 28709
As an alternative, if you have access to a Linux machine, you could parse the data through the shell: use grep to pull out the data, then read it in via pandas. Note that for the grep I used a count of 100_000 to get the rows after the match; you can vary that depending on how many rows you think the data will have. I also assume (and I may be wrong) that the headers will always start with per_idx,cat_idx.., which is what I used in grepping:
from io import StringIO
import pandas as pd
import subprocess
shell_string = """grep -A100000 "per_idx,cat_idx" /home/sam/Downloads/VIP-mf/VIP-mf.csv"""
data = subprocess.run(shell_string,
                      shell=True,
                      capture_output=True,
                      text=True).stdout
df = pd.read_csv(StringIO(data))
df.head(5)
per_idx cat_idx dt_idx et_idx geo_idx is_adj val
0 1 1 1 0 1 0 59516.0
1 1 2 1 0 1 0 25972.0
2 1 3 1 0 1 0 33545.0
3 1 4 1 0 1 0 989.0
4 1 5 1 0 1 0 3763.0
df.shape
(65472, 7)
Upvotes: 1
Reputation: 348
I found the answer in another question here. I had to make a slight change.
import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # Read ahead until the marker line is found, remembering where it starts
        while not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        # Rewind to the start of the marker line and let pandas read from there
        f.seek(pos)
        return pd.read_csv(f, **kwargs)
Then I used:
df = skip_to('path_to_file.csv', "DATA", skiprows=1)
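For the monthly automation, something along these lines could sit in front of skip_to (a sketch only; the URL is just a placeholder for the real VIP-mf.zip link, and the CSV name inside the archive is looked up rather than assumed):

import io
import zipfile

import requests

url = 'https://example.com/path/to/VIP-mf.zip'  # placeholder for the actual census link
resp = requests.get(url)
resp.raise_for_status()

with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    # Find the csv inside the archive and extract it to the working directory
    csv_name = next(n for n in zf.namelist() if n.lower().endswith('.csv'))
    zf.extract(csv_name, path='.')

df = skip_to(csv_name, 'DATA', skiprows=1)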
Upvotes: 2