DataPlug

Reputation: 348

dynamically skiprows in reading multiple csv files

I would like to read a csv file every month from the government census website here (more specifically, the one named VIP-mf.zip). To save your eyes from a cumbersome df, you can download the zip file using this (it's 1.5 MB).

I only want to read the csv after the row that says 'DATA', which for this specific file is row 309. I can do that easily using:

import pandas as pd
df = pd.read_csv('VIP-mf.csv', skiprows=310)

The problem is that next month, when the new csv is uploaded to the website, that skiprows parameter will have to be 311, or else the file is read incorrectly. I would like the skiprows parameter to be dynamic, so that I can pick up this change every month and automatically download and read the file correctly.

I tried implementing a solution from this answer, following this article, by creating a function for the skiprows parameter:

def fetch_skip(index):
    if index == 'DATA':
        return True
    return False

df = pd.read_csv('VIP-mf.csv', skiprows=lambda x: fetch_skip(x))

but I get this error:

ParserError: Error tokenizing data. C error: Expected 4 fields in line 311, saw 7

Which I'm assuming is because the csv has "mini-tables" within a single csv, even though I only need the final "table", which has the column names:

['per_idx', 'cat_idx', 'dt_idx', 'et_idx', 'geo_idx', 'is_adj', 'val']

Thank you for your help. P.S. If there is another way to do this besides fiddling with the skiprows parameter, that also works.
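To be clear, what I'm imagining is a two-pass approach: scan the file for the 'DATA' marker first, then pass the resulting row number to skiprows. Something like this sketch (assuming the marker line always starts with 'DATA'; skiprows_for is just a name I made up):

```python
import pandas as pd

def skiprows_for(path, marker='DATA'):
    """Return how many rows read_csv should skip so that
    parsing starts on the line right after the marker."""
    with open(path) as f:
        for i, line in enumerate(f):
            if line.startswith(marker):
                return i + 1
    raise ValueError(f"marker {marker!r} not found in {path}")

# df = pd.read_csv('VIP-mf.csv', skiprows=skiprows_for('VIP-mf.csv'))
```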

Upvotes: 0

Views: 842

Answers (2)

sammywemmy

Reputation: 28709

An alternative: if you have access to a Linux machine, you can filter the data through the shell with grep and read the result in via pandas. Note that for the grep I used a count of 100_000 to get the rows after the match; you can vary that depending on how many rows you think the data will have.

I also assume (and I may be wrong) that the header will always start with per_idx,cat_idx.., which is what I used in the grep:

from io import StringIO
import pandas as pd
import subprocess

# grab the header line and (up to) the next 100,000 lines after it
shell_string = """grep -A100000 "per_idx,cat_idx" /home/sam/Downloads/VIP-mf/VIP-mf.csv"""
data = subprocess.run(shell_string,
                      shell=True,
                      capture_output=True,
                      text=True).stdout
df = pd.read_csv(StringIO(data))
df.head(5)
   per_idx  cat_idx  dt_idx  et_idx  geo_idx  is_adj      val
0        1        1       1       0        1       0  59516.0
1        1        2       1       0        1       0  25972.0
2        1        3       1       0        1       0  33545.0
3        1        4       1       0        1       0    989.0
4        1        5       1       0        1       0   3763.0



df.shape
   (65472, 7)
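If grep isn't available (e.g. on Windows), the same match-then-read idea can be expressed in pure Python with itertools.dropwhile; this is a sketch under the same assumption about the header line:

```python
from io import StringIO
from itertools import dropwhile
import pandas as pd

def read_after_header(path, header_prefix='per_idx,cat_idx'):
    with open(path) as f:
        # drop every line until we reach one starting with the header
        # prefix, then hand the remainder (header included) to pandas
        kept = dropwhile(lambda l: not l.startswith(header_prefix), f)
        return pd.read_csv(StringIO(''.join(kept)))

# df = read_after_header('/home/sam/Downloads/VIP-mf/VIP-mf.csv')
```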

Upvotes: 1

DataPlug

Reputation: 348

I found the answer in another question, here. I had to make a slight change:

import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # scan forward until we reach the marker line
        while not cur_line.startswith(line):
            if cur_line == '':  # hit EOF without finding the marker
                raise ValueError(f"Marker {line!r} not found")
            pos = f.tell()
            cur_line = f.readline()
        # rewind to the start of the marker line and hand the
        # rest of the file to pandas
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

Then I used (skiprows=1 skips the 'DATA' marker line itself, so the header row comes next):

df = skip_to('path_to_file.csv',"DATA", skiprows=1)

Upvotes: 2
