Reputation: 348
I would like to read a csv file every month from the government census website here, more specifically the one named VIP-mf.zip. To save your eyes from a cumbersome df, you can download the zip file using this (it's 1.5MB).
I only want to read the csv after the row that says 'DATA', which for this specific file is row 309. I can do that easily using:
import pandas as pd
df = pd.read_csv('VIP-mf.csv', skiprows=310)
The problem is that next month, when the new csv is updated on the website, that skiprows parameter will have to be 311, or else it will be read incorrectly. I would like a dynamic skiprows parameter that captures this change every month, so I can automatically download the file and read it correctly.
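To make the idea concrete, this is roughly what I'm picturing (a sketch only; the offset from the 'DATA' marker line to the header row may need adjusting for the real file):

import pandas as pd

def find_marker_row(path, marker='DATA'):
    # Return the 0-based line number of the first line that starts with the marker
    with open(path) as f:
        for i, line in enumerate(f):
            if line.startswith(marker):
                return i
    raise ValueError(f"{marker!r} not found in {path}")

marker_row = find_marker_row('VIP-mf.csv')
# Skip everything up to and including the marker so the header row is read next;
# the exact offset may differ if there are blank lines after 'DATA'
df = pd.read_csv('VIP-mf.csv', skiprows=marker_row + 1)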
I tried implementing a solution from this answer, using this article, by creating a function for the skiprows parameter:
def fetch_skip(index):
    if index == 'DATA':
        return True
    return False

df = pd.read_csv('VIP-mf.csv', skiprows=lambda x: fetch_skip(x))
but I get this error:
ParserError: Error tokenizing data. C error: Expected 4 fields in line 311, saw 7
I'm assuming this is because the csv has "mini-tables" within the single file, even though I only need the final "table", which has the column names:
['per_idx', 'cat_idx', 'dt_idx', 'et_idx', 'geo_idx', 'is_adj', 'val']
Thank you for your help.
P.S. If there is another way to do this besides fiddling with the skiprows parameter, that also works.
Upvotes: 0
Views: 842
Reputation: 28709
As an alternative, if you have access to a Linux machine, you could parse the data through the shell: use grep to pull out the data, then read it in via pandas. Note that for the grep I used a count of 100_000 to get the rows after the match; you can vary that depending on how many rows you think the data will have. I also assume (and I may be wrong) that the headers will always start with per_idx,cat_idx.., which is what I used in grepping:
from io import StringIO
import pandas as pd
import subprocess
shell_string = """grep -A100000 "per_idx,cat_idx" /home/sam/Downloads/VIP-mf/VIP-mf.csv"""
data = subprocess.run(shell_string,
                      shell=True,
                      capture_output=True,
                      text=True).stdout
df = pd.read_csv(StringIO(data))
df.head(5)
per_idx cat_idx dt_idx et_idx geo_idx is_adj val
0 1 1 1 0 1 0 59516.0
1 1 2 1 0 1 0 25972.0
2 1 3 1 0 1 0 33545.0
3 1 4 1 0 1 0 989.0
4 1 5 1 0 1 0 3763.0
df.shape
(65472, 7)
Upvotes: 1
Reputation: 348
I found the answer in another question here. I had to make a slight change.
import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # Read ahead until the marker line is found, remembering where it starts
        while not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        # Rewind to the start of the marker line and let pandas read from there
        f.seek(pos)
        return pd.read_csv(f, **kwargs)
Then I used:
df = skip_to('path_to_file.csv', "DATA", skiprows=1)
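For the monthly automation, something along these lines could sit in front of skip_to (a sketch only; the URL is just a placeholder for the real VIP-mf.zip link, and the CSV name inside the archive is looked up rather than assumed):

import io
import zipfile

import requests

url = 'https://example.com/path/to/VIP-mf.zip'  # placeholder for the actual census link
resp = requests.get(url)
resp.raise_for_status()

with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    # Find the csv inside the archive and extract it to the working directory
    csv_name = next(n for n in zf.namelist() if n.lower().endswith('.csv'))
    zf.extract(csv_name, path='.')

df = skip_to(csv_name, 'DATA', skiprows=1)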
Upvotes: 2