Using Pandas to read data and skip metadata

Question

Background

I have data files that consist of two parts: data in CSV format, followed by Metadata. Using the methods given here 1 and here 2, I can skip the Metadata portion manually by specifying the line number at which the Metadata begins.

The following is a sample of the data file:

Here, you can see that I can specify the line number (420) manually and use the following code to skip the Metadata:

import pandas as pd

# Collect the 0-based line numbers of every line that mentions 'Metadata'
with open('data.csv', 'r') as f:
    metadata_location = [i for i, x in enumerate(f) if 'Metadata' in x]
# Skip every line from the first 'Metadata' line onward
flat_data = pd.read_csv('data.csv', index_col=False, skiprows=lambda x: x >= metadata_location[0])

with open('data.csv') as f:
    df = pd.read_csv(f, index_col=False)
df = df[:420]

Question

How can I scan the file to locate the Metadata section and then skip it when reading? (I need to process many such files, so I would like to do this in code rather than by hand.)

Shubham Sharma · Accepted Answer

IIUC, you can pass a callable to the skiprows argument. It is evaluated against each row index and should return True if the row should be skipped and False otherwise. Use:

df = pd.read_csv("data.csv", index_col=False, skiprows=lambda x: x >= 420)
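To see how the callable is applied, here is a minimal sketch on an in-memory CSV. The sample text and the cutoff of 3 are made up for illustration and are not taken from the asker's file. The indices passed to the callable are 0-based line indices of the file, so the header row counts as index 0.

```python
import io
import pandas as pd

# Hypothetical sample: a header, two data rows, then a metadata block.
raw = "a,b\n1,2\n3,4\nMetadata:\nsource,demo\n"

# The callable receives each 0-based line index; returning True skips that line.
# Here index 0 is the header, 1-2 are data, and 3 onward is metadata.
df = pd.read_csv(io.StringIO(raw), index_col=False, skiprows=lambda x: x >= 3)
print(df)
```

Only the header and the two data rows are parsed; every line from index 3 onward is ignored.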

UPDATE: To find the metadata location:

import re

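md_loc = 0  # stays 0 if no marker line is found
with open("data.csv") as f:
    for idx, line in enumerate(f):
        # Match a line that consists only of the quoted text "Metadata:"
        if re.search(r'^"Metadata:\s*"$', line):
            md_loc = idx
            break  # the metadata starts at the first match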
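Putting the two pieces together, one possible sketch for handling many files is shown below. The names find_metadata_line and read_data_only are hypothetical helpers, not part of pandas. The marker pattern is the one from the snippet above, so this assumes the metadata block starts with a line that is exactly "Metadata:" in double quotes; adjust the pattern if your files differ.

```python
import re
import pandas as pd

# Pattern from the answer above: a line consisting of the quoted text "Metadata:".
MARKER = re.compile(r'^"Metadata:\s*"$')

def find_metadata_line(path):
    """Return the 0-based index of the first marker line, or None if absent."""
    with open(path) as f:
        for idx, line in enumerate(f):
            if MARKER.search(line):
                return idx
    return None

def read_data_only(path):
    """Read the CSV part of the file, skipping everything from the marker on."""
    md_loc = find_metadata_line(path)
    if md_loc is None:
        # Assumption: a file without a marker is plain CSV all the way through.
        return pd.read_csv(path, index_col=False)
    return pd.read_csv(path, index_col=False, skiprows=lambda x: x >= md_loc)
```

With a list of file paths, something like [read_data_only(p) for p in paths] would then give one DataFrame per file.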
