Neeraj Hanumante

Reputation: 1684

Using Pandas to read data and skip metadata

Background

I have data files that consist of two parts: data in CSV format, followed by metadata. Using the methods given here [1] and here [2], I can skip the metadata portion manually by specifying the line number at which the metadata begins.

The following is a sample of the data file (screenshot, link [3]).

Here, you can see that I can specify the line number (420) manually and use the following code to skip the metadata:

import pandas as pd

# Option 1: find the lines containing 'Metadata' and skip everything from the first one onward
with open('data.csv', 'r') as f:
    metadata_location = [i for i, x in enumerate(f) if 'Metadata' in x]
flat_data = pd.read_csv('data.csv', index_col=False, skiprows=lambda x: x >= metadata_location[0])

# Option 2: read the whole file, then keep only the first 420 rows (hard-coded)
df = pd.read_csv('data.csv', index_col=False)
df = df[:420]

Question

How can I scan the file to capture the metadata and then skip reading it? (I need to process multiple such files, so I would like to write code that does this automatically.)

Upvotes: 1

Views: 1223

Answers (2)

Shubham Sharma

Reputation: 71687

IIUC, you can pass a callable to the skiprows argument. It is evaluated against each row index and should return True if the row should be skipped and False otherwise. Use:

df = pd.read_csv("data.csv", index_col=False, skiprows=lambda x: x >= 420)
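To see how the callable behaves, here is a minimal sketch on a small in-memory file. The sample contents and the cutoff index 4 are made up for illustration and are not taken from the question's data:

```python
import io

import pandas as pd

# Hypothetical sample: a header, three data rows, then a metadata block.
sample = "a,b\n1,2\n3,4\n5,6\nMetadata:\nsource,test\n"

# Row indices passed to the callable are 0-based file lines and include
# the header line, so the "Metadata:" line has index 4. Skip it and
# every line after it.
df = pd.read_csv(io.StringIO(sample), index_col=False, skiprows=lambda x: x >= 4)
print(df)
```

Only the header and the three data rows are parsed; the metadata lines never reach the parser.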

UPDATE: To find the metadata location:

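import re

md_loc = 0  # remains 0 if no metadata line is found
with open("data.csv") as f:
    for idx, line in enumerate(f):
        # match a line that consists only of the quoted text "Metadata:"
        if re.search(r'^"Metadata:\s*"$', line):
            md_loc = idx  # 0-based index of the metadata line
            break  # stop at the first match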
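Since the question mentions several files, the scan and the read can be wrapped in one helper. This is only a sketch under assumptions: the name `read_data_part` and the plain substring test on `marker` are my own choices, not something from the question, and a file without a metadata line is simply read in full.

```python
import pandas as pd


def read_data_part(path, marker="Metadata"):
    """Read the CSV part of `path`, stopping at the first line containing `marker`."""
    md_loc = None
    with open(path) as f:
        for idx, line in enumerate(f):
            if marker in line:
                md_loc = idx  # 0-based index of the first metadata line
                break
    if md_loc is None:
        # No metadata block found (assumed behaviour): read the whole file.
        return pd.read_csv(path, index_col=False)
    # Skip the metadata line and everything after it.
    return pd.read_csv(path, index_col=False, skiprows=lambda x: x >= md_loc)
```

It could then be applied per file, for example `frames = [read_data_part(p) for p in paths]`, where `paths` is whatever list of file names you have.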

Upvotes: 1

gtomer

Reputation: 6574

Your question is not clear. If I understood correctly, you are looking for a way to scan all the lines and run the above code on each?

EDIT 1:

for index, row in All_Patients_Chosen_Visit.iterrows():
    df = row[:420]

See the code above and check whether it works.

Upvotes: 1
