Reputation: 1684
Background
I have data files which consist of two parts: data in CSV format, and Metadata. I can use the methods given here 1 and here 2 to skip the Metadata portion manually by specifying the location/line number where the Metadata begins.
Following is a sample of the data file:
Here, you can see that I can specify the line number (420) manually and use the following code to skip the Metadata:
import pandas as pd

# Find the line where the Metadata section starts, then skip everything from there on.
with open('data.csv', 'r') as f:
    metadata_location = [i for i, x in enumerate(f.readlines()) if 'Metadata' in x]
with open('data.csv', 'r') as f:
    flat_data = pd.read_csv(f, index_col=False, skiprows=lambda x: x >= metadata_location[0])

# Alternatively, read the whole file and keep only the first 420 rows.
with open('data.csv') as f:
    df = pd.read_csv(f, index_col=False)
df = df[:420]
Question
How can I scan the file to capture the Metadata and then skip reading it? (I will need to process multiple such files, hence I wish to write such code.)
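To make the goal concrete, here is a rough sketch of what I am after for each file (the helper name is just a placeholder, and it assumes every file contains a line with the word 'Metadata'):

import pandas as pd

def split_data_and_metadata(path):
    # Hypothetical helper: return the CSV part as a DataFrame and the
    # Metadata part as raw lines, with the boundary detected instead of hard-coded.
    with open(path) as f:
        lines = f.readlines()
    # The first line containing the word 'Metadata' marks the boundary.
    boundary = next(i for i, line in enumerate(lines) if 'Metadata' in line)
    metadata = lines[boundary:]
    df = pd.read_csv(path, index_col=False, skiprows=lambda x: x >= boundary)
    return df, metadata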
Upvotes: 1
Views: 1223
Reputation: 71687
IIUC, you can pass a callable to the skiprows
argument; it will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. Use:
df = pd.read_csv("data.csv", index_col=False, skiprows=lambda x: x >= 420)
UPDATE: To find the metadata location:
import re

# Scan the file for the line that marks the start of the Metadata section.
md_loc = 0
with open("data.csv") as f:
    for idx, line in enumerate(f):
        if re.search(r'^"Metadata:\s*"$', line):
            md_loc = idx
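Putting the two parts together, a minimal sketch (assuming the Metadata marker is always present in the file) would be:

import re
import pandas as pd

md_loc = None
with open("data.csv") as f:
    for idx, line in enumerate(f):
        if re.search(r'^"Metadata:\s*"$', line):
            md_loc = idx
            break  # the first marker is the start of the Metadata section

# Read only the rows above the Metadata marker.
df = pd.read_csv("data.csv", index_col=False, skiprows=lambda x: x >= md_loc)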
Upvotes: 1
Reputation: 6574
Your question is not clear. If I understood you correctly, you are looking for a way to scan all the lines and run the above code on each?
EDIT 1:
for index, row in All_Patients_Chosen_Visit.iterrows():
    df = row[:420]
See the above code and check whether it works.
Upvotes: 1