Reputation: 52
I want to read and process a csv file with pandas. The file (as seen below) contains multiple header lines which are indicated by a #
tag. I can import that file easily by using
import pandas as pd

file = "data.csv"
data = pd.read_csv(file, delimiter=r"\s+",
                   names=["Time", "Cd", "Cs", "Cl", "CmRoll", "CmPitch", "CmYaw",
                          "Cd(f)", "Cd(r)", "Cs(f)", "Cs(r)", "Cl(f)", "Cl(r)"],
                   skiprows=13)
However, I have a lot of such files with different header names, and I don't want to type the names (Time, Cd, Cs, ...) manually. Also, the number of commented lines differs between files. So I want to automate that task.
Do I have to use something like a regular expression here before passing the data into a pandas DataFrame?
Thanks for any advice.
And yes, the header names also begin with a #.
data.csv:
# Force coefficients
# dragDir : (9.9735673312816520e-01 7.2660490528994301e-02 0.0000000000000000e+00)
# sideDir : (0.0000000000000000e+00 0.0000000000000000e+00 -1.0000000000000002e+00)
# liftDir : (-7.2660490528994315e-02 9.9735673312816520e-01 0.0000000000000000e+00)
# rollAxis : (9.9735673312816520e-01 7.2660490528994301e-02 0.0000000000000000e+00)
# pitchAxis : (0.0000000000000000e+00 0.0000000000000000e+00 -1.0000000000000002e+00)
# yawAxis : (-7.2660490528994315e-02 9.9735673312816520e-01 0.0000000000000000e+00)
# magUInf : 4.5000000000000000e+01
# lRef : 5.9399999999999997e-01
# Aref : 3.5639999999999999e-03
# CofR : (1.4999999999999999e-01 0.0000000000000000e+00 0.0000000000000000e+00)
#
# Time Cd Cs Cl CmRoll CmPitch CmYaw Cd(f) Cd(r) Cs(f) Cs(r) Cl(f) Cl(r)
5e-06 1.8990180226147195e+00 1.4919925634649792e-11 2.1950119509976829e+00 -1.1085971520784955e-02 -1.0863798447281650e+00 9.5910040927874810e-03 9.3842303978657482e-01 9.6059498282814471e-01 9.5910041002474442e-03 -9.5910040853275178e-03 1.1126130770676479e-02 2.1838858202270064e+00
1e-05 2.1428508927716594e+00 1.0045114197556737e-08 2.5051633252700962e+00 -1.2652317494411272e-02 -1.2367567798452046e+00 1.0822379290263353e-02 1.0587731288914184e+00 1.0840777638802410e+00 1.0822384312820453e-02 -1.0822374267706254e-02 1.5824882789843508e-02 2.4893384424802525e+00
...
Upvotes: 3
Views: 3564
Reputation: 23099
A little bit of regex could help. This is not the most beautiful of solutions, so feel free to post a better one.
Let's read the first 50 rows of any file and find the last line starting with a hash (the pattern ^# matches a # at the beginning of a line), which should be the column-name row. Code:
import re
import pandas as pd

n_rows = 50
path_ = 'your_file_location'
data = []
with open(path_, 'r') as f:
    for i, line in enumerate(f):
        if i >= n_rows:                  # scan only the first 50 lines
            break
        if re.match(r'^#', line):
            data.append(line)
start_col = len(data) - 1                # index of the last commented line
df = pd.read_csv(path_, sep=r'\s+', skiprows=start_col)  # use your actual delimiter.
# Time Cd Cs Cl CmRoll CmPitch \
0 0.000005 1.899018 1.491993e-11 2.195012 -0.011086 -1.086380 0.009591
1 0.000010 2.142851 1.004511e-08 2.505163 -0.012652 -1.236757 0.010822
CmYaw Cd(f) Cd(r) Cs(f) Cs(r) Cl(f) Cl(r)
0 0.938423 0.960595 0.009591 -0.009591 0.011126 2.183886 NaN
1 1.058773 1.084078 0.010822 -0.010822 0.015825 2.489338 NaN
However, this leaves the # in the column names and shifts everything by one, producing the NaN column above. We can fix it in two steps: read the data with no header, then read 0 rows of the header line and slice away the leading #.
First, read in the file from the row after the header, setting the header argument to None so no header is assigned.
We can then set the column headers manually:
df = pd.read_csv(path_, sep=r'\s+', skiprows=start_col + 1, header=None)
df.columns = pd.read_csv(path_, sep=r'\s+', skiprows=start_col, nrows=0).columns[1:]
print(df)
Time Cd Cs Cl CmRoll CmPitch CmYaw \
0 0.000005 1.899018 1.491993e-11 2.195012 -0.011086 -1.086380 0.009591
1 0.000010 2.142851 1.004511e-08 2.505163 -0.012652 -1.236757 0.010822
Cd(f) Cd(r) Cs(f) Cs(r) Cl(f) Cl(r)
0 0.938423 0.960595 0.009591 -0.009591 0.011126 2.183886
1 1.058773 1.084078 0.010822 -0.010822 0.015825 2.489338
Upvotes: 1
Reputation: 121
import pandas as pd

# Read the lines in the file
with open(file) as f:
    lines = f.readlines()

# The last commented line is the header
header = [line for line in lines if line.startswith('#')][-1]

# Strip the line and remove the '#'
header = header[1:].strip().split()

df = pd.read_csv(file, delimiter=r"\s+", names=header, comment='#')
Upvotes: 0
Reputation: 10799
This should do; it's easy and efficient, it keeps variables to a minimum, and it doesn't require any input aside from the filename.
import pandas as pd

with open(file, 'r') as f:
    for line in f:
        if line.startswith('#'):
            header = line
        else:
            break  # stop when there are no more '#' lines

header = header[1:].strip().split()
data = pd.read_csv(file, delimiter=r"\s+", comment='#', names=header)
You first open the file and read only the commented lines (this is fast and memory-efficient). The last valid line is the header, which is cleaned up and converted to a list. Finally, you open the file with pandas.read_csv() using comment='#', which skips the commented lines, and names=header.
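Since the question mentions many such files, the loop above can be wrapped in a small helper and applied to a whole directory. A minimal sketch (the helper name, glob pattern, and demo file are placeholders, not from the original post):

```python
import glob
import pandas as pd

def read_force_coeffs(path):
    """Read one whitespace-delimited file whose last '#' comment line
    holds the column names."""
    with open(path) as f:
        for line in f:
            if line.startswith('#'):
                header = line      # remember the most recent comment line
            else:
                break              # first data line reached
    names = header[1:].strip().split()
    return pd.read_csv(path, sep=r'\s+', comment='#', names=names)

# demo file mimicking the format in the question (made-up values)
with open('data_demo.csv', 'w') as f:
    f.write('# Force coefficients\n'
            '# Time Cd Cs\n'
            '5e-06 1.89 1.49e-11\n'
            '1e-05 2.14 1.00e-08\n')

# apply the helper to every matching file (pattern is a placeholder)
frames = {p: read_force_coeffs(p) for p in glob.glob('data_demo*.csv')}
df = frames['data_demo.csv']
print(df.columns.tolist())  # ['Time', 'Cd', 'Cs']
```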
Upvotes: 4
Reputation: 3261
Assuming that comments always start with a single '#' and the header is in the last commented line:
import csv
import pandas as pd

def read_comments(csv_file):
    for row in csv_file:
        if row.startswith('#'):
            yield row.split('#')[1].strip()

def get_last_commented_line(filename):
    with open(filename, 'r', newline='') as f:
        reader = csv.reader(read_comments(f), delimiter=' ', skipinitialspace=True)
        decommented_lines = [line for line in reader]
    header = decommented_lines[-1]
    skiprows = len(decommented_lines)
    return header, skiprows

header, skiprows = get_last_commented_line(path)
df = pd.read_csv(path, sep=r'\s+', names=header, skiprows=skiprows)
Upvotes: 0
Reputation:
To simplify things and save time without using loops, you can create two DataFrames: one for the # commented rows, and one for the rest.
From the commented rows, take the last one, which is your header, and then merge the data DataFrame and this title using concat().
Also, if it's necessary to assign the first row as the header, you can use df.columns = df.iloc[0].
df = pd.DataFrame({
'A':['#test1 : (000000)','#test1 (000000)','#test1 (000000)','#test1 (000000)','#Time (000000)','5e-06','1e-05'],
})
print(df)
A
0 #test1 : (000000)
1 #test1 (000000)
2 #test1 (000000)
3 #test1 (000000)
4 #Time (000000)
5 5e-06
6 1e-05
df_header = df[df.A.str.contains('^#')]
print(df_header)
A
0 #test1 : (000000)
1 #test1 (000000)
2 #test1 (000000)
3 #test1 (000000)
4 #Time (000000)
df_data = df[~df.A.str.contains('^#')]
print(df_data)
A
5 5e-06
6 1e-05
df = (pd.concat([df_header.iloc[[-1]],df_data])).reset_index(drop=True)
df.A = df.A.str.replace(r'^#', "", regex=True)
print(df)
A
0 Time (000000)
1 5e-06
2 1e-05
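To illustrate the df.columns = df.iloc[0] step mentioned above, a minimal sketch with made-up values:

```python
import pandas as pd

# toy frame whose first row holds the real column names (made-up values)
df = pd.DataFrame({'A': ['Time', '5e-06', '1e-05'],
                   'B': ['Cd', '1.89', '2.14']})
df.columns = df.iloc[0]                   # promote the first row to header
df = df.iloc[1:].reset_index(drop=True)   # drop the now-duplicated row
print(list(df.columns))  # ['Time', 'Cd']
```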
Upvotes: 0
Reputation: 1934
What about extracting the header before you read the file?
We only assume that your header lines start with #. Extraction of the header, as well as its position in the file, is automated. We also ensure that no more lines than necessary are read (except for the first data line).
import pandas as pd

with open(file) as f:
    line = f.readline()
    cnt = 0
    while line.startswith('#'):
        prev_line = line
        line = f.readline()
        cnt += 1

# print(prev_line)
header = prev_line.strip().lstrip('# ').split()

df = pd.read_csv(file, delimiter=r"\s+",
                 names=header,
                 skiprows=cnt)
With this, you can also process the other header lines. It also gives you the position of the header in the file.
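Because the other header lines in the sample file follow a "name : value" pattern, they can be collected into a dict in the same pass. A sketch under that assumption (the function name and demo file are made up to mirror the question's format):

```python
import re

def parse_header_metadata(path):
    """Collect 'name : value' pairs from the leading '#' lines."""
    meta = {}
    with open(path) as f:
        for line in f:
            if not line.startswith('#'):
                break                        # header block is over
            m = re.match(r'#\s*(\w+)\s*:\s*(.+)', line)
            if m:                            # lines without ':' are skipped
                meta[m.group(1)] = m.group(2).strip()
    return meta

# demo file mirroring the question's format (made-up values)
with open('meta_demo.csv', 'w') as f:
    f.write('# magUInf : 4.5e+01\n'
            '# lRef : 5.94e-01\n'
            '# Time Cd\n'
            '1e-05 2.14\n')

print(parse_header_metadata('meta_demo.csv'))
# {'magUInf': '4.5e+01', 'lRef': '5.94e-01'}
```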
Upvotes: 2