Reputation: 371
If I have multiple text files that I need to parse that look like so, but can vary in terms of column names, and the length of the hashtags above:
How would I go about turning this into a pandas dataframe? I've tried using pd.read_table('file.txt', delim_whitespace = True, skiprows = 14)
, but it has all sorts of problems. My issues are...
All the text, asterisks, and pounds at the top needs to be ignored, but I can't just use skip rows because the size of all the junk up top can vary in length in another file.
The columns "stat (+/-)" and "syst (+/-)" are seen as 4 columns because of the whitespace.
The one pound sign is included in the column names, and I don't want that. I can't just assign the column names manually because they vary from text file to text file.
Any help is much obliged, I'm just not really sure where to go from after I read the file using pandas.
Upvotes: 3
Views: 472
Reputation: 32095
This is the way I'm mentioning in the comment: it uses a file object to skip the custom dirty data you need to skip at the beginning. You land the file offset at the appropriate location in the file where read_fwf
simply does the job:
with open(rawfile, 'r') as data_file:
while(data_file.read(1)=='#'):
last_pound_pos = data_file.tell()
data_file.readline()
data_file.seek(last_pound_pos)
df = pd.read_fwf(data_file)
df
Out[88]:
i mult stat (+/-) syst (+/-) Q2 x x.1 Php
0 0 0.322541 0.018731 0.026681 1.250269 0.037525 0.148981 0.104192
1 1 0.667686 0.023593 0.033163 1.250269 0.037525 0.150414 0.211203
2 2 0.766044 0.022712 0.037836 1.250269 0.037525 0.149641 0.316589
3 3 0.668402 0.024219 0.031938 1.250269 0.037525 0.148027 0.415451
4 4 0.423496 0.020548 0.018001 1.250269 0.037525 0.154227 0.557743
5 5 0.237175 0.023561 0.007481 1.250269 0.037525 0.159904 0.750544
Upvotes: 1
Reputation: 107652
Consider reading in raw file, cleaning it line by line while writing to a new file using csv
module. Regex is used to identify column headers using the i as match criteria. Below assumes more than one space separates columns:
import os
import csv, re
import pandas as pd
rawfile = "path/To/RawText.txt"
tempfile = "path/To/TempText.txt"
with open(tempfile, 'w', newline='') as output_file:
writer = csv.writer(output_file)
with open(rawfile, 'r') as data_file:
for line in data_file:
if re.match('^.*i', line): # KEEP COLUMN HEADER ROW
line = line.replace('\n', '')
row = line.split(" ")
writer.writerow(row)
elif line.startswith('#') == False: # REMOVE HASHTAG LINES
line = line.replace('\n', '')
row = line.split(" ")
writer.writerow(row)
df = pd.read_csv(tempfile) # IMPORT TEMP FILE
df.columns = [c.replace('# ', '') for c in df.columns] # REMOVE '#' IN COL NAMES
os.remove(tempfile) # DELETE TEMP FILE
Upvotes: 2