SignalProcessed

Reputation: 371

Python: Pandas, dealing with spaced column names

I have multiple text files to parse that look like the one below, but they can vary in terms of column names and the length of the hashtag block at the top: txt.file

How would I go about turning this into a pandas dataframe? I've tried using pd.read_table('file.txt', delim_whitespace=True, skiprows=14), but it has all sorts of problems. My issues are...

All the text, asterisks, and pound signs at the top need to be ignored, but I can't just use skiprows because the size of the junk at the top varies from file to file.

The columns "stat (+/-)" and "syst (+/-)" are seen as 4 columns because of the whitespace inside their names.

The pound sign is included in the first column name, and I don't want that. I can't just assign the column names manually because they vary from text file to text file.
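For example, splitting a header line like the one below on whitespace shows the second and third issues at once (the exact header here is a hypothetical one modeled on my files):

```python
# Hypothetical header line; plain whitespace splitting keeps the '#'
# as a token and breaks the paired column names apart.
header = "# i   mult   stat (+/-)   syst (+/-)   Q2"
print(header.split())
# ['#', 'i', 'mult', 'stat', '(+/-)', 'syst', '(+/-)', 'Q2']
```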

Any help would be much appreciated; I'm just not really sure where to go after I read the file using pandas.

Upvotes: 3

Views: 472

Answers (2)

Zeugma

Reputation: 32095

This is the approach I mentioned in the comment: it uses a file object to skip the variable-length junk at the beginning. You land the file offset at the right location in the file, where read_fwf simply does the job:

import pandas as pd

with open(rawfile, 'r') as data_file:
    last_pound_pos = 0                      # fallback if there are no '#' lines
    while data_file.read(1) == '#':
        last_pound_pos = data_file.tell()   # offset just past the '#' on this line
        data_file.readline()                # skip the rest of the comment line
    # rewind to just after the '#' of the header line, so read_fwf
    # sees the column names without the leading pound sign
    data_file.seek(last_pound_pos)
    df = pd.read_fwf(data_file)

df
Out[88]: 
   i      mult  stat (+/-)  syst (+/-)        Q2         x       x.1       Php
0  0  0.322541    0.018731    0.026681  1.250269  0.037525  0.148981  0.104192
1  1  0.667686    0.023593    0.033163  1.250269  0.037525  0.150414  0.211203
2  2  0.766044    0.022712    0.037836  1.250269  0.037525  0.149641  0.316589
3  3  0.668402    0.024219    0.031938  1.250269  0.037525  0.148027  0.415451
4  4  0.423496    0.020548    0.018001  1.250269  0.037525  0.154227  0.557743
5  5  0.237175    0.023561    0.007481  1.250269  0.037525  0.159904  0.750544

Upvotes: 1

Parfait

Reputation: 107652

Consider reading in the raw file and cleaning it line by line while writing to a new file with the csv module. A regex is used to identify the column header row, using the i column name as the match criterion. The code below assumes at least two spaces separate columns:

import os
import csv, re
import pandas as pd

rawfile = "path/To/RawText.txt"
tempfile = "path/To/TempText.txt"

with open(tempfile, 'w', newline='') as output_file:
    writer = csv.writer(output_file)

    with open(rawfile, 'r') as data_file:
        for line in data_file:
            if re.match('^.*i', line):                     # KEEP COLUMN HEADER ROW
                writer.writerow(line.rstrip('\n').split("  "))
            elif not line.startswith('#'):                 # SKIP HASHTAG JUNK LINES
                writer.writerow(line.rstrip('\n').split("  "))

df = pd.read_csv(tempfile)                                 # IMPORT TEMP FILE
df.columns = [c.replace('# ', '') for c in df.columns]     # REMOVE '#' IN COL NAMES

os.remove(tempfile)                                        # DELETE TEMP FILE
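Here is a self-contained sketch of the same clean-and-rewrite idea, using an in-memory buffer instead of a temp file and a small hypothetical sample mimicking the question's layout (junk lines, a '#'-prefixed header, data separated by multiple spaces):

```python
import csv
import io
import re
import pandas as pd

# Hypothetical raw content modeled on the question's file layout
raw_lines = [
    "#**** junk ****\n",
    "# more junk\n",
    "# i  mult  stat (+/-)\n",
    "0  0.32  0.018\n",
    "1  0.66  0.023\n",
]

buffer = io.StringIO()
writer = csv.writer(buffer)
for line in raw_lines:
    if re.match('^.*i', line):              # keep the column header row
        writer.writerow(line.rstrip('\n').split("  "))
    elif not line.startswith('#'):          # keep data rows, drop '#' junk
        writer.writerow(line.rstrip('\n').split("  "))

buffer.seek(0)
df = pd.read_csv(buffer)
df.columns = [c.replace('# ', '') for c in df.columns]
print(list(df.columns))  # ['i', 'mult', 'stat (+/-)']
```

Note the regex relies on the junk lines not containing an i; with real files you may want a stricter pattern anchored on the actual header text.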

Upvotes: 2
