Number42
Number42

Reputation: 133

Pandas: intelligent way to infer column names

I was hoping to get some additional input here. My problem is that an output file from an external program produces a text file with a rather complicated header. This header looks like this (with a few sample rows):

* NAME                 KEYWORD                                    S                    BETX                    ALFX                     MUX                    BETY                    ALFY                     MUY                       X                      PX                       Y                      PY                       T                      PT                      DX                     DPX                      DY                     DPY                       L                    LRAD                   ANGLE                     K1L                     K1S                     K2L                     K2S                    TILT                      E1                      E2                    FINT                   FINTX APERTYPE                              APER_1                  APER_2                  APER_3                  APER_4 COMMENTS                                 KSI                   HKICK                   VKICK                    VOLT                     LAG                    FREQ                  HARMON                    RE11                    RE12                    RE13                    RE14                    RE15                    RE16                    RE21                    RE22                    RE23                    RE24                    RE25                    RE26                    RE31                    RE32                    RE33                    RE34                    RE35                    RE36                    RE41                    RE42                    RE43                    RE44                    RE45                    RE46                    RE51                    RE52                    RE53                    RE54                    RE55                    RE56                    RE61                    RE62                    RE63                    RE64                    RE65                    RE66 
$ %s                   %s                                       %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le %s                                       %le                     %le                     %le                     %le %s                                       %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le                     %le 
 "L000013$START"       "MARKER"                                    0     0.99997544084968948 -2.4868542792772026e-05                       0   0.0016028062114609705  -0.0015362599226402803                       0 -3.9960208164599792e-12 -2.3117945993838543e-07 -3.1289252451959499e-21 -3.5787461173940813e-18                       0                       0 -8.0262719745944669e-08  9.9999822209857522e-07 -2.5072600388022476e-21  1.5546880440816971e-19                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0 "CIRCLE"                                    0                       0                       0                       0 ""                                          0                       0                       0                       0                       0                       0                       0     0.78625931963285645     0.61786193233791331  8.4376949871511897e-15 -5.7137454489986084e-17                       0 -6.3501624209261554e-07    -0.61789228216621062     0.78629005103940219  1.5432100042289676e-14 -2.9815559743351372e-17                       0  1.6411585361703815e-07  5.0306980803327406e-17 -7.3725747729014302e-18     0.42113408349746351   0.0014527043440388243                       0  6.7762635780344027e-21 -2.5909612996755094e-14  1.6239180139487885e-14     -565.47866828536485     0.42391886365021597                       0  2.3716922523120409e-19  2.6333401523484748e-07 -6.0070789235093852e-07  6.9388939039072284e-18                       0                       1     -0.7100070931506306                       0                       0                       0                       0                       0                       1
 "IP.1"                "MARKER"                                    0     0.99997544084968948 -2.4868542792772026e-05                       0   0.0016028062114609705  -0.0015362599226402803                       0 -3.9960208164599792e-12 -2.3117945993838543e-07 -3.1289252451959499e-21 -3.5787461173940813e-18                       0                       0 -8.0262719745944669e-08  9.9999822209857522e-07 -2.5072600388022476e-21  1.5546880440816971e-19                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0 "CIRCLE"                 0.014999999999999999                       0                       0                       0 ""                                          0                       0                       0                       0                       0                       0                       0     0.78625931963285645     0.61786193233791331  8.4376949871511897e-15 -5.7137454489986084e-17                       0 -6.3501624209261554e-07    -0.61789228216621062     0.78629005103940219  1.5432100042289676e-14 -2.9815559743351372e-17                       0  1.6411585361703815e-07  5.0306980803327406e-17 -7.3725747729014302e-18     0.42113408349746351   0.0014527043440388243                       0  6.7762635780344027e-21 -2.5909612996755094e-14  1.6239180139487885e-14     -565.47866828536485     0.42391886365021597                       0  2.3716922523120409e-19  2.6333401523484748e-07 -6.0070789235093852e-07  6.9388939039072284e-18                       0                       1     -0.7100070931506306                       0                       0                       0                       0                       0                       1
 "SOL1R1.1"            "SOLENOID"                 1.0001125105478401      2.0002711841856438     -1.0001574437785969      0.1250089995511387      624.05172971938396     -623.97782494270291     0.24950043267581581 -2.5820171273456538e-07 -2.7166326049300749e-07 -2.4644694700727079e-05 -2.4641900061065806e-05 -6.0732928168700292e-10                       0  1.2320286106432364e-06   1.080963267922299e-06  4.9287833861482926e-05  2.4640168503266348e-05      1.0001125105478401                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0                       0 "CIRCLE"                 0.014999999999999999                       0                       0                       0 ""                      0.0032857664417938401                       0                       0                       0                       0                       0                       0     0.16677454852979498      1.2374450524095637    -0.92870122060954885      0.9271984583845112                       0  2.2616270165159844e-05    -0.61941520497814673      1.4057728585838938    -0.92799611647829161     0.92750603109268515                       0  2.3209525218003638e-05    -0.92870122060959848     0.92719845838457737     -565.11861313300869      565.60812571287204                       0    0.013966222544609984    -0.92799611647833302      0.9275060310927431      -565.4756174247633      565.96366804371507                       0    0.013950409023346324 -2.2718014995738109e-05  2.2030728038876717e-05    -0.01392194968977813    0.013909296330422054                       1     -0.7100067496917396                       0                       0                       0                       0                       0                       1

As can be seen, the first character is an asterisk followed by the actual first column name. Furthermore, not all columns are separated by the same amount of white spaces. I'm using pandas read_table() by specifying the separator to be r'\s+ (would prefer to keep it over regex since that drops back to python engine in pandas):

df = read_table( filename, sep = r'\s+', index_col = False )

But then, the resulting DataFrame has a mismatch in columns since the separator also counts the first white space in * NAME. It can be seen by using the provided snippet as MWE. That leads to a bunch of problems (for example specifying the dtypes for each column). The last columns therefore is filled with NaN to match the longer header with columns.

EDIT 1 Using the following would solve the problem but requires some gymnastics:

df = read_table( filename, sep=r'\s+', skiprows=[1] )
names = df.columns[1:]
df.drop( df.columns[len(df.columns)-1], axis = 1, inplace = True )
df.columns = names

First, extract column names, strip the *, drop the last column with NaN and replace the names.

Another way is to manually delete the asterisk in the file and then reading it into a DataFrame. That works but is not the best solution for several reasons.

Maybe someone can look at the approach above and tell me if it's a good solution or if there are better ways to do it.

Thanks in advance!

Upvotes: 1

Views: 707

Answers (2)

Tim.Lucas
Tim.Lucas

Reputation: 271

So if I understand you correctly, IF there is an asterisk in the first line of the text (assuming a .txt file), you need to remove it?

In that case, you can "clean" the file as follows by creating a new clean file. If you wrap this into a function, then this can be done for all files that your program outputs

import re

with open('test_text.txt', 'r') as input_file: # Assuming that this is the name of your text file
    i = 1
    with open('clean_test.txt', 'w') as output: #clean_tst.txt will be the name of the output, choose any name you like here
        for line in input_file:
            if i == 1: # This if is a bit ugly, but makes sure you're only removing the asterisk from the first row
                line  = re.sub('^\*\s+', '', line)
                i += 1
            output.write(line)

-- Edit, to edit the file inplace, following function should do the trick:

import re
import pandas as pd
import io

def clean_file(input_file):
    output_rep = io.StringIO()
    with open(input_file, 'r') as input_file: 
        i = 1
        for line in input_file:
            if i == 1: # This if is a bit ugly, but makes sure you're only removing the asterisk from the first row, NOT any other ones
                line  = re.sub('^\*\s+', '', line)
                i += 1
            output_rep.write(line)
    output_rep.seek(0)
    return output_rep

input_file = 'test_text.txt'
test = clean_file(input_file)

df = pd.read_table(test, sep = r'\s+', index_col = False)

Upvotes: 0

kantal
kantal

Reputation: 2407

You can try:

df= pd.read_csv(filename,sep=r"\s{2,}",engine="python",skiprows=[1])

and modify the column names:

df.columns=[ colname.replace("*","").strip() for colname in df.columns]

Edit: You can use "rename", too:

df= df.rename(columns=lambda cname: cname.replace("*","").strip())

or

df= df.rename(columns={"* Name":"Name"})

Upvotes: 1

Related Questions