Reputation: 133
I was hoping to get some additional input here. My problem is that an output file from an external program produces a text file with a rather complicated header. This header looks like this (with a few sample rows):
* NAME KEYWORD S BETX ALFX MUX BETY ALFY MUY X PX Y PY T PT DX DPX DY DPY L LRAD ANGLE K1L K1S K2L K2S TILT E1 E2 FINT FINTX APERTYPE APER_1 APER_2 APER_3 APER_4 COMMENTS KSI HKICK VKICK VOLT LAG FREQ HARMON RE11 RE12 RE13 RE14 RE15 RE16 RE21 RE22 RE23 RE24 RE25 RE26 RE31 RE32 RE33 RE34 RE35 RE36 RE41 RE42 RE43 RE44 RE45 RE46 RE51 RE52 RE53 RE54 RE55 RE56 RE61 RE62 RE63 RE64 RE65 RE66
$ %s %s %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %s %le %le %le %le %s %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le %le
"L000013$START" "MARKER" 0 0.99997544084968948 -2.4868542792772026e-05 0 0.0016028062114609705 -0.0015362599226402803 0 -3.9960208164599792e-12 -2.3117945993838543e-07 -3.1289252451959499e-21 -3.5787461173940813e-18 0 0 -8.0262719745944669e-08 9.9999822209857522e-07 -2.5072600388022476e-21 1.5546880440816971e-19 0 0 0 0 0 0 0 0 0 0 0 0 "CIRCLE" 0 0 0 0 "" 0 0 0 0 0 0 0 0.78625931963285645 0.61786193233791331 8.4376949871511897e-15 -5.7137454489986084e-17 0 -6.3501624209261554e-07 -0.61789228216621062 0.78629005103940219 1.5432100042289676e-14 -2.9815559743351372e-17 0 1.6411585361703815e-07 5.0306980803327406e-17 -7.3725747729014302e-18 0.42113408349746351 0.0014527043440388243 0 6.7762635780344027e-21 -2.5909612996755094e-14 1.6239180139487885e-14 -565.47866828536485 0.42391886365021597 0 2.3716922523120409e-19 2.6333401523484748e-07 -6.0070789235093852e-07 6.9388939039072284e-18 0 1 -0.7100070931506306 0 0 0 0 0 1
"IP.1" "MARKER" 0 0.99997544084968948 -2.4868542792772026e-05 0 0.0016028062114609705 -0.0015362599226402803 0 -3.9960208164599792e-12 -2.3117945993838543e-07 -3.1289252451959499e-21 -3.5787461173940813e-18 0 0 -8.0262719745944669e-08 9.9999822209857522e-07 -2.5072600388022476e-21 1.5546880440816971e-19 0 0 0 0 0 0 0 0 0 0 0 0 "CIRCLE" 0.014999999999999999 0 0 0 "" 0 0 0 0 0 0 0 0.78625931963285645 0.61786193233791331 8.4376949871511897e-15 -5.7137454489986084e-17 0 -6.3501624209261554e-07 -0.61789228216621062 0.78629005103940219 1.5432100042289676e-14 -2.9815559743351372e-17 0 1.6411585361703815e-07 5.0306980803327406e-17 -7.3725747729014302e-18 0.42113408349746351 0.0014527043440388243 0 6.7762635780344027e-21 -2.5909612996755094e-14 1.6239180139487885e-14 -565.47866828536485 0.42391886365021597 0 2.3716922523120409e-19 2.6333401523484748e-07 -6.0070789235093852e-07 6.9388939039072284e-18 0 1 -0.7100070931506306 0 0 0 0 0 1
"SOL1R1.1" "SOLENOID" 1.0001125105478401 2.0002711841856438 -1.0001574437785969 0.1250089995511387 624.05172971938396 -623.97782494270291 0.24950043267581581 -2.5820171273456538e-07 -2.7166326049300749e-07 -2.4644694700727079e-05 -2.4641900061065806e-05 -6.0732928168700292e-10 0 1.2320286106432364e-06 1.080963267922299e-06 4.9287833861482926e-05 2.4640168503266348e-05 1.0001125105478401 0 0 0 0 0 0 0 0 0 0 0 "CIRCLE" 0.014999999999999999 0 0 0 "" 0.0032857664417938401 0 0 0 0 0 0 0.16677454852979498 1.2374450524095637 -0.92870122060954885 0.9271984583845112 0 2.2616270165159844e-05 -0.61941520497814673 1.4057728585838938 -0.92799611647829161 0.92750603109268515 0 2.3209525218003638e-05 -0.92870122060959848 0.92719845838457737 -565.11861313300869 565.60812571287204 0 0.013966222544609984 -0.92799611647833302 0.9275060310927431 -565.4756174247633 565.96366804371507 0 0.013950409023346324 -2.2718014995738109e-05 2.2030728038876717e-05 -0.01392194968977813 0.013909296330422054 1 -0.7100067496917396 0 0 0 0 0 1
As can be seen, the first character is an asterisk followed by the actual first column name. Furthermore, not all columns are separated by the same amount of white spaces. I'm using pandas read_table()
by specifying the separator to be r'\s+
(would prefer to keep it over regex since that drops back to python
engine in pandas
):
df = read_table( filename, sep = r'\s+', index_col = False )
But then, the resulting DataFrame
has a mismatch in columns since the separator also counts the first white space in * NAME
. It can be seen by using the provided snippet as MWE. That leads to a bunch of problems (for example specifying the dtypes for each column). The last columns therefore is filled with NaN
to match the longer header with columns.
EDIT 1 Using the following would solve the problem but requires some gymnastics:
df = read_table( filename, sep=r'\s+', skiprows=[1] )
names = df.columns[1:]
df.drop( df.columns[len(df.columns)-1], axis = 1, inplace = True )
df.columns = names
First, extract column names, strip the *
, drop the last column with NaN
and replace the names.
Another way is to manually delete the asterisk in the file and then reading it into a DataFrame
. That works but is not the best solution for several reasons.
Maybe someone can look at the approach above and tell me if it's a good solution or if there are better ways to do it.
Thanks in advance!
Upvotes: 1
Views: 707
Reputation: 271
So if I understand you correctly, IF there is an asterisk in the first line of the text (assuming a .txt
file), you need to remove it?
In that case, you can "clean" the file as follows by creating a new clean file. If you wrap this into a function, then this can be done for all files that your program outputs
import re
with open('test_text.txt', 'r') as input_file: # Assuming that this is the name of your text file
i = 1
with open('clean_test.txt', 'w') as output: #clean_tst.txt will be the name of the output, choose any name you like here
for line in input_file:
if i == 1: # This if is a bit ugly, but makes sure you're only removing the asterisk from the first row
line = re.sub('^\*\s+', '', line)
i += 1
output.write(line)
-- Edit, to edit the file inplace, following function should do the trick:
import re
import pandas as pd
import io
def clean_file(input_file):
output_rep = io.StringIO()
with open(input_file, 'r') as input_file:
i = 1
for line in input_file:
if i == 1: # This if is a bit ugly, but makes sure you're only removing the asterisk from the first row, NOT any other ones
line = re.sub('^\*\s+', '', line)
i += 1
output_rep.write(line)
output_rep.seek(0)
return output_rep
input_file = 'test_text.txt'
test = clean_file(input_file)
df = pd.read_table(test, sep = r'\s+', index_col = False)
Upvotes: 0
Reputation: 2407
You can try:
df= pd.read_csv(filename,sep=r"\s{2,}",engine="python",skiprows=[1])
and modify the column names:
df.columns=[ colname.replace("*","").strip() for colname in df.columns]
Edit: You can use "rename", too:
df= df.rename(columns=lambda cname: cname.replace("*","").strip())
or
df= df.rename(columns={"* Name":"Name"})
Upvotes: 1