Number42
Number42

Reputation: 133

read_table handling missing strings

While trying to import data from a .txt file using pandas read_table(), I came across an interesting issue.

The file contains a column only partly filled with strings (OrigVol). The empty cells are then filled with values from the following column and NaNs are added at the end, where this shift leads to empty cells. The image below shows the result of using pd.read_table( file.txt, sep = r'\s+') after import pandas as pd.

enter image description here

The column OrigVol should contain empty strings, or NaN's, instead of the column prim_pos_eutags. I'm new to panda and it's diffcult to come up with a MWE that exactly reproduces the problem. One would need the file (which I couldn't upload).

Maybe I missed to specify an import option. Let me know if more information is required.

Any hints are much appreciated!

According to the suggestion, I provide the first 8 lines of an example file

Name            Subtype     ProcName            Material    Creator     OrigVol trklen                  prim_pos_eutags
top             0           initStep            undefined   undefined   top     0                       prim_pos_eudata
top             91          Transportation      Vacuum      undefined           2.27009877562523E-12    prim_pos_eudata
QC5L_2_v        91          Transportation      Vacuum      undefined           3.50000000000227        prim_pos_eudata
DRIFT_8609_v    91          Transportation      Vacuum      undefined           3.80000000000518        prim_pos_eudata
BC1L_2_v        23          SynRad              Vacuum      undefined           68.1607816456518        prim_pos_eudata
BC1L_2_v        91          Transportation      Vacuum      undefined           79.0910350747966        prim_pos_eudata
DRIFT_8610_v    91          Transportation      Vacuum      undefined   QC2_v   79.3910346856657        prim_pos_eudata

Upvotes: 0

Views: 171

Answers (2)

Number42
Number42

Reputation: 133

Thanks to the suggestions by Anant Mittal, the problem could be solved.

However, I did some minor changes to achieve a clean data frame with the first line as header.

input_file  = open('test.txt', "r")
data = []
for line in input_file:
    splitted_list = line.split()

    if len(splitted_list) == 8: data.append( splitted_list )
    if len(splitted_list) < 8: # 8 is number of columns, change it if it is wrong
        splitted_list.insert(5, "") # add an empty string inn the 5th column (change the number if it is wrong)
        data.append(splitted_list)

test = pd.DataFrame.from_records( data )

# first row to select header
header = test.iloc[0] 
# exclude first line 
test = test[1:] 
# reset the header
test.columns = header
test

Upvotes: 1

Anant Mittal
Anant Mittal

Reputation: 2113

The columns are separated with variable number of spaces to make the columns look nice and aligned in the text file. What you can do instead is to read the file line by line, split the line and see if the number of elements in the list generated with line.split() are less than number of columns. If yes, than add an empty string at the required index.

input_file  = open('test.txt', "rb")
data = []
for line in input_file:
    splitted_list = line.split()
    if len(splitted_list) < 8: # 8 is number of columns, change it if it is wrong
        splitted_list.insert(5, "") # add an empty string inn the 5th column (change the number if it is wrong)
        data.append(splitted_list)
df = pd.DataFrame.from_records(data)

Upvotes: 0

Related Questions