Alex
Alex

Reputation: 146

Getting pandas.read_csv to handle whitespace in first column of .txt

I'm having difficulties getting pandas to read a .txt-file whose first column contains sections of data, leaving some of its rows blank. The raw data is separated by multiple whitespaces of different lengths, and has format:

Sensor  Value  Name  Case 
sec01   379    S01   H4   
sec02   1183   S02   A1   
        948    S03   A1   
        6948   S04   A1   
        2      S05   C1   
sec03   -722   S06   E1   
        -466   S07   B1   
        -289   S08   F1   
        -105   S09   F1   

Where the sensor column is filled with whitespace for indentation where there is no "sensor" cell. Passing:

df = pd.read_csv(filename,
                 sep = '\s+',
                 header = None,
                 skipinitialspace = False,
                 encoding = 'unicode_escape',
                 engine = 'python'
                 )

, results in output of the form:

Sensor  Value  Name  Case 
sec01     379   S01    H4   
sec02    1183   S02    A1   
948       S03    A1  None
6948      S04    A1  None
2         S05    C1  None
sec03    -722   S06    E1   
-466      S07    B1  None
-289      S08    F1  None
-105      S09    F1  None

Where columns get mixed up. The desired output is to have None or NaN for missing "sensor" cells. What would be the correct separator/delimiter input in this case? Or does .txt-file need to be rearranged beforehand?

Upvotes: 0

Views: 366

Answers (2)

AlexAMC
AlexAMC

Reputation: 352

Your data is fixed-width so you can use the pd.read_fwf function to read in the data.

df = pd.read_fwf(filename)

Pretty useful read in method if you find yourself with well-structured data- Docs

Upvotes: 1

foglerit
foglerit

Reputation: 8279

You can use pandas read_fwf to read the data as fixed-width formatted lines and then forward-fill the null values in Sensor:

import pandas as pd
from io import StringIO

data = """Sensor  Value  Name  Case 
sec01   379    S01   H4   
sec02   1183   S02   A1   
        948    S03   A1   
        6948   S04   A1   
        2      S05   C1   
sec03   -722   S06   E1   
        -466   S07   B1   
        -289   S08   F1   
        -105   S09   F1   """

df = pd.read_fwf(StringIO(data))
df["Sensor"] = df.Sensor.fillna(method="ffill")
df

Result:

  Sensor  Value Name Case
0  sec01    379  S01   H4
1  sec02   1183  S02   A1
2  sec02    948  S03   A1
3  sec02   6948  S04   A1
4  sec02      2  S05   C1
5  sec03   -722  S06   E1
6  sec03   -466  S07   B1
7  sec03   -289  S08   F1
8  sec03   -105  S09   F1

Upvotes: 3

Related Questions