xpress_embedo
xpress_embedo

Reputation: 377

Read Data from a file with Fixed Width not at same places using Pandas

I have a log file from a device with some information for analysis, and I am using Pandas and Python for reading this file for some analysis purposes. So, let's suppose below is the content of the file.

20:03:42.769 0xAA 0x99->0x98 A200000000000000000088DD
20:03:42.769 0x11 0x00->0x00 B2
20:03:42.815 0xAA 0x98->0x99 5203001C0C1C0C0E0E
20:03:42.831 0x11 0x00->0x00 EE
20:03:50.726 0xAA 0x99->0x98 A200000000000000000088DD
20:03:50.819 0xAA 0x99->0x98 A200000000000000000088DD
20:03:50.895 0xAA 0x99->0x98 A200000000000000000088DD
20:03:54.841 0xAA 0x98->0x99 1202
20:03:54.857 0x11 0x00->0x00 E0
20:03:54.935 0xAA 0x99->0x98 120300410102000C070A00
20:03:54.935 0x11 0x00->0x00 AB
20:03:55.013 0xAA 0x99->0x98 120301050000050C000003

And I want to separate the information into individual columns, like below.

20:03:42.769 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
20:03:42.769 0x11 0x00->0x00 B2
20:03:42.815 0xAA 0x98->0x99 52 03 00 1C 0C 1C 0C 0E 0E
20:03:42.831 0x11 0x00->0x00 EE
20:03:50.726 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
20:03:50.819 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
20:03:50.895 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
20:03:54.841 0xAA 0x98->0x99 12 02
20:03:54.857 0x11 0x00->0x00 E0
20:03:54.935 0xAA 0x99->0x98 12 03 00 41 01 02 00 0C 07 0A 00 
20:03:54.935 0x11 0x00->0x00 AB
20:03:55.013 0xAA 0x99->0x98 12 03 01 05 00 00 05 0C 00 00 03 

With pd.read_csv, I can separate the data, but how to split the last column into fixed-width with 2 characters in each column, the same thing happen with pd.read_fwf

           Time Header   Src->Dest                      Data
0  20:03:42.769   0xAA  0x99->0x98  A200000000000000000088DD
1  20:03:42.769   0x11  0x00->0x00                        B2
2  20:03:42.815   0xAA  0x98->0x99        5203001C0C1C0C0E0E
3  20:03:42.831   0x11  0x00->0x00                        EE
4  20:03:50.726   0xAA  0x99->0x98  A200000000000000000088DD

The problem is how to split the last column data further with 2 characters in each column.

I also tried to do this also.

colspecs = [ (0,12), (13,17), (18,28), (29,31), (31,33) ]
df = pd.read_fwf("SampleLog.txt", colspecs = colspecs, names=["Time", "Header", "Src->Dest", "C1", "C2"])

And I get this.

           Time Header   Src->Dest  C1   C2
0  20:03:42.769   0xAA  0x99->0x98  A2  0.0
1  20:03:42.769   0x11  0x00->0x00  B2  NaN
2  20:03:42.815   0xAA  0x98->0x99  52  3.0
3  20:03:42.831   0x11  0x00->0x00  EE  NaN
4  20:03:50.726   0xAA  0x99->0x98  A2  0.0

Here it looks like this will work, but the problem is I am getting float values in column "C2" and I will also get the same behavior in other columns if I add. The reason for this behavior is due to the "NaN", which converts complete column into float/double type. I tried replacing "NaN" with something else, but not able to use it. Can someone please help me in solving this problem which I am facing?

Upvotes: 1

Views: 165

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150765

You can try the default guess and parse the data:

df = pd.read_fwf('a.csv', names=["Time", "Header", "Src->Dest", "C1"])
(df.drop('C1', axis=1)
   .join(df['C1'].str.extractall('(.{2})')[0].unstack('match'))
)

Output:

            Time    Header  Src->Dest   0   1   2   3   4   5   6   7   8   9   10  11
0   20:03:42.769    0xAA    0x99->0x98  A2  00  00  00  00  00  00  00  00  00  88  DD
1   20:03:42.769    0x11    0x00->0x00  B2  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2   20:03:42.815    0xAA    0x98->0x99  52  03  00  1C  0C  1C  0C  0E  0E  NaN NaN NaN
3   20:03:42.831    0x11    0x00->0x00  EE  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4   20:03:50.726    0xAA    0x99->0x98  A2  00  00  00  00  00  00  00  00  00  88  DD
5   20:03:50.819    0xAA    0x99->0x98  A2  00  00  00  00  00  00  00  00  00  88  DD
6   20:03:50.895    0xAA    0x99->0x98  A2  00  00  00  00  00  00  00  00  00  88  DD
7   20:03:54.841    0xAA    0x98->0x99  12  02  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8   20:03:54.857    0x11    0x00->0x00  E0  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9   20:03:54.935    0xAA    0x99->0x98  12  03  00  41  01  02  00  0C  07  0A  00  NaN
10  20:03:54.935    0x11    0x00->0x00  AB  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11  20:03:55.013    0xAA    0x99->0x98  12  03  01  05  00  00  05  0C  00  00  03  NaN

Upvotes: 2

Related Questions