Reputation: 377
I have a log file from a device with some information for analysis, and I am using Pandas and Python for reading this file for some analysis purposes. So, let's suppose below is the content of the file.
20:03:42.769 0xAA 0x99->0x98 A200000000000000000088DD
20:03:42.769 0x11 0x00->0x00 B2
20:03:42.815 0xAA 0x98->0x99 5203001C0C1C0C0E0E
20:03:42.831 0x11 0x00->0x00 EE
20:03:50.726 0xAA 0x99->0x98 A200000000000000000088DD
20:03:50.819 0xAA 0x99->0x98 A200000000000000000088DD
20:03:50.895 0xAA 0x99->0x98 A200000000000000000088DD
20:03:54.841 0xAA 0x98->0x99 1202
20:03:54.857 0x11 0x00->0x00 E0
20:03:54.935 0xAA 0x99->0x98 120300410102000C070A00
20:03:54.935 0x11 0x00->0x00 AB
20:03:55.013 0xAA 0x99->0x98 120301050000050C000003
And I want to separate the information into individual columns, like below.
20:03:42.769 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
20:03:42.769 0x11 0x00->0x00 B2
20:03:42.815 0xAA 0x98->0x99 52 03 00 1C 0C 1C 0C 0E 0E
20:03:42.831 0x11 0x00->0x00 EE
20:03:50.726 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
20:03:50.819 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
20:03:50.895 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
20:03:54.841 0xAA 0x98->0x99 12 02
20:03:54.857 0x11 0x00->0x00 E0
20:03:54.935 0xAA 0x99->0x98 12 03 00 41 01 02 00 0C 07 0A 00
20:03:54.935 0x11 0x00->0x00 AB
20:03:55.013 0xAA 0x99->0x98 12 03 01 05 00 00 05 0C 00 00 03
With pd.read_csv, I can separate the data, but how to split the last column into fixed-width with 2 characters in each column, the same thing happen with pd.read_fwf
Time Header Src->Dest Data
0 20:03:42.769 0xAA 0x99->0x98 A200000000000000000088DD
1 20:03:42.769 0x11 0x00->0x00 B2
2 20:03:42.815 0xAA 0x98->0x99 5203001C0C1C0C0E0E
3 20:03:42.831 0x11 0x00->0x00 EE
4 20:03:50.726 0xAA 0x99->0x98 A200000000000000000088DD
The problem is how to split the last column data further with 2 characters in each column.
I also tried to do this also.
colspecs = [ (0,12), (13,17), (18,28), (29,31), (31,33) ]
df = pd.read_fwf("SampleLog.txt", colspecs = colspecs, names=["Time", "Header", "Src->Dest", "C1", "C2"])
And I get this.
Time Header Src->Dest C1 C2
0 20:03:42.769 0xAA 0x99->0x98 A2 0.0
1 20:03:42.769 0x11 0x00->0x00 B2 NaN
2 20:03:42.815 0xAA 0x98->0x99 52 3.0
3 20:03:42.831 0x11 0x00->0x00 EE NaN
4 20:03:50.726 0xAA 0x99->0x98 A2 0.0
Here it looks like this will work, but the problem is I am getting float values in column "C2" and I will also get the same behavior in other columns if I add. The reason for this behavior is due to the "NaN", which converts complete column into float/double type. I tried replacing "NaN" with something else, but not able to use it. Can someone please help me in solving this problem which I am facing?
Upvotes: 1
Views: 165
Reputation: 150765
You can try the default guess and parse the data:
df = pd.read_fwf('a.csv', names=["Time", "Header", "Src->Dest", "C1"])
(df.drop('C1', axis=1)
.join(df['C1'].str.extractall('(.{2})')[0].unstack('match'))
)
Output:
Time Header Src->Dest 0 1 2 3 4 5 6 7 8 9 10 11
0 20:03:42.769 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
1 20:03:42.769 0x11 0x00->0x00 B2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 20:03:42.815 0xAA 0x98->0x99 52 03 00 1C 0C 1C 0C 0E 0E NaN NaN NaN
3 20:03:42.831 0x11 0x00->0x00 EE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 20:03:50.726 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
5 20:03:50.819 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
6 20:03:50.895 0xAA 0x99->0x98 A2 00 00 00 00 00 00 00 00 00 88 DD
7 20:03:54.841 0xAA 0x98->0x99 12 02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 20:03:54.857 0x11 0x00->0x00 E0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 20:03:54.935 0xAA 0x99->0x98 12 03 00 41 01 02 00 0C 07 0A 00 NaN
10 20:03:54.935 0x11 0x00->0x00 AB NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 20:03:55.013 0xAA 0x99->0x98 12 03 01 05 00 00 05 0C 00 00 03 NaN
Upvotes: 2