Reputation: 161
I have a data that is separated by rows(date time row) as such
01-Jan-1990 00:00:01 ABCD
A abcde fghijk lmnopq
hsjfne qqq # EDITED WITH ADDITONAL SPILL OVER DATA with \t
B abcde fghijk lmnopq
01-Jan-1990 00:00:05 ABCD
A ancfjhr sfjerhj egen
C etfhw3uh uhuefwh fewvjh dfeg efwbywgefb
D wrf fcwewe fvwefwe fwef
01-Jan-1990 00:00:07 ABCD
A wfw fbebwu
B fewhuf ifgiwejhifgj fijweij
Would like to clean it in a way to separate the A, B , C, etc as shown in the 1st value after the date time row as a column and the values that come after the A,B,C as another column followed by capturing the date time and input as another column. Something like this
A,abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 #WOULD LIKE TO COMBINE THE SPILL DATA
B,abcde fghijk lmnopq, 01-Jan-1990 00:00:01
A,ancfjhr sfjerhj egen,01-Jan-1990 00:00:05
C,etfhw3uh uhuefwh fewvjh dfeg efwbywgefb,01-Jan-1990 00:00:05
D,wrf fcwewe fvwefwe fwefe,01-Jan-1990 00:00:05
etc etc etc
Would greatly appeciate if anyone can guide me along. I've try reading by doing a pattern match then to grab the following lines but unable to accomplish it.
import re
#Log Reading
log=open("IDM.txt","r")
for line in log:
splitLine = line.split()
iterator = iter(splitLine)
datematch = (re.match('^(([0-9])|([0-2][0-9])|([3][0-1])-
(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)-\\d{4}$',splitLine[0]))
if datematch:
print(line)
Understand that the code as above is nothing alike to what i want to achieve therefore would like you guys to help me to guide me along and to show that i've tried some stuff. Thank you for your time
EDITED: Included the 3rd line of data to show the spill over of value from 2nd line of data with a \t tabspace before line
Upvotes: 1
Views: 194
Reputation: 2500
Another easy approach is to use a Regular Expression: Regular Expression HOWTO and Print lists in Python
.txt
file IDM.txt
lstrip()
pattern_num
for finding the match line starting with digitslog
string formatted per OP requestIDM_clean.txt
Update: Final and last solution as
Generalization
:
import re
pattern_num = re.compile(r'^[0-9]') # patter we look in the string
log_list = []
#for line in file_as_list:
file_as_list = []
lines = open("IDM.txt", "r").read().split("\n")
for i, line in enumerate(lines):
if line.startswith(" "):
lines[i-1] = lines[+1].strip() + " " + line.lstrip()
lines.pop(i)
logs = '\n'.join(lines)+"\n"
file_as_list = logs.splitlines()
for l in file_as_list:
if re.match(pattern_num, l):
datos = l
else:
info = l[0] + ', ' + l[1:].lstrip()
log_list.append(info + ', ' + datos)
log = '\n'.join(map(str, log_list))
open("IDM_clean.txt", "w").write(log+"\n") # write to the file the result
print("-----------------------------------")
print(type(log))
print("------------------------------------------------------------------------")
print(log)#print the desired format
print("------------------------------------------------------------------------")
Out: ---------------------------------- <class 'str'> ----------------------------------------------------------------------- A, abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 ABCD B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01 ABCD A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05 ABCD C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05 ABCD D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05 ABCD A, wfw fbebwu, 01-Jan-1990 00:00:07 ABCD B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07 ABCD -----------------------------------------------------------------------
Screen from the file:
A, abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 ABCD
B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01 ABCD
A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05 ABCD
C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05 ABCD
D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05 ABCD
A, wfw fbebwu, 01-Jan-1990 00:00:07 ABCD
B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07 ABCD
Upvotes: 0
Reputation: 5434
It is always a good idea to open a file using with open()
, then you can parse the lines as you want in a list, in my case I just checked to see the first 2 characters of the line is a digit, if it is, it stores the value to be added to the desired line later:
import csv
content = []
with open('IDM.txt','r') as f:
lines = f.readlines()
for idx,line in enumerate(lines):
if line[:2].isdigit():
date = line[:20]
elif idx == len(lines)-1 or (line[0] != ' ' and lines[idx+1][0] != ' '):
data = line[0] + ',' + line[1:].rstrip('\n')
content.append(data+ ', '+ date)
elif lines[idx+1][0] == ' ':
spill = lines[idx+1].rstrip('\n').strip()
data = line[0] + ',' + line[1:].rstrip('\n') + ' ' + spill
content.append(data+ ', '+ date)
else:
pass
with open('IDMOutput.csv','w') as f:
for line in content:
f.write("%s\n" % line)
>>content
['A, abcde fghijk lmnopq hsjfne qqqqq, 01-Jan-1990 00:00:01',
'B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01',
'A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05',
'C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05',
'D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05',
'A, wfw fbebwu, 01-Jan-1990 00:00:07',
'B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07']
Edited: Added rstrip
to remove '\n'
and included timestamp
and spill over updated in question with the output.
Upvotes: 1