ThanksForHelping
ThanksForHelping

Reputation: 161

Data processing with Python on data separated by rows and columns

I have a data that is separated by rows(date time row) as such

01-Jan-1990 00:00:01 ABCD
A  abcde fghijk lmnopq
     hsjfne qqq                     # EDITED WITH ADDITONAL SPILL OVER DATA with \t
B abcde fghijk lmnopq
01-Jan-1990 00:00:05 ABCD
A ancfjhr sfjerhj egen
C etfhw3uh uhuefwh fewvjh dfeg efwbywgefb
D wrf fcwewe fvwefwe fwef
01-Jan-1990 00:00:07 ABCD
A wfw fbebwu
B fewhuf ifgiwejhifgj fijweij

Would like to clean it in a way to separate the A, B , C, etc as shown in the 1st value after the date time row as a column and the values that come after the A,B,C as another column followed by capturing the date time and input as another column. Something like this

A,abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 #WOULD LIKE TO COMBINE THE SPILL DATA
B,abcde fghijk lmnopq, 01-Jan-1990 00:00:01
A,ancfjhr sfjerhj egen,01-Jan-1990 00:00:05
C,etfhw3uh uhuefwh fewvjh dfeg efwbywgefb,01-Jan-1990 00:00:05
D,wrf fcwewe fvwefwe fwefe,01-Jan-1990 00:00:05

etc etc etc

Would greatly appeciate if anyone can guide me along. I've try reading by doing a pattern match then to grab the following lines but unable to accomplish it.

import re
#Log Reading

log=open("IDM.txt","r")


for line in log:
    splitLine = line.split()
    iterator = iter(splitLine)
    datematch = (re.match('^(([0-9])|([0-2][0-9])|([3][0-1])- 
   (JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)-\\d{4}$',splitLine[0]))
if datematch:
    print(line)

Understand that the code as above is nothing alike to what i want to achieve therefore would like you guys to help me to guide me along and to show that i've tried some stuff. Thank you for your time

EDITED: Included the 3rd line of data to show the spill over of value from 2nd line of data with a \t tabspace before line

Upvotes: 1

Views: 194

Answers (2)

n1tk
n1tk

Reputation: 2500

Another easy approach is to use a Regular Expression: Regular Expression HOWTO and Print lists in Python

  • read from the .txt file IDM.txt
  • removed whitespace on the left side using lstrip()
  • created regular expression pattern_num for finding the match line starting with digits
  • log string formatted per OP request
  • write the final results to IDM_clean.txt

Update: Final and last solution as Generalization:

import re


pattern_num = re.compile(r'^[0-9]') # patter we look in the string

log_list = []


#for line in file_as_list:
file_as_list = []

lines = open("IDM.txt", "r").read().split("\n")
for i, line in enumerate(lines):
    if line.startswith(" "):
        lines[i-1] = lines[+1].strip() + " " + line.lstrip()
        lines.pop(i)
    logs = '\n'.join(lines)+"\n"

file_as_list = logs.splitlines()

for l in file_as_list:
    if re.match(pattern_num, l):
        datos = l
    else:
        info = l[0] + ', ' + l[1:].lstrip()
        log_list.append(info + ', ' + datos)

        log = '\n'.join(map(str, log_list))

open("IDM_clean.txt", "w").write(log+"\n") # write to the file the result       


print("-----------------------------------")
print(type(log))
print("------------------------------------------------------------------------")
print(log)#print the desired format
print("------------------------------------------------------------------------")
Out:
----------------------------------
<class 'str'>
-----------------------------------------------------------------------
A, abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 ABCD
B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01 ABCD
A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05 ABCD
C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05 ABCD
D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05 ABCD
A, wfw fbebwu, 01-Jan-1990 00:00:07 ABCD
B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07 ABCD
-----------------------------------------------------------------------

Screen from the file:

A, abcde fghijk lmnopq hsjfne qqq, 01-Jan-1990 00:00:01 ABCD
B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01 ABCD
A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05 ABCD
C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05 ABCD
D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05 ABCD
A, wfw fbebwu, 01-Jan-1990 00:00:07 ABCD
B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07 ABCD

Upvotes: 0

BernardL
BernardL

Reputation: 5434

It is always a good idea to open a file using with open(), then you can parse the lines as you want in a list, in my case I just checked to see the first 2 characters of the line is a digit, if it is, it stores the value to be added to the desired line later:

import csv
content = []

with open('IDM.txt','r') as f:
    lines = f.readlines()
    for idx,line in enumerate(lines):
        if line[:2].isdigit():
                date = line[:20]

        elif idx == len(lines)-1 or (line[0] != ' ' and lines[idx+1][0] != ' '):
            data = line[0] + ',' + line[1:].rstrip('\n') 
            content.append(data+ ', '+ date)  

        elif lines[idx+1][0] == ' ':
            spill = lines[idx+1].rstrip('\n').strip()
            data = line[0] + ',' + line[1:].rstrip('\n') + ' ' + spill
            content.append(data+ ', '+ date)

        else:
            pass


with open('IDMOutput.csv','w') as f:
    for line in content:
        f.write("%s\n" % line)

>>content
['A, abcde fghijk lmnopq hsjfne qqqqq, 01-Jan-1990 00:00:01',
 'B, abcde fghijk lmnopq, 01-Jan-1990 00:00:01',
 'A, ancfjhr sfjerhj egen, 01-Jan-1990 00:00:05',
 'C, etfhw3uh uhuefwh fewvjh dfeg efwbywgefb, 01-Jan-1990 00:00:05',
 'D, wrf fcwewe fvwefwe fwef, 01-Jan-1990 00:00:05',
 'A, wfw fbebwu, 01-Jan-1990 00:00:07',
 'B, fewhuf ifgiwejhifgj fijweij, 01-Jan-1990 00:00:07']

Edited: Added rstrip to remove '\n' and included timestamp and spill over updated in question with the output.

Upvotes: 1

Related Questions