seanarcher7
seanarcher7

Reputation: 35

Python to remove extra delimiter

We have a 100MB pipe delimited file that has 5 column/4 delimiters each separated by a pipe. However there are few rows where the second column has an extra pipe. For these few rows total delimiter are 5.

For example, in the below 4 rows, the 3rd is a problematic one as it has an extra pipe.

1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6

Is there any way we can remove an extra pipe from the second position where the delimiter count for the row is 5. So, post correction, the file needs to look like below.

1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6

Please note that the file size is 100 MB. Any help is appreciated.

Upvotes: 0

Views: 1131

Answers (2)

will f
will f

Reputation: 483

A simple regular expression pattern like this works on Python 3.7.3:

from re import compile


bad_pipe_re = compile(r"[ \w]+\|[ \w]+(\|)[ \w]+\|[ \w]+\|[ \w]+\|[ \w]+\n")


with open("input", "r") as fp_1, open("output", "w") as fp_2:
    line = fp_1.readline()
    while line is not "":
        mo = bad_pipe_re.fullmatch(line)
        if mo is not None:
            line = line[:mo.start(1)] + line[mo.end(1):]
        fp_2.write(line)
        line = fp_1.readline()

Upvotes: 1

Niel Godfrey P. Ponciano
Niel Godfrey P. Ponciano

Reputation: 10709

Source: my_file.txt

1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
E|1 |9 |2 |8 |Not| a |text|!!!|3|7|4

Code

# If using Python3.10, this can be Parenthesized context managers
# https://docs.python.org/3.10/whatsnew/3.10.html#parenthesized-context-managers
with open('./my_file.txt') as file_src, open('./my_file_parsed.txt', 'w') as file_dst:
    for line in file_src.readlines():
        # Split the line by the character '|'
        line_list = line.split('|')

        if len(line_list) <= 5:
            # If the number of columns doesn't exceed, just write the original line as is.
            file_dst.write(line)
        else:
            # If the number of columns exceeds, count the number of columns that should be merged.
            to_merge_columns_count = (len(line_list) - 5) + 1
            # Merge the columns from index 1 to index x which includes all the columns to be merged.
            merged_column = "".join(line_list[1:1+to_merge_columns_count])
            # Replace all the items from index 1 to index x with the single merged column
            line_list[1:1+to_merge_columns_count] = [merged_column]

            # Write the updated line.
            file_dst.write("|".join(line_list))

Result: my_file_parsed.txt

1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
E|1 9 2 8 Not a text!!!|3|7|4

Upvotes: 1

Related Questions