David 54321

Reputation: 728

sed to python replace extra delimiters in a

sed 's/\t/_tab_/3g'

I have a sed command that replaces every excess tab delimiter in my text documents. The documents are supposed to have 3 columns, but occasionally a line contains extra delimiters. I don't have control over the files.

I use the above command to clean up the documents. However, all my other operations on these files are in Python. Is there a way to do the equivalent of the above sed command in Python?

sample input:

Column1   Column2         Column3
James     1,203.33        comment1
Mike      -3,434.09       testing testing 123
Sarah     1,343,342.23    there   here

sample output:

Column1   Column2         Column3
James     1,203.33        comment1
Mike      -3,434.09       testing_tab_testing_tab_123
Sarah     1,343,342.23    there_tab_here

Upvotes: 3

Views: 197

Answers (3)

Wiktor Stribiżew

Reputation: 626747

You may read the file line by line, split each line on tabs, and, if there are more than 3 items, join the third and all following items with _tab_:

lines = []
with open('inputfile.txt', 'r') as fr:
    for line in fr:
        split = line.split('\t')
        if len(split) > 3:
            tmp = split[:2]                      # Slice the first two items
            tmp.append("_tab_".join(split[2:]))  # Append the rest joined with _tab_
            lines.append("\t".join(tmp))         # Use the updated line
        else:
            lines.append(line)                   # Else, put the line as is

See the Python demo

The lines variable will then contain something like:

Mike    -3,434.09   testing_tab_testing_tab_123
Mike    -3,434.09   testing_tab_256
No  operation   here
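A shorter variant of the same idea, as a minimal sketch: str.split accepts a maxsplit argument, so the line can be split on the first two tabs only and any remaining tabs replaced inside the last field. The function name replace_extra_tabs and its columns and token parameters are made up for illustration and are not part of the answer above.

```python
def replace_extra_tabs(line, columns=3, token="_tab_"):
    """Keep the first (columns - 1) tabs and replace any later ones with token."""
    # Split on at most (columns - 1) tabs; extra tabs stay inside the last item
    parts = line.split("\t", columns - 1)
    # Replace whatever tabs are left in the last item
    parts[-1] = parts[-1].replace("\t", token)
    return "\t".join(parts)


cleaned = replace_extra_tabs("Mike\t-3,434.09\ttesting\ttesting\t123\n")
print(repr(cleaned))
```

Lines with three or fewer columns pass through unchanged, and the trailing newline is preserved because it stays in the last item.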

Upvotes: 1

dimid

Reputation: 7631

You can mimic the sed behavior in Python:

import re

pattern = re.compile(r'\t')
string = 'Mike\t3,434.09\ttesting\ttesting\t123'
replacement = '_tab_'
count = -1
spans = []
start = 2 # Starting index of matches to replace (0 based)
for match in re.finditer(pattern, string):
    count += 1
    if count >= start:
        spans.append(match.span())
spans.reverse()
new_str = string
for sp in spans:
    new_str = new_str[0:sp[0]] + replacement + new_str[sp[1]:]

And now new_str is 'Mike\t3,434.09\ttesting_tab_testing_tab_123'.

You can wrap it in a function and repeat for every line. However, note that this GNU sed behavior isn't standard:

'NUMBER' Only replace the NUMBERth match of the REGEXP.

 Note: the POSIX standard does not
 specify what should happen when you mix the 'g' and NUMBER
 modifiers, and currently there is no widely agreed upon meaning
 across 'sed' implementations.  For GNU 'sed', the interaction is
 defined to be: ignore matches before the NUMBERth, and then match
 and replace all matches from the NUMBERth on.
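The function wrapper suggested above could look roughly like this. It is only a sketch: instead of collecting spans, it passes a callback to re.sub that counts matches and starts replacing at the n-th one, which is the GNU sed meaning of NUMBER combined with g. The name sub_from_nth and its 1-based n parameter are assumptions made for illustration.

```python
import re
from itertools import count


def sub_from_nth(pattern, replacement, string, n):
    """Replace the n-th (1-based) match and every later one, leaving earlier matches alone."""
    counter = count(1)

    def repl(match):
        # Earlier matches are returned unchanged; from the n-th on, substitute
        return replacement if next(counter) >= n else match.group(0)

    return re.sub(pattern, repl, string)


line = 'Mike\t3,434.09\ttesting\ttesting\t123'
print(repr(sub_from_nth(r'\t', '_tab_', line, 3)))
```

Because the callback returns the replacement directly, it is inserted literally; backreferences such as \1 in the replacement are not expanded in this sketch.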

Upvotes: 0

Liu Tao

Reputation: 2193

import os
# file_path is assumed to hold the path of the file to clean up
os.system("sed -i 's/\t/_tab_/3g' " + file_path)

Does this work? Note that the command includes the -i option, which makes sed modify the input file in place.
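If calling out to sed is not an option, the in-place behavior of -i can be approximated in pure Python with the standard fileinput module. This is a sketch under the assumption that the files are small enough to rewrite line by line; the function name replace_extra_tabs_in_place and the keep and token parameters are hypothetical.

```python
import fileinput


def replace_extra_tabs_in_place(path, keep=2, token="_tab_"):
    """Rewrite the file at path, replacing every tab after the first `keep` on each line."""
    # With inplace=True, anything printed inside the loop is written back to the file
    with fileinput.input(files=[path], inplace=True) as stream:
        for line in stream:
            parts = line.split("\t", keep)
            parts[-1] = parts[-1].replace("\t", token)
            print("\t".join(parts), end="")

# Example call (the path is a placeholder):
# replace_extra_tabs_in_place("inputfile.txt")
```

Unlike the os.system call, this does not depend on GNU sed being installed and does not pass the file name through a shell.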

Upvotes: 0
