James Vici
James Vici

Reputation: 41

Find and Edit Text File

I'm looking to find if there is a way of automating this process. Basically I have 300,000 rows of data needed to download on a daily basis. There are a couple of rows that need to be edited before it can be uploaded to SQL.

Jordan || Michael | 23 | Bulls | Chicago

Bryant | Kobe ||| 8 || LA

What I want to accomplish is to just have 4 vertical bars per row. Normally, I would search for a keyword then edit it manually then save. These two are the only anomalies in my data.

  1. Find "Jordan", then remove the excess 1 vertical bar "|" right after it.
  2. I need to find "Kobe", then remove the two excess vertical bars "|" right after it.

Correct format is below -

Jordan | Michael | 23 | Bulls | Chicago

Bryant | Kobe | 8 || LA

Not sure if this can be done in vbscript or Python. Any help would be appreciated. Thanks!

Upvotes: 2

Views: 79

Answers (2)

John1024
John1024

Reputation: 113814

Python or vbscript could be used but they are overkill for something this simple. Try sed:

$ sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file 
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA

To save to a new file:

sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file >newfile

Or, to change the existing file in-place:

sed -Ei.bak 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file 

How it works

sed reads and processes a file line by line. In our case, we need only the substitute command which has the form s/old/new/g where old is a regular expression and, if it is found, it is replaced by new. The optional g at the end of the command tells sed to perform the substitution command 'globally', meaning not just once but as many times as it appears on the line.

  • s/(Jordan *)\|/\1/g

    This tells sed to look for Jordan followed by zero or more spaces followed by a vertical bar and remove the vertical bar.

    In more detail, the parens in (Jordan *) tell sed to save the string Jordan followed by zero or more spaces as a group. In the replacement side, we reference that group as \1.

  • s/(Kobe *)\| *\|/\1/g

    Similarly, this tells sed to look for Kobe followed by zero or more spaces followed by a vertical bar and remove the vertical bar.

Using python

Using the same logic as above, here is a python program:

$ cat kobe.py
import re
with open('file') as f:
    for line in f:
        line = re.sub(r'(Jordan *)\|', r'\1', line)
        line = re.sub(r'(Kobe *)\| *\|', r'\1', line)
        print(line.rstrip('\n'))
$ python kobe.py
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA

To save that to a new file:

python kobe.py >newfile

Upvotes: 2

Daewon Lee
Daewon Lee

Reputation: 630

I wrote a code snippet in Python 3.5 as follows.

# -*- coding: utf-8 -*-

rows = ["Jordan||Michael|23|Bulls|Chicago",
        "Bryant|Kobe|||8||LA"]

keywords = ["Jordan", "Kobe"]        

def get_keyword(row, keywords):
    for word in keywords:
        if word in row:
            return word
    else:
        return None            

for line in rows:
    num_bars = line.count('|')
    num_bars_del = num_bars - 4  # Number of bars to be deleted
    kw = get_keyword(line, keywords)
    if kw:  # this line contains a keyword
        # Split the line by the keyword
        first, second = line.split(kw)
        second = second.lstrip()
        result = "%s%s%s"%(first, kw, second[num_bars_del:])
        print(result)

Upvotes: 1

Related Questions