Reputation: 41
I'm looking to find if there is a way of automating this process. Basically I have 300,000 rows of data needed to download on a daily basis. There are a couple of rows that need to be edited before it can be uploaded to SQL.
Jordan || Michael | 23 | Bulls | Chicago
Bryant | Kobe ||| 8 || LA
What I want to accomplish is to just have 4 vertical bars per row. Normally, I would search for a keyword then edit it manually then save. These two are the only anomalies in my data.
Correct format is below -
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA
Not sure if this can be done in vbscript or Python. Any help would be appreciated. Thanks!
Upvotes: 2
Views: 79
Reputation: 113814
Python or vbscript could be used but they are overkill for something this simple. Try sed
:
$ sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA
To save to a new file:
sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file >newfile
Or, to change the existing file in-place:
sed -Ei.bak 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file
sed reads and processes a file line by line. In our case, we need only the substitute command which has the form s/old/new/g
where old
is a regular expression and, if it is found, it is replaced by new
. The optional g
at the end of the command tells sed to perform the substitution command 'globally', meaning not just once but as many times as it appears on the line.
s/(Jordan *)\|/\1/g
This tells sed to look for Jordan followed by zero or more spaces followed by a vertical bar and remove the vertical bar.
In more detail, the parens in (Jordan *)
tell sed to save the string Jordan followed by zero or more spaces as a group. In the replacement side, we reference that group as \1
.
s/(Kobe *)\| *\|/\1/g
Similarly, this tells sed to look for Kobe followed by zero or more spaces followed by a vertical bar and remove the vertical bar.
Using the same logic as above, here is a python program:
$ cat kobe.py
import re
with open('file') as f:
for line in f:
line = re.sub(r'(Jordan *)\|', r'\1', line)
line = re.sub(r'(Kobe *)\| *\|', r'\1', line)
print(line.rstrip('\n'))
$ python kobe.py
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA
To save that to a new file:
python kobe.py >newfile
Upvotes: 2
Reputation: 630
I wrote a code snippet in Python 3.5 as follows.
# -*- coding: utf-8 -*-
rows = ["Jordan||Michael|23|Bulls|Chicago",
"Bryant|Kobe|||8||LA"]
keywords = ["Jordan", "Kobe"]
def get_keyword(row, keywords):
for word in keywords:
if word in row:
return word
else:
return None
for line in rows:
num_bars = line.count('|')
num_bars_del = num_bars - 4 # Number of bars to be deleted
kw = get_keyword(line, keywords)
if kw: # this line contains a keyword
# Split the line by the keyword
first, second = line.split(kw)
second = second.lstrip()
result = "%s%s%s"%(first, kw, second[num_bars_del:])
print(result)
Upvotes: 1