Reputation: 177
I have a plain text like this:
Cart ID: A3N42M / Copy: A3N42P PO: 5000021337 Invoice: 3110021337
Cart ID: A3N3ZW / Copy: A3N3ZX/ PO: 5000021335 Invoice: 3110021335
Cart ID: A3N3ZL / Copy: A3N3ZM PO: 5000021336 Invoice: 3110021336
Original: A3N444 / Copy: A3N445 PO: 5000021340 Invoice: 3130021340
Original: A3N44C / Copy: A3N44D PO: 5000021341 Invoice: 3130021341
Original: A3N44G / Copy: A3N44H PO: 5000021342 Invoice: 3130021342
Cart ID: A3N3ZZ / Copy: A3N428 PO: A3N3ZZ01
Cart ID: A3N3ZQ / Copy: A3N3ZV PO: A3N3ZQ01
Cart ID: A3N336 / Copy: A3N337 PO: A3N33601
E3M49D / E3M49Q - PO: 4620028049
E3M49N / E3M49X
E3M49P / E3M49Y
And I need to split the cart IDs from that text, and end up in something like this:
A3N42M
A3N42P
A3N3ZW
...
E3M49N
...
These IDs are always 6-digit length, and always starts with A
, E
or P
(AXXXXX
, EXXXXX
, PXXXXX
, etc...).
Is there any way (using any script or any program language) that I can achieve this?
This plain text is currently on a spreadsheet table, and I need to separate these IDs to use in a SQL query for later, thanks!
Upvotes: 0
Views: 45
Reputation: 10748
You can perform this via regex
substitutions (or find and replace). I happened to do this in python, but you could use the same regex patterns ('^[^/]*?: '
, '/.*'
) in any language or text editor that supports them.
'^[^/]*?: '
- start and beginning of string (first ^
), match multiple non-\
characters in a non greedy way ([^/]*?]
), a :
, then a
.'/.*'
- match all /
, then multiple any character (.
)import re
text = '''
Cart ID: A3N42M / Copy: A3N42P PO: 5000021337 Invoice: 3110021337
Cart ID: A3N3ZW / Copy: A3N3ZX/ PO: 5000021335 Invoice: 3110021335
Cart ID: A3N3ZL / Copy: A3N3ZM PO: 5000021336 Invoice: 3110021336
Original: A3N444 / Copy: A3N445 PO: 5000021340 Invoice: 3130021340
Original: A3N44C / Copy: A3N44D PO: 5000021341 Invoice: 3130021341
Original: A3N44G / Copy: A3N44H PO: 5000021342 Invoice: 3130021342
Cart ID: A3N3ZZ / Copy: A3N428 PO: A3N3ZZ01
Cart ID: A3N3ZQ / Copy: A3N3ZV PO: A3N3ZQ01
Cart ID: A3N336 / Copy: A3N337 PO: A3N33601
E3M49D / E3M49Q - PO: 4620028049
E3M49N / E3M49X
E3M49P / E3M49Y
'''
text = re.sub('^[^/]*?: ([]*?)', '', text, flags=re.MULTILINE)
text = re.sub('/.*', '', text)
print text
A3N42M
A3N3ZW
A3N3ZL
A3N444
A3N44C
A3N44G
A3N3ZZ
A3N3ZQ
A3N336
E3M49D
E3M49N
E3M49P
Updated regex
per author's request.
.*?
) (non-greedy), followed by a capture group of ((...)
) of a character class ([AEP]
) followed by at least one character class ([0-9]+
) followed by four word characters (\w{4}
), followed by another capture group of any character ((.*)
)\1\n\2
), essentially splitting rows where IDs occur twicetext = re.sub(r'.*?([AEP][0-9]+\w{4})(.*)', r'\1\n\2', text, flags=re.MULTILINE)
text = re.sub(r'.*?([AEP][0-9]+\w{4}).*', r'\1', text, flags=re.MULTILINE)
print text
A3N42M
A3N42P
A3N3ZW
A3N3ZX
A3N3ZL
A3N3ZM
A3N444
A3N445
A3N44C
A3N44D
A3N44G
A3N44H
A3N3ZZ
A3N428
A3N3ZQ
A3N3ZV
A3N336
A3N337
E3M49D
E3M49Q
E3M49N
E3M49X
E3M49P
E3M49Y
Upvotes: 1