ryoishikawa74
ryoishikawa74

Reputation: 177

Split text using any suggested method

I have a plain text like this:

Cart ID: A3N42M / Copy: A3N42P PO: 5000021337 Invoice: 3110021337
Cart ID: A3N3ZW / Copy: A3N3ZX/ PO: 5000021335 Invoice: 3110021335 
Cart ID: A3N3ZL / Copy: A3N3ZM PO: 5000021336 Invoice: 3110021336 
Original: A3N444 / Copy: A3N445 PO: 5000021340 Invoice: 3130021340
Original: A3N44C / Copy: A3N44D PO: 5000021341 Invoice: 3130021341
Original: A3N44G / Copy: A3N44H PO: 5000021342 Invoice: 3130021342
Cart ID: A3N3ZZ / Copy: A3N428 PO: A3N3ZZ01
Cart ID: A3N3ZQ / Copy: A3N3ZV PO: A3N3ZQ01
Cart ID: A3N336 / Copy: A3N337 PO: A3N33601
E3M49D / E3M49Q - PO: 4620028049
E3M49N / E3M49X
E3M49P / E3M49Y

And I need to split the cart IDs from that text, and end up in something like this:

A3N42M
A3N42P
A3N3ZW
...
E3M49N
...

These IDs are always 6-digit length, and always starts with A, E or P (AXXXXX, EXXXXX, PXXXXX, etc...).

Is there any way (using any script or any program language) that I can achieve this?

This plain text is currently on a spreadsheet table, and I need to separate these IDs to use in a SQL query for later, thanks!

Upvotes: 0

Views: 45

Answers (1)

tmthydvnprt
tmthydvnprt

Reputation: 10748

Regex Data-Munging


You can perform this via regex substitutions (or find and replace). I happened to do this in python, but you could use the same regex patterns ('^[^/]*?: ', '/.*') in any language or text editor that supports them.

Regex Explanation

  1. '^[^/]*?: ' - start and beginning of string (first ^), match multiple non-\ characters in a non greedy way ([^/]*?]), a :, then a .
  2. '/.*' - match all /, then multiple any character (.)

Processing Example (in Python)

import re

text = '''
Cart ID: A3N42M / Copy: A3N42P PO: 5000021337 Invoice: 3110021337
Cart ID: A3N3ZW / Copy: A3N3ZX/ PO: 5000021335 Invoice: 3110021335 
Cart ID: A3N3ZL / Copy: A3N3ZM PO: 5000021336 Invoice: 3110021336 
Original: A3N444 / Copy: A3N445 PO: 5000021340 Invoice: 3130021340
Original: A3N44C / Copy: A3N44D PO: 5000021341 Invoice: 3130021341
Original: A3N44G / Copy: A3N44H PO: 5000021342 Invoice: 3130021342
Cart ID: A3N3ZZ / Copy: A3N428 PO: A3N3ZZ01
Cart ID: A3N3ZQ / Copy: A3N3ZV PO: A3N3ZQ01
Cart ID: A3N336 / Copy: A3N337 PO: A3N33601
E3M49D / E3M49Q - PO: 4620028049
E3M49N / E3M49X
E3M49P / E3M49Y
'''

text = re.sub('^[^/]*?: ([]*?)', '', text, flags=re.MULTILINE)
text = re.sub('/.*', '', text)

print text

A3N42M 
A3N3ZW 
A3N3ZL 
A3N444 
A3N44C 
A3N44G 
A3N3ZZ 
A3N3ZQ 
A3N336 
E3M49D 
E3M49N 
E3M49P

EDIT


Updated Regex Explanation

Updated regex per author's request.

  1. match any character (.*?) (non-greedy), followed by a capture group of ((...)) of a character class ([AEP]) followed by at least one character class ([0-9]+) followed by four word characters (\w{4}), followed by another capture group of any character ((.*))
  2. All of this is matched and replaced with the capture group variables with a newline in between (\1\n\2), essentially splitting rows where IDs occur twice
  3. Repeat steps with only the first capture group to handle the newlines containing the second ID

Updated Processing

text = re.sub(r'.*?([AEP][0-9]+\w{4})(.*)', r'\1\n\2', text, flags=re.MULTILINE)
text = re.sub(r'.*?([AEP][0-9]+\w{4}).*', r'\1', text, flags=re.MULTILINE)

print text

A3N42M
A3N42P
A3N3ZW
A3N3ZX
A3N3ZL
A3N3ZM
A3N444
A3N445
A3N44C
A3N44D
A3N44G
A3N44H
A3N3ZZ
A3N428
A3N3ZQ
A3N3ZV
A3N336
A3N337
E3M49D
E3M49Q
E3M49N
E3M49X
E3M49P
E3M49Y

Upvotes: 1

Related Questions