Reputation: 2897
I am trying to get a reference number inside a string, which in most cases is preceded by "Ref." or something similar.
e.g.:
Explorer II Ref.16570 Box
The problem is that there are many different variations, as this is user-generated content. How could I retrieve the number with Python when it is preceded by e.g. "Ref."?
The number/string does not always follow the same pattern (e.g. digits only). It might be mixed with letters, dots and slashes, but to the human eye there is almost always such a number identifiable in each line.
E.g.:
Ref.16570
Ref. 16570
Referenz 216570
Referenz 01 733 7653 4159-07 4 26
331.12.42.51.01.002
166.0173
AB012012/BB01
Ref. 167.021
PAM00292
14000M
L3.642.4.56.6
161.559.50
801
666
753
116400GV
Ref.: 231.10.39.21.03.002
3233
Ref: 233.32.41.21.01.002
T081.420.97.057.01
16750
... almost every line in the example provided contains some kind of ID.
A small amount of false positives would not be a problem.
Upvotes: 1
Views: 6282
Reputation:
Try the following code. It collects all the data after Ref until one of the predefined stoppers. Stoppers are used because the question does not give a clear definition of what counts as a reference ("not always the same pattern", "might be mixed with", "for a human eye there is almost always"). I guess additional processing of the matches is needed to extract the actual references more accurately.
import re
# Keyword, optional spaces, lazily captured value, then a stopper that ends the value.
ref_re = re.compile(r'(?P<ref_keyword>Referenz|Ref\.|Ref)[ ]*(?P<ref_value>.*?)(?P<ref_stopper> - | / |,|\n)')
with open('1.txt', mode='r', encoding='UTF-8') as file:
    data = file.read()
for match in ref_re.finditer(data):
    print('key:', match.group('ref_keyword'))
    print('value:', match.group('ref_value'))
    # print('stopper:', match.group('ref_stopper'))
Output starts with the lines:
key: Ref.
value: 16570 Box&Papiere mit Revision
key: Ref.
value: 16570 Box&Papiere mit Revision
key: Referenz
value: 216570 mit schwarzem Zifferblatt
key: Referenz
value: 01 733 7653 4159-07 4 26 34EB
key: Ref.
value: 167.021
key: Ref.
value: 3527
key: Referenz
value: 01 733 7653 4159-07 4 26 34EB
key: Ref.
value: 16570 Box&Papiere mit Revision
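One possible form of that additional processing, as a rough sketch only: keep the leading run of whitespace-separated tokens that contain at least one digit and drop everything from the first digit-free word onwards. The helper name `trim_reference` is made up for illustration, and the heuristic is an assumption about the data, not something the question defines.

```python
def trim_reference(value):
    """Keep the leading tokens that contain a digit; stop at the first one that does not."""
    kept = []
    for token in value.split():
        if not any(ch.isdigit() for ch in token):
            break  # first plain word ends the reference
        kept.append(token)
    return " ".join(kept)


print(trim_reference("16570 Box&Papiere mit Revision"))    # 16570
print(trim_reference("216570 mit schwarzem Zifferblatt"))  # 216570
```

Trailing tokens that happen to contain a digit (such as 34EB in the output above) are kept, so a few false positives remain.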
Upvotes: 0
Reputation: 49
This ought to do the trick:
import re
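text = 'Explorer II Ref.16570 Box'
# re.search scans the whole string; re.match would only match at the start
m = re.search(r'Ref\.[0-9]+', text)
if m:
    print(m.group(0)[4:])  # drop the leading "Ref."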
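A variant of the same idea, sketched with a capture group instead of slicing off the prefix. It also shows why `re.search` is the call to use for this sample: `re.match` only matches at the beginning of the string, and here the keyword sits in the middle. The variable name `text` is just illustrative.

```python
import re

text = 'Explorer II Ref.16570 Box'

# re.match is anchored at position 0, so it finds nothing in this string.
anchored = re.match(r'Ref\.([0-9]+)', text)
print(anchored)  # None

# re.search scans the whole string; group(1) holds only the digits.
m = re.search(r'Ref\.([0-9]+)', text)
if m:
    print(m.group(1))  # 16570
```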
Upvotes: 0
Reputation: 98861
Not totally sure if you need to match or extract, but Ref\.?([ \d.]+) will extract any digits after Ref (case insensitive), i.e.:
import re
# subject holds the text to search, e.g. the contents of the input file
result = re.findall(r"Ref\.?([ \d.]+)", subject, re.IGNORECASE | re.MULTILINE)
['16570', '16570', '167.021', '3527']
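If the other keyword variants from the question (Ref, Ref., Ref:, Ref.: and Referenz) and values containing letters, slashes or hyphens matter, the pattern could be widened along these lines. This is a minimal sketch, not a definitive solution: the names `REF_RE` and `extract_refs` are made up here, and the rule "the value must contain at least one digit" is an assumption.

```python
import re

REF_RE = re.compile(
    r"\bRef(?:erenz)?(?![A-Za-z])"  # keyword Ref or Referenz, not part of a longer word
    r"[ \t]*[.:]*[ \t]*"            # optional ".", ":" or ".:" with optional spaces
    r"([A-Za-z]*\d[\w./-]*)",       # value: optional letters, a digit, then word chars . / -
    re.IGNORECASE,
)


def extract_refs(text):
    """Return every value that follows a Ref-like keyword, without trailing punctuation."""
    return [m.group(1).rstrip("./-") for m in REF_RE.finditer(text)]


sample = (
    "Explorer II Ref.16570 Box\n"
    "Referenz 216570\n"
    "Ref.: 231.10.39.21.03.002\n"
    "Ref: 233.32.41.21.01.002\n"
)
print(extract_refs(sample))
# ['16570', '216570', '231.10.39.21.03.002', '233.32.41.21.01.002']
```

The value stops at the first space, so a space-separated reference such as 01 733 7653 4159-07 4 26 is cut to its first group, and lines without any keyword (e.g. PAM00292) are not matched at all.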
Upvotes: 1