Reputation: 244
I managed to use pytesseract to convert an invoice image into text.
The multi-line string looks like this:
Receipt No: 20191220.001
Date: 20 December 2019
Invoice amount: $400.00
I would like to extract just the invoice number (i.e. 20191220.001) using a substring. I managed to get the start index with index = string.find('Receipt No: ')
but when I slice from that index with print(string[index:])
I got the following result:
20191220.001
Date: 20 December 2019
Invoice amount: $400.00
But I only want to extract the first line. The invoice number is not always 12 characters long; it may be longer or shorter depending on the vendor. How do I extract only the invoice number? I'm doing this to automate an accounting process.
Upvotes: 1
Views: 726
Reputation: 255
If you only care about the first line, you can use the first occurrence of the line-ending character as the end of your number. Note that the number starts at the end of the substring ("Receipt No: "), whereas find returns the index of the start of the substring.
string = '''Receipt No: 20191220.001
Date: 20 December 2019
Invoice amount: $400.00'''
sub = 'Receipt No: '
start = string.find(sub) + len(sub)
end = string.find('\n', start)  # first newline after the number starts
print(string[start:end])
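The snippet above relies on the label being present and on a newline following the number. A minimal sketch that guards both cases might look like this; value_after is a hypothetical helper name, not something from the original answer.

```python
def value_after(text, label):
    """Return the text between `label` and the end of its line, or None."""
    start = text.find(label)
    if start == -1:          # label not found at all
        return None
    start += len(label)      # skip past the label itself
    end = text.find('\n', start)
    if end == -1:            # label is on the last line, no trailing newline
        end = len(text)
    return text[start:end].strip()

string = '''Receipt No: 20191220.001
Date: 20 December 2019
Invoice amount: $400.00'''
print(value_after(string, 'Receipt No: '))
```

Because the newline search starts at start, this also works when the receipt line is not the first line of the OCR output.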
If you also care about the other lines, you can use split and process each line separately.
lines = string.split('\n')
sub = 'Receipt No: '
index = lines[0].find(sub) + len(sub)
print(lines[0][index:])
# Process line 1
# Process line 2
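One possible way to process every line, assuming each line has the form "label: value", is to collect the fields into a dictionary. The function name parse_fields is an illustrative assumption.

```python
def parse_fields(text):
    """Map each 'label: value' line to a dict entry."""
    fields = {}
    for line in text.splitlines():
        # partition splits on the first colon only, so values may contain colons
        label, sep, value = line.partition(':')
        if sep:  # skip lines that contain no colon
            fields[label.strip()] = value.strip()
    return fields

string = '''Receipt No: 20191220.001
Date: 20 December 2019
Invoice amount: $400.00'''
fields = parse_fields(string)
print(fields['Receipt No'])
```

With the fields keyed by label, the invoice number can be looked up regardless of which line it appears on.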
Upvotes: 0
Reputation: 12417
You can use split:
s = '''Receipt No: 20191220.001
Date: 20 December 2019
Invoice amount: $400.00'''
number = s.split('Receipt No: ')[1].split('\n')[0]
print(number)
Output:
20191220.001
Or, if you want to use find, you can do it this way:
index1 = s.find(':')
index2 = s.find('\n')
print(s[index1+1:index2].strip())
Upvotes: 1
Reputation: 24
You can try the split function.
with open("filename", 'r') as dataload:
    for i in dataload.readlines():
        if "Receipt No:" in i:
            print(i.split(":")[1].strip())
Output:
20191220.001
To extract a different field, change the string tested in if "Receipt No:" in i: to the label you need.
Upvotes: 0
Reputation: 369
Split your string into a list on "\n". Each part of the string separated by a newline becomes a list element, and you can then take the part you want.
string = """Receipt No: 20191220.001
Date: 20 December 2019
Invoice amount: $400.00"""
your_list = string.split("\n")
data = your_list[0]  # 'Receipt No: 20191220.001' (label still included)
Upvotes: 0
Reputation: 4215
Try:
import re
s = """
Receipt No: 20191220.001
Date: 20 December 2019
Invoice amount: $400.00"""
p = re.compile(r"Receipt No: (\d+\.\d+)")  # raw string; the dot is escaped to match a literal '.'
result = p.search(s)
index = result.group(1) #'20191220.001'
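Since the question says the number format varies by vendor, a looser pattern may be more suitable than digits, dot, digits. The following is a sketch under the assumption that the invoice number contains no whitespace; extract_receipt_no is a hypothetical helper name.

```python
import re

def extract_receipt_no(text):
    """Return the non-whitespace token after 'Receipt No:', or None."""
    # [ \t]* allows spaces or tabs after the colon but never crosses a newline
    match = re.search(r"Receipt No:[ \t]*(\S+)", text)
    return match.group(1) if match else None

s = """
Receipt No: 20191220.001
Date: 20 December 2019
Invoice amount: $400.00"""
print(extract_receipt_no(s))
```

Returning None when nothing matches avoids the AttributeError that calling group on a failed search would raise.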
Upvotes: 0