MRHT

Reputation: 35

Split and use different parts of string using index or split-function

I'm having trouble working out how (or even whether it's possible) to split a string into different parts and sort them by their characteristics. Let me explain!

string = "Avititapin Kora 100 mg Coated tablet"

The string will always consist of three parts, although their content changes. "Avititapin" is the name of a product and will change every time I run the script on a different file. The same goes for "Kora": it's an extension of the name, and is sometimes part of the name and sometimes not. The next part of the string I'd like separated is "100 mg". This indicates the strength of the product, and it too will change. However, the "mg" part will almost never change, and the strength will always be a number (not really an integer or float, as it's in a string). I've used this code to split the string using "mg" as a guide (to make it usable with different sets of strings), but it only prints the rest of the string after "mg".

string = "Avititapin Kora 100 mg Coated tablet"
mg = "mg"
after_mg = string[string.index(mg) + len(mg):]
print(after_mg)

Can someone point me in the right direction on how to write code that lets me split the string into three parts and store each one in its own variable?

I'm thinking of using the fact that there will always be a number in the string (100 for this product). Can I write something along the lines of: everything before the number, then the number plus the next two letters (in this case "mg"), and then the rest of the string after "mg"? That last part changes too, but the code I've already got handles it, so I know that much works at least.

I should say that I'm totally new to coding, so any advice or bit of help is incredibly helpful and appreciated! Am I thinking about this the wrong way, or is this actually doable?

Upvotes: 0

Views: 91

Answers (2)

RufusVS

Reputation: 4127

While I love regex, I think it overcomplicates the solution here. Since this appears to be a space-delimited string, why not just split() it and take the pieces one at a time until you reach the next field? (By the way, please don't use string as a variable name, as it's the name of a standard library module.)

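mystring = "Avititapin Kora 100 mg Coated tablet"
wordlist = mystring.split()

# The first word always belongs to the product name.
product = wordlist.pop(0)

# Keep appending words to the name until a purely numeric word appears.
word = wordlist.pop(0)
while not word.isnumeric():
    product += " " + word
    word = wordlist.pop(0)

# The numeric word is the dosage amount.
dosage = word
word = wordlist.pop(0)

# Append the unit if one follows the number.
if word in ["g", "mg"]:
    dosage += " " + word
    word = wordlist.pop(0)

# Everything left over is the description.
comments = ' '.join([word, *wordlist])

print(product, dosage, comments, sep='\n')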

This code is untested and typed in manually, so adjustments may be needed.
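
The same idea can be wrapped in a function so it can be reused on many lines. What follows is only a minimal sketch, not part of the original answer: the name split_product and the units parameter are assumptions made for illustration, and the function returns empty strings instead of raising IndexError when a field is missing.

```python
def split_product(text, units=("g", "mg")):
    """Split 'name amount unit rest' on whitespace (hypothetical helper)."""
    words = text.split()
    # Find the first purely numeric word; it marks the start of the dosage.
    idx = next((i for i, w in enumerate(words) if w.isnumeric()), None)
    if idx is None:
        # No numeric word found: treat the whole text as the product name.
        return " ".join(words), "", ""
    product = " ".join(words[:idx])
    dosage = words[idx]
    rest = words[idx + 1:]
    # Attach the unit only if the next word is a known unit.
    if rest and rest[0] in units:
        dosage += " " + rest.pop(0)
    return product, dosage, " ".join(rest)


print(split_product("Avititapin Kora 100 mg Coated tablet"))
```

Note that str.isnumeric() returns False for "100.2", so a decimal strength would need a different check (for example a regex).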

Upvotes: 1

Kraigolas

Reputation: 5570

As @clubby789 pointed out, regex is a good way to solve this problem. However, the pattern for this can be a bit complicated. To make this simple, I've defined a function to help extract what you're looking for:

import re

def extract(line):
    pattern = re.compile(r"(.*?)\s*(\d+[.]{0,1}\d*)\s*(mg|g)\s+(.*?)$")
    result = pattern.match(line) # match object (None if the line does not fit the pattern)
    name = result.group(1) # "Avititapin Kora"
    amount = f"{result.group(2)} {result.group(3)}" # "100 mg"
    dose_type = result.group(4) # "Coated tablet"
    return name, amount, dose_type # return all 3 together

The results for each group are added in the comments. Let's look at the pattern:

(.*?)\s*(\d+[.]{0,1}\d*)\s*(mg|g)\s+(.*?)$

The first (.*?) says capture everything from the beginning of your string until we encounter the next part of the pattern, which is

\s*(\d+[.]{0,1}\d*)

This says match (but don't capture into a group) zero or more whitespace characters after the first part (that's the \s*). Whatever is inside parentheses ( ) is captured. So (\d+[.]{0,1}\d*) captures one or more digits, optionally followed by a decimal point and more digits (i.e. it allows both integers and decimal numbers).
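
To see what the number part accepts on its own, it can be tried in isolation. This is just a small sketch; the variable name number and the sample inputs are chosen for illustration.

```python
import re

# Same number sub-pattern as in the answer, tested on its own.
number = re.compile(r"\d+[.]{0,1}\d*")

for candidate in ["100", "100.2", "100.", ".5", "abc"]:
    # fullmatch succeeds only if the whole candidate fits the pattern.
    print(candidate, bool(number.fullmatch(candidate)))
```

"100", "100.2" and "100." are accepted, while ".5" and "abc" are rejected, because the pattern requires at least one digit before the optional decimal point.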

Next

\s*(mg|g)

says match zero or more whitespace characters, and then capture your units. So, if you wanted to add kg, you would replace this with \s*(mg|g|kg). Then, after one or more whitespace characters (\s+), capture the remaining content up to the end of the line: (.*?)$.
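
If the list of units is likely to grow, the alternation can be built from a sequence instead of being edited by hand. The following is a rough sketch under that assumption; build_pattern and its units parameter are hypothetical names, not part of the answer's code.

```python
import re

def build_pattern(units=("mg", "g", "kg")):
    # Hypothetical helper: builds the answer's pattern from a sequence of units.
    # re.escape guards against units that contain regex metacharacters.
    alternatives = "|".join(re.escape(u) for u in units)
    return re.compile(rf"(.*?)\s*(\d+[.]{{0,1}}\d*)\s*({alternatives})\s+(.*?)$")

pattern = build_pattern()
match = pattern.match("Avititapin Kora 2 kg Coated tablet")
print(match.groups())  # ('Avititapin Kora', '2', 'kg', 'Coated tablet')
```

The doubled braces in {{0,1}} are needed only because the pattern is an f-string; the compiled regex still contains {0,1}.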

Let's test it:

test_data = """Avititapin Kora 100 mg Coated tablet
Avititapin 100 mg tablet
Avititapin Kora 100 g Coated tablet
Avititapin Kora 100.2 g Coated tablet
Avititapin Kora-24 100.2 g Coated tablet"""
for line in test_data.split("\n"):
    print(extract(line))

This prints

('Avititapin Kora', '100 mg', 'Coated tablet')
('Avititapin', '100 mg', 'tablet')
('Avititapin Kora', '100 g', 'Coated tablet')
('Avititapin Kora', '100.2 g', 'Coated tablet')
('Avititapin Kora-24', '100.2 g', 'Coated tablet')
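
One thing to watch for: when a line does not fit the pattern, match() returns None and calling .group() on it raises AttributeError. A possible way to handle that is sketched below; extract_or_none is a hypothetical variant written for illustration, using the same pattern as this answer.

```python
import re

# Same pattern as in the answer, compiled once at module level.
PATTERN = re.compile(r"(.*?)\s*(\d+[.]{0,1}\d*)\s*(mg|g)\s+(.*?)$")

def extract_or_none(line):
    # Hypothetical variant of extract() that tolerates non-matching lines.
    result = PATTERN.match(line)
    if result is None:
        return None  # no "<number> <unit> <rest>" found in this line
    name, number, unit, dose_type = result.groups()
    return name, f"{number} {unit}", dose_type

print(extract_or_none("Avititapin Kora 100 mg Coated tablet"))
print(extract_or_none("Avititapin Kora Coated tablet"))  # None
```

A line that ends right after the unit, such as "Avititapin 100 mg", also yields None, because the pattern requires at least one whitespace character after the unit.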

Upvotes: 3
