User123

Reputation: 833

Extract substrings separately from a string using python regex

I am trying to write a regular expression that returns the part of a string that follows a given substring. For example, I want to get the text, including its spaces, that comes after "15/08/2017".

a='''S
LINC             SHORT LEGAL                                   TITLE NUMBER
0037 471 661     1720278;16;21                                 172 211 342

LEGAL DESCRIPTION
PLAN 1720278  
BLOCK 16  
LOT 21  
EXCEPTING THEREOUT ALL MINES AND MINERALS  

ESTATE: FEE SIMPLE  
ATS REFERENCE: 4;24;54;2;SW

MUNICIPALITY: CITY OF EDMONTON

REFERENCE NUMBER: 172 023 641 +71

---------------------------------------------------------------------------- 
----
             REGISTERED OWNER(S)
REGISTRATION    DATE(DMY)  DOCUMENT TYPE      VALUE           CONSIDERATION
--------------------------------------------------------------------------- 
-- 
---

172 211 342    15/08/2017  AFFIDAVIT OF                       CASH & MTGE'''

Is there a way to get 'AFFIDAVIT OF' and 'CASH & MTGE' as separate strings?

Here is the expression I have pieced together so far:

doc = (a.split('15/08/2017', 1)[1]).strip()
'AFFIDAVIT OF                       CASH & MTGE'

Upvotes: 3

Views: 632

Answers (11)

Kami Kaze

Reputation: 611

Your problem is that your string is formatted the way it is. The line you are looking for is

182 246 612 01/10/2018 PHASED OF CASH & MTGE

You are then looking for whatever comes after 'PHASED OF' and some spaces.

You want to search for

(?<=PHASED OF)\s*(?P<your_text>.*?)\n

in your string. This will return a match object containing the value you are looking for in the group your_text.

m = re.search(r'(?<=PHASED OF)\s*(?P<your_text>.*?)\n', a)
your_desired_text = m.group('your_text')

Also: there are many good online regex testers for experimenting with your regexes. Once the regex works there, copy and paste it into Python.

I use this one: https://regex101.com/
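As a minimal sketch of the same named-group idea applied to the line from the question, the pattern below captures both fields at once. It assumes that words inside a field are separated by single spaces and that the two fields are separated by two or more whitespace characters; the group names doc_type and consideration are made up for illustration.

```python
import re

# Only the last line of the question's string, for brevity.
a = "172 211 342    15/08/2017  AFFIDAVIT OF                       CASH & MTGE"

# Assumption: single spaces inside a field, 2+ whitespace characters between fields.
pattern = (
    r"15/08/2017\s+"
    r"(?P<doc_type>\S+(?: \S+)*)"       # e.g. AFFIDAVIT OF
    r"\s{2,}"
    r"(?P<consideration>\S+(?: \S+)*)"  # e.g. CASH & MTGE
)

m = re.search(pattern, a)
if m:
    doc_type = m.group("doc_type")
    consideration = m.group("consideration")
    print(doc_type)       # AFFIDAVIT OF
    print(consideration)  # CASH & MTGE
```

If the line has more than two columns after the date, this pattern only picks up the first two, so it would need adjusting.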

Upvotes: 0

Tim Biegeleisen

Reputation: 521259

We can try using re.findall with the following pattern:

PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)

Searching in multiline and DOTALL mode, the above pattern will match everything from PHASED OF up to, but not including, CONDOMINIUM PLAN.

import re

# "text" rather than "input", to avoid shadowing the built-in
text = "182 246 612    01/10/2018  PHASED OF                           CASH & MTGE\n        CONDOMINIUM PLAN"
result = re.findall(r'PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)', text, re.DOTALL|re.MULTILINE)
output = result[0][0].strip()
print(output)

CASH & MTGE

Note that I also strip off whitespace from the match. We might be able to modify the regex pattern to do this, but in a general solution, maybe you want to keep some of the whitespace, in certain cases.

Upvotes: 3

PIG

Reputation: 602

Use a positive lookbehind assertion:

m = re.search(r'(?<=15/08/2017).*', a)
m.group(0)
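The lookbehind returns the rest of the line as one string. A rough sketch of how that remainder could then be broken into the two separate strings the question asks for, assuming the columns are always separated by two or more whitespace characters (an assumption about the data) and using only the last line of the question's string:

```python
import re

# Only the last line of the question's string, for brevity.
a = "172 211 342    15/08/2017  AFFIDAVIT OF                       CASH & MTGE"

# Lookbehind: match everything on the line after the date.
m = re.search(r"(?<=15/08/2017).*", a)
rest = m.group(0).strip() if m else ""

# Assumption: fields are separated by runs of two or more whitespace characters.
fields = re.split(r"\s{2,}", rest)
print(fields)  # ['AFFIDAVIT OF', 'CASH & MTGE']
```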

Upvotes: 1

Sharad

Reputation: 10602

A code snippet based on re:

import re
foo = '''S
LINC             SHORT LEGAL                                   TITLE NUMBER
0037 471 661     1720278;16;21                                 172 211 342

LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS

ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW

MUNICIPALITY: CITY OF EDMONTON

REFERENCE NUMBER: 172 023 641 +71

----------------------------------------------------------------------------
----
             REGISTERED OWNER(S)
REGISTRATION    DATE(DMY)  DOCUMENT TYPE      VALUE           CONSIDERATION
---------------------------------------------------------------------------
--
---

172 211 342    15/08/2017  AFFIDAVIT OF                       CASH & MTGE'''

# Raw string so the regex escapes (\d, \s, \w) reach re unchanged
pattern = r'.*\d{2}/\d{2}/\d{4}\s+(\w+\s+\w+)\s+(\w+\s+.*\s+\w+)'
result = re.findall(pattern, foo, re.MULTILINE)
print("1st match: ", result[0][0])
print("2nd match: ", result[0][1])

Output

1st match:  AFFIDAVIT OF
2nd match:  CASH & MTGE

Upvotes: 3

CodeIt

Reputation: 3618

Not a regex-based solution, but it does the trick.

a='''S
LINC             SHORT LEGAL                                   TITLE NUMBER
0037 471 661     1720278;16;21                                 172 211 342

LEGAL DESCRIPTION
PLAN 1720278  
BLOCK 16  
LOT 21  
EXCEPTING THEREOUT ALL MINES AND MINERALS  

ESTATE: FEE SIMPLE  
ATS REFERENCE: 4;24;54;2;SW

MUNICIPALITY: CITY OF EDMONTON

REFERENCE NUMBER: 172 023 641 +71

---------------------------------------------------------------------------- 
----
            REGISTERED OWNER(S)
REGISTRATION    DATE(DMY)  DOCUMENT TYPE      VALUE           CONSIDERATION
--------------------------------------------------------------------------- 
-- 
---

172 211 342    15/08/2017  AFFIDAVIT OF                       CASH & MTGE'''

doc = (a.split('15/08/2017', 1)[1]).strip() 
# used split with two white spaces instead of one to get the desired result
print(doc.split("  ")[0].strip()) # outputs AFFIDAVIT OF
print(doc.split("  ")[-1].strip()) # outputs CASH & MTGE

Hope it helps.

Upvotes: 3

Muhammad Bilal

Reputation: 2134

You can do this by using group(1)

re.match("(.*?)15/08/2017",a).group(1)

UPDATE

For the updated string, you can use .search instead of .match:

re.search(r"(.*?)15/08/2017", a).group(1)

Upvotes: 0

silverhash

Reputation: 919

Building on your expression, this is what I believe you need:

import re

a='172 211 342    15/08/2017  TRANSFER OF LAND   $610,000        CASH & MTGE'
re.match(r"(.*?)(\w+/)", a).group(1)

Output:

'172 211 342    '

Upvotes: 0

alecxe

Reputation: 473873

Why regular expressions?

It looks like you know the exact delimiting string, so just call str.split() with it and take the first part:

In [1]: a='172 211 342    15/08/2017  TRANSFER OF LAND   $610,000        CASH & MTGE'

In [2]: a.split("15/08/2017", 1)[0]
Out[2]: '172 211 342    '

Upvotes: 2

LOrD_ARaGOrN

Reputation: 4496

You need to use group(1):

import re
re.match("(.*?)15/08/2017",a).group(1)

Output

'172 211 342    '

Upvotes: 0

Tim Biegeleisen

Reputation: 521259

I would avoid matching the whole line with a single regex here, because the only meaningful separation between the logical terms appears to be two or more spaces. Individual terms, including the one you want to match, may also contain spaces. So I recommend doing a regex split on the input using \s{2,} as the pattern. This yields a list containing all the terms. Then we can walk down the list once, and when we find the term we are looking ahead to (here the date), we return the previous term in the list.

import re
a = "172 211 342    15/08/2017  TRANSFER OF LAND   $610,000        CASH & MTGE"
parts = re.compile(r"\s{2,}").split(a)
print(parts)

for i in range(1, len(parts)):
    if (parts[i] == "15/08/2017"):
        print(parts[i-1])

['172 211 342', '15/08/2017', 'TRANSFER OF LAND', '$610,000', 'CASH & MTGE']
172 211 342
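The same split can also return the terms on either side of the date, which covers the later version of the question (the text after "15/08/2017"). A small sketch, where terms_around is a hypothetical helper name and the two-or-more-whitespace separator is again an assumption about the data:

```python
import re

def terms_around(line, marker):
    """Split line on runs of 2+ whitespace; return (terms before, terms after) marker."""
    parts = re.split(r"\s{2,}", line.strip())
    if marker not in parts:
        return None  # marker is not a standalone term on this line
    i = parts.index(marker)
    return parts[:i], parts[i + 1:]

line = "172 211 342    15/08/2017  TRANSFER OF LAND   $610,000        CASH & MTGE"
before, after = terms_around(line, "15/08/2017")
print(before)  # ['172 211 342']
print(after)   # ['TRANSFER OF LAND', '$610,000', 'CASH & MTGE']
```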

Upvotes: 1

RoyaumeIX

Reputation: 1977

You have to return the right group:

re.match("(.*?)15/08/2017",a).group(1)

Upvotes: 0
