Reputation: 259
I have a string like the one below:
Features: -Includes hanging accessories. -Artist: William-Adolphe Bouguereau. -Made with 100pct cotton canvas. -100pct Anti-shrink pine wood bars and Epson anti-fade ultra chrome inks. -100pct Hand-made and inspected in the U.S.A. -Orientation: Horizontal. **Subject: -Figures/Nautical and beach.** Gender: -Unisex/Both. Size: -Mini 17'' and under/Small 18''-24''/Medium 25''-32''/Large 33''-40''/Oversized 41'' and above. Style: -Fine art. Color: -Blue. Country of Manufacture: -United States. Product Type: -Print of painting. Region: -Europe. Primary Art Material: -Canvas. Dimensions: -8'' H x 12'' W x 0.75'' D: 0.72 lb. -12'' H x 18'' W x 0.75'' D: 1.14 lbs. -12'' H x 18'' W x 1.5'' D: 2.45 lbs. -18'' H x 26'' W x 0.75'' D: 1.44 lbs. Paintings Prints Tori White Wildon Photography Photos Posters Abstract Black D cor Designs Framed Hazelwood Hokku Home Landscape Oil Accent 075 12 15 18 26 40 60 8 D H W x 1 1017 1824 2532 holidays, christmas gift gifts for girls boys
I need to find the words that follow a particular word.
In the example above, I want to extract the words after the word "Subject".
The output should look like this:
Subject: -Figures/Nautical and beach.
I tried the regex below:
re.compile('(?<=subject)(.{30}(?:\s|.))',re.I)
But the number of words after the "Subject" keyword is not fixed, so I can't specify an exact count.
How do I stop at a period or a space? There is no other specific stopping criterion.
Upvotes: 1
Views: 13492
Reputation: 4196
Regex:
(Subject:.+)\*\*
This matches "Subject:" and the content after it, up to the literal '**' that follows.
Code:
import re
# Double quotes keep the '' inch marks inside the string; the name avoids shadowing the built-in str.
text = "Features: -Includes hanging accessories. -Artist: William-Adolphe Bouguereau. -Made with 100pct cotton canvas. -100pct Anti-shrink pine wood bars and Epson anti-fade ultra chrome inks. -100pct Hand-made and inspected in the U.S.A. -Orientation: Horizontal. **Subject: -Figures/Nautical and beach.** Gender: -Unisex/Both. Size: -Mini 17'' and under/Small 18''-24''/Medium 25''-32''/Large 33''-40''/Oversized 41'' and above. Style: -Fine art. Color: -Blue. Country of Manufacture: -United States. Product Type: -Print of painting. Region: -Europe. Primary Art Material: -Canvas. Dimensions: -8'' H x 12'' W x 0.75'' D: 0.72 lb. -12'' H x 18'' W x 0.75'' D: 1.14 lbs. -12'' H x 18'' W x 1.5'' D: 2.45 lbs. -18'' H x 26'' W x 0.75'' D: 1.44 lbs. Paintings Prints Tori White Wildon Photography Photos Posters Abstract Black D cor Designs Framed Hazelwood Hokku Home Landscape Oil Accent 075 12 15 18 26 40 60 8 D H W x 1 1017 1824 2532 holidays, christmas gift gifts for girls boys"
# Capture "Subject:" and everything up to the literal ** after it.
a = re.search(r'(Subject:.+)\*\*', text)
print(a.group(1))
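One thing to watch with this pattern: `.+` is greedy, so it runs to the last `**` in the string, not the first. The sketch below uses a shortened, hypothetical sample with a second `**...**` pair (not present in the original data) to show the difference from a lazy `.+?`. Both variants assume the `**` markers really are part of the input.

```python
import re

# Hypothetical shortened sample; the second **...** pair is an assumption
# added only to show how greedy matching behaves.
sample = ("**Subject: -Figures/Nautical and beach.** Gender: -Unisex/Both. "
          "**Color: -Blue.**")

# Greedy: .+ extends to the last ** in the string.
greedy_match = re.search(r'(Subject:.+)\*\*', sample).group(1)

# Lazy: .+? stops at the first ** after "Subject:".
lazy_match = re.search(r'(Subject:.+?)\*\*', sample).group(1)

print(greedy_match)
print(lazy_match)
```

With only one `**` after "Subject:", as in the question's string, both forms return the same text.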
Upvotes: 0
Reputation: 626799
Your (?<=subject)(.{30}(?:\s|.)) regex asserts the position right after subject, then grabs 30 characters other than a linebreak symbol, and then matches either a whitespace or any character but a linebreak symbol. This does not really fit your requirements, as the substring can be of any length.
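To make the fixed-width problem concrete, here is a small sketch that runs the original pattern against a shorter, hypothetical value ("-Animals." is an assumption, not taken from the question). The capture is always 31 characters long, so it spills into the next field.

```python
import re

# Hypothetical sample with a shorter Subject value than in the question.
sample = "Orientation: Horizontal. Subject: -Animals. Gender: -Unisex/Both."

# The pattern from the question: 30 characters plus one more.
fixed = re.compile(r'(?<=subject)(.{30}(?:\s|.))', re.I)

captured = fixed.search(sample).group(1)
print(repr(captured))
print(len(captured))  # the capture length never varies
```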
You may use an alternation-based regex with a capturing group:
subject:\s*([^.]+|\S+)
See the regex demo
Details:
subject: - the literal subject: string
\s* - 0+ whitespaces
([^.]+|\S+) - Group 1 capturing 1 or more non-period symbols or 1+ non-whitespace symbols
Note: the order of the alternatives matters here since [^.]+ matches spaces, and \S+ does not. If the substring after \s* starts with a dot, the \S+ will match that substring up to a whitespace.
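The note about alternative order can be checked with a short sketch. The `swapped` pattern and the dot-prefixed sample are hypothetical, introduced only for comparison.

```python
import re

# The pattern from this answer, and a variant with the alternatives swapped.
preferred = re.compile(r'subject:\s*([^.]+|\S+)', re.I)
swapped = re.compile(r'subject:\s*(\S+|[^.]+)', re.I)

sample = "Subject: -Figures/Nautical and beach. Gender: -Unisex/Both."

# [^.]+ first: spaces are allowed, so the match runs up to the period.
first = preferred.search(sample).group(1)

# \S+ first: the match stops at the first whitespace.
second = swapped.search(sample).group(1)

# Value starting with a dot: [^.]+ cannot match, so \S+ takes over.
dot_case = preferred.search("Subject: .hidden value").group(1)

print(first)
print(second)
print(dot_case)
```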
import re
p = re.compile(r'subject:\s*([^.]+|\S+)', re.IGNORECASE)
s = "Features: -Includes hanging accessories. -Artist: William-Adolphe Bouguereau. -Made with 100pct cotton canvas. -100pct Anti-shrink pine wood bars and Epson anti-fade ultra chrome inks. -100pct Hand-made and inspected in the U.S.A. -Orientation: Horizontal. **Subject: -Figures/Nautical and beach.** Gender: -Unisex/Both. Size: -Mini 17'' and under/Small 18''-24''/Medium 25''-32''/Large 33''-40''/Oversized 41'' and above. Style: -Fine art. Color: -Blue. Country of Manufacture: -United States. Product Type: -Print of painting. Region: -Europe. Primary Art Material: -Canvas. Dimensions: -8'' H x 12'' W x 0.75'' D: 0.72 lb. -12'' H x 18'' W x 0.75'' D: 1.14 lbs. -12'' H x 18'' W x 1.5'' D: 2.45 lbs. -18'' H x 26'' W x 0.75'' D: 1.44 lbs. Paintings Prints Tori White Wildon Photography Photos Posters Abstract Black D cor Designs Framed Hazelwood Hokku Home Landscape Oil Accent 075 12 15 18 26 40 60 8 D H W x 1 1017 1824 2532 holidays, christmas gift gifts for girls boys"
m = p.search(s)
if m:
    print(m.group())   # this includes Subject:
    print(m.group(1))  # this does not include Subject:
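If the same lookup is needed for other labels, the pattern can be wrapped in a small helper. This is only a sketch: `extract_field` is a hypothetical name, and it assumes every label is followed by a colon, as in the sample string.

```python
import re

def extract_field(text, label):
    """Return the text after 'label:' up to the next period, or None."""
    # re.escape guards against labels that contain regex metacharacters.
    pattern = re.compile(re.escape(label) + r':\s*([^.]+|\S+)', re.IGNORECASE)
    m = pattern.search(text)
    return m.group(1) if m else None

# Shortened stand-in for the product description in the question.
sample = ("Orientation: Horizontal. Subject: -Figures/Nautical and beach. "
          "Gender: -Unisex/Both. Style: -Fine art.")

print(extract_field(sample, "subject"))
print(extract_field(sample, "Gender"))
print(extract_field(sample, "Missing"))
```

Because the capture stops at the first period, values that contain periods themselves (such as the decimal sizes under "Dimensions") would be cut short and need a different stopping rule.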
Upvotes: 2