Reputation: 3031

Regex that captures backslash

I know there are backslash posts, but their suggestions do not work for me. I am trying to capture everything that comes after SUBJECT: and up to COMPANY (see below).

I'm using the code below (notice the double backslashes \\ at the end of the character class), but my regex output stops at CHI Children because of the backslash in 'CHI Children\'s'. How do I deal with this backslash that doesn't want to be caught?

indextext = re.findall(r'SUBJECT:\s+[A-Z\s\(\w+\%\)\;\&\:\-\,\/\\]+', udoc2)[0]
indextext = re.sub(r'\r\n','\n', indextext)

UPDATE: The reason I can't pre-specify 'COMPANY:' is because each document has a different word. Sometimes company doesn't exist. I would be forced to hard code dozens of exceptions.

udoc = [SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:]

current output:

SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS & STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%); VENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%); CHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP (78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS (78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT INNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%); SPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS (74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%); LABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE (62%) NY-GENYOUth-SAP; CHI Children

Upvotes: 0

Answers (6)

Adam Smith

Reputation: 54163

My Big Huge Caveat:

I don't like your approach, so I'm throwing it out the window. The last thing you want to do is to use regular expressions to match HUGE NUMBERS OF THINGS while you wait to get to just a few things. That's the exact opposite of what a regex should do: so don't you do it either.

My Big Huge Assumption:

I played with your code for quite a while, trying to figure out exactly what you were trying to do and why. It seems to me you're trying to index those values somehow, something like {"ENTREPRENEURSHIP":93,"PRESS RELEASES":91,...}, so that's what I built. Maybe that's not your end goal, in which case jeebus brother give us some feedback here....

My Itty Bitty Code:

text = """udoc = [SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:]"""
# sheesh that's a big string literal! Let's take a few lines to breathe.
# after all, we have to give the interpreter enough
# time
# to
# process all that
# data we just fed it
#
# ...right?

values = {' '.join(item[:-1]):item[-1].strip("(%)") for
  full_list in text.split(":")[1:-1] for
  element in full_list.split(";")[:-1] for
  item in [element.strip().split()]}

for key,value in values.items():
    print("{:35}: {}".format(key,value))

# AMERICAN FOOTBALL TOURNAMENTS      : 74
# LABOR FORCE                        : 70
# BUSINESS ANALYTICS                 : 67
# COMPUTER SOFTWARE                  : 85
# CHARITIES                          : 78
# AMERICAN FOOTBALL                  : 74
# BUSINESS SOFTWARE (62%)            : NY-GENYOUth-SAP
# FOUNDATIONS                        : 78
# CHILDREN                           : 78
# SPORTS                             : 74
# SPONSORSHIP                        : 78
# EDUCATION SYSTEMS & INSTITUTIONS   : 78
# PUBLIC PRIVATE PARTNERSHIPS        : 78
# SPORTS & RECREATION EVENTS         : 74
# NUTRITION                          : 90
# ALLIANCES & PARTNERSHIPS           : 77
# ENTERTAINMENT & ARTS               : 77
# PRESS RELEASES                     : 91
# WORKPLACE PROGRAMS                 : 77
# VENTURE CAPITAL                    : 90
# CHI Children's Related             : News
# STUDENTS & STUDENT LIFE            : 90
# AGRICULTURE DEPARTMENTS            : 73
# EXERCISE & FITNESS                 : 90
# ENTREPRENEURSHIP                   : 93
# NONPROFIT ORGANIZATIONS            : 90
# PRODUCT INNOVATION                 : 77
# SPORTS FANS                        : 74
# PHILANTHROPY                       : 78
# LICENSING AGREEMENTS               : 74
# PREVENTION & WELLNESS              : 90
# EXECUTIVES                         : 70

Now I know what you're saying. "adsmith," you begin, "but look at the values in 'CHI Children's Related' and 'BUSINESS SOFTWARE (62%)'. That's clearly wrong!!"

I can't help your input being poorly formatted; no one can. CHI Children's Related has a value of News because that entry has no percentage at all; that's not your fault and it's not my fault. They neglected to put a ; between (62%) and NY-GENYOUth-SAP, and we don't take the blame for that either.

Conclusion

On second thought, let's not go to the re module. 'Tis a silly place.

Upvotes: 2

PyNEwbie

Reputation: 4940

Your question is a little vague, so I am not totally sure what you are looking for.

udoc = "SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:"

Notice the change from a list to a string.

It seems to me you are looking for everything between the colons:

s = udoc.split(':')[1]

Then you might need to mess around with the individual items:

mylist = s.split(';')

To clean them up a little

newlist = []
for item in mylist:
    newlist.append(' '.join(item.split()))

You can get rid of the last word (COMPANY in this case) with some easy manipulation:

newlist[-1] = ' '.join(newlist[-1].split()[:-1])

Finally, if you want the results as a string, just join newlist with some separator.
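
Putting the steps above together, a minimal sketch might look like this. The shortened sample string and the "; " separator are my own assumptions for illustration; the real udoc is much longer.

```python
# Shortened sample standing in for the full udoc string (assumption).
udoc = ("SUBJECT: ENTREPRENEURSHIP (93%); STUDENTS\r\n& STUDENT LIFE (90%); "
        "LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:")

# Everything between the first and second colon.
s = udoc.split(':')[1]

# Split into items and collapse line breaks / repeated spaces inside each one.
newlist = [' '.join(item.split()) for item in s.split(';')]

# Drop the trailing word (COMPANY here) from the last item.
newlist[-1] = ' '.join(newlist[-1].split()[:-1])

# Join back into a single string; the separator is an arbitrary choice.
result = '; '.join(newlist)
print(result)
```

Note that this relies on the section containing no colons of its own, since the split happens on ':'.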

Upvotes: 1

bauman.space

Reputation: 2023

You are not the first to bang your head here

http://docs.python.org/2/howto/regex.html#the-backslash-plague

In a plain (non-raw) pattern string you will need 4 backslashes to match a single backslash in your target string. In a raw string (r'...') you need 2.

That said, I like using an interactive tool to perfect the regex, such as Regex Coach. http://www.weitz.de/regex-coach/

If you don't want to do the silly 4 backslashes, copy from your external tool and use re.compile(re.escape(string))

http://docs.python.org/2/library/re.html#re.escape
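
To make the backslash counts concrete, here is a small sketch. The sample strings and variable names are made up for illustration. It also checks one thing worth ruling out: in a normal (non-raw) Python literal, \' is just an escaped apostrophe, so a string written as 'CHI Children\'s' contains no backslash at all. If the real text is like that (an assumption on my part, since I can't see how udoc2 is built), the character class would need an apostrophe added rather than more backslashes.

```python
import re

target = "a\\b"  # three characters: a, one backslash, b

# Non-raw pattern: four backslashes in the source become two in the pattern,
# which the regex engine reads as one literal backslash.
nonraw_hit = re.search("\\\\", target)

# Raw pattern: two backslashes are enough.
raw_hit = re.search(r"\\", target)

# re.escape() builds the escaped pattern for you.
escaped_hit = re.search(re.escape("\\"), target)

# In a non-raw literal, \' is only an apostrophe; no backslash is stored.
sample = 'CHI Children\'s'
has_backslash = "\\" in sample
has_apostrophe = "'" in sample

print(len(target), bool(nonraw_hit), bool(raw_hit), bool(escaped_hit))
print(has_backslash, has_apostrophe)
```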

Upvotes: 1

Stepan Grigoryan

Reputation: 3162

How about this?

(SUBJECT\:.*\:)

You can see how it works at http://regex101.com/r/aB7nJ2
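
One caveat if you carry this pattern over to Python: by default . does not match a newline, so on multi-line input the pattern likely needs the re.DOTALL flag. A rough sketch, using a shortened sample string that I made up:

```python
import re

# Shortened sample standing in for the real document (assumption).
udoc = "SUBJECT: NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%)\r\n\r\nCOMPANY:"

# Without DOTALL, .* stops at the first \n and never reaches the final colon.
no_flag = re.search(r'(SUBJECT\:.*\:)', udoc)

# With DOTALL, .* runs across line breaks up to the last colon in the text.
with_flag = re.search(r'(SUBJECT\:.*\:)', udoc, re.DOTALL)

print(no_flag)
print(repr(with_flag.group(1)))
```

Because .* is greedy, this grabs everything up to the last colon in the string, which is fine for a single section but would overshoot if more sections follow.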

Upvotes: 0

John Dorian

Reputation: 1904

You don't have to use regex for this. In this case it seems like there is a much easier solution.

Why not get the index of "COMPANY:]" and then get everything up to that?
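
Something along these lines, as a rough sketch. The shortened sample string is made up, and the end marker is hard-coded, which the question's update says won't hold for every document, so treat the marker as a placeholder.

```python
# Shortened sample standing in for the real document (assumption).
udoc = "[SUBJECT: NUTRITION (90%); CHI Children's Related News\r\n\r\nCOMPANY:]"

start = udoc.find("SUBJECT:")
end = udoc.find("COMPANY:]")  # placeholder end marker

# str.find returns -1 when the marker is missing, so check before slicing.
if start != -1 and end != -1:
    indextext = udoc[start:end].strip()
else:
    indextext = None

print(repr(indextext))
```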

Upvotes: 0

user590028

Reputation: 11730

Why not:

import re
# '.' does not match newlines by default, so re.DOTALL is needed for multi-line text
re.search(r'SUBJECT:(.+?)COMPANY:', udoc2, re.DOTALL)

Upvotes: 0
