Reputation: 3031
I know there are other posts about backslashes, but their suggestions do not work for me. I am trying to capture everything that comes after SUBJECT: and up to COMPANY (see below).
I'm using the code below. Notice the double backslash (\\) at the end of the character class. My regex output still stops at CHI Children because of the backslash in 'CHI Children\'s'. What do I do to deal with this backslash that doesn't want to be caught?
indextext = re.findall(r'SUBJECT:\s+[A-Z\s\(\w+\%\)\;\&\:\-\,\/\\]+', udoc2)[0]
indextext = re.sub(r'\r\n','\n', indextext)
UPDATE: The reason I can't pre-specify 'COMPANY:' is that the closing word differs from document to document. Sometimes COMPANY doesn't exist at all. I would be forced to hard-code dozens of exceptions.
udoc = [SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:]
current output:
SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS & STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%); VENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%); CHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP (78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS (78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT INNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%); SPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS (74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%); LABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE (62%) NY-GENYOUth-SAP; CHI Children
Upvotes: 0
Views: 142
Reputation: 54163
I don't like your approach, so I'm throwing it out the window. The last thing you want to do is to use regular expressions to match HUGE NUMBERS OF THINGS while you wait to get to just a few things. That's the exact opposite of what a regex should do: so don't you do it either.
I played with your code for quite a while, trying to figure out exactly what you were trying to do and why. It seems to me you're trying to index those values somehow, something like {"ENTREPRENEURSHIP":93,"PRESS RELEASES":91,...}, so that's what I built. Maybe that's not your end goal, in which case jeebus brother give us some feedback here....
text = """udoc = [SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:]"""
# sheesh that's a big string literal! Let's take a few lines to breathe.
# after all, we have to give the interpreter enough
# time
# to
# process all that
# data we just fed it
#
# ...right?
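# Build {label: percentage} from the text between the first and last colon.
# Each ";"-separated element is split into words; the last word is the value.
values = {' '.join(item[:-1]): item[-1].strip("(%)")
          for full_list in text.split(":")[1:-1]
          for element in full_list.split(";")[:-1]
          for item in [element.strip().split()]}
for key, value in values.items():
    print("{:35}: {}".format(key, value))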
# AMERICAN FOOTBALL TOURNAMENTS : 74
# LABOR FORCE : 70
# BUSINESS ANALYTICS : 67
# COMPUTER SOFTWARE : 85
# CHARITIES : 78
# AMERICAN FOOTBALL : 74
# BUSINESS SOFTWARE (62%) : NY-GENYOUth-SAP
# FOUNDATIONS : 78
# CHILDREN : 78
# SPORTS : 74
# SPONSORSHIP : 78
# EDUCATION SYSTEMS & INSTITUTIONS : 78
# PUBLIC PRIVATE PARTNERSHIPS : 78
# SPORTS & RECREATION EVENTS : 74
# NUTRITION : 90
# ALLIANCES & PARTNERSHIPS : 77
# ENTERTAINMENT & ARTS : 77
# PRESS RELEASES : 91
# WORKPLACE PROGRAMS : 77
# VENTURE CAPITAL : 90
# CHI Children's Related : News
# STUDENTS & STUDENT LIFE : 90
# AGRICULTURE DEPARTMENTS : 73
# EXERCISE & FITNESS : 90
# ENTREPRENEURSHIP : 93
# NONPROFIT ORGANIZATIONS : 90
# PRODUCT INNOVATION : 77
# SPORTS FANS : 74
# PHILANTHROPY : 78
# LICENSING AGREEMENTS : 74
# PREVENTION & WELLNESS : 90
# EXECUTIVES : 70
Now I know what you're saying. "adsmith," you begin, "look at the values for 'CHI Children's Related' and 'BUSINESS SOFTWARE (62%)'. That's clearly wrong!"
I can't help your input being poorly formatted, no one can. CHI Children's Related has a value of News; that's not your fault and it's not my fault. They neglected to put a ; between (62%) and NY-GENYOUth-SAP, and we don't take the blame for that either.
On second thought, let's not go to the re module. 'Tis a silly place.
Upvotes: 2
Reputation: 4940
Your question is a little vague, so I am not totally sure what you are looking for.
udoc = "SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); NUTRITION (90%); STUDENTS\r\n& STUDENT LIFE (90%); PREVENTION & WELLNESS (90%); EXERCISE & FITNESS (90%);\r\nVENTURE CAPITAL (90%); NONPROFIT ORGANIZATIONS (90%); COMPUTER SOFTWARE (85%);\r\nCHILDREN (78%); PUBLIC PRIVATE PARTNERSHIPS (78%); CHARITIES (78%); SPONSORSHIP\r\n(78%); FOUNDATIONS (78%); PHILANTHROPY (78%); EDUCATION SYSTEMS & INSTITUTIONS\r\n(78%); ALLIANCES & PARTNERSHIPS (77%); ENTERTAINMENT & ARTS (77%); PRODUCT\r\nINNOVATION (77%); WORKPLACE PROGRAMS (77%); SPORTS & RECREATION EVENTS (74%);\r\nSPORTS FANS (74%); AMERICAN FOOTBALL TOURNAMENTS (74%); LICENSING AGREEMENTS\r\n(74%); AMERICAN FOOTBALL (74%); SPORTS (74%); AGRICULTURE DEPARTMENTS (73%);\r\nLABOR FORCE (70%); EXECUTIVES (70%); BUSINESS ANALYTICS (67%); BUSINESS SOFTWARE\r\n(62%) NY-GENYOUth-SAP; CHI Children\'s Related News; LIC Licensing and Marketing\r\nAgreements\r\n\r\nCOMPANY:"
Notice the change from a list to a string.
It seems to me you are looking for everything between the colons:
s = udoc.split(':')[1]
Then you might need to mess around with the individual items:
mylist = [item for item in s.split(';')]
To clean them up a little (this collapses the \r\n line breaks and repeated spaces into single spaces):
newlist = []
for item in mylist:
    newlist.append(' '.join(item.split()))
You can get rid of the last word (COMPANY in this case) with some easy manipulation:
newlist[-1] = ' '.join(newlist[-1].split()[:-1])
Finally, if you want the results as a string, just join newlist with some separator.
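The steps above, ending with the join, can be sketched in one piece. This is only an illustration: the udoc here is a shortened stand-in for the full string (an assumption, to keep the example readable), and '; ' is just one possible separator.

```python
# Shortened stand-in for the full udoc string above (assumption: same layout,
# i.e. "SUBJECT:" first, items separated by ";", a closing word before the last colon).
udoc = ("SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%); STUDENTS\r\n"
        "& STUDENT LIFE (90%)\r\n\r\nCOMPANY:")

# Everything between the first and second colon.
s = udoc.split(':')[1]

# Split into items and collapse line breaks / repeated whitespace in each one.
newlist = [' '.join(item.split()) for item in s.split(';')]

# Drop the trailing word (COMPANY here) from the last item.
newlist[-1] = ' '.join(newlist[-1].split()[:-1])

# Join back into a single string with a separator of your choice.
result = '; '.join(newlist)
print(result)
```

Because only the last word before the colon is dropped, this does not need to know that the word is COMPANY, as long as the closing marker is a single word.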
Upvotes: 1
Reputation: 2023
You are not the first to bang your head here:
http://docs.python.org/2/howto/regex.html#the-backslash-plague
You will need 4 backslashes to match a single literal backslash in your target string when the pattern is an ordinary (non-raw) string literal; in a raw string (r'...'), two are enough.
That said, I like using an interactive tool to perfect the regex, such as Regex Coach: http://www.weitz.de/regex-coach/
If you don't want to do the silly 4 backslashes, copy from your external tool and use re.compile(re.escape(string)). Note that re.escape treats the whole string as literal text to match.
http://docs.python.org/2/library/re.html#re.escape
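A small sketch to make the backslash counting concrete. One assumption to flag: if the text was built from a Python literal like the one in the question, the \' in 'CHI Children\'s' is only an escaped apostrophe, so the string holds no backslash at all. In that case the character class may be stopping at the apostrophe, which it does not list. The variable names below are made up for the illustration.

```python
import re

# In a normal Python literal, \' is just an apostrophe; no backslash is stored.
sample = 'CHI Children\'s Related News'
has_backslash = '\\' in sample  # False

# A string that really does contain one backslash: the characters a, \, b.
with_backslash = 'a\\b'

# Non-raw pattern: four backslashes in source -> two in the pattern
# -> the regex engine reads that as one literal backslash.
m_plain = re.search('\\\\', with_backslash)

# Raw pattern: two backslashes are enough.
m_raw = re.search(r'\\', with_backslash)

# re.escape builds the escaped pattern for a literal string.
m_escaped = re.search(re.escape('\\'), with_backslash)

print(has_backslash, m_plain.group(), m_raw.group(), m_escaped.group())
```

All three searches find the same single backslash character; they differ only in how the pattern is written in source code.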
Upvotes: 1
Reputation: 3162
How about this?
(SUBJECT\:.*\:)
You can see how it works at http://regex101.com/r/aB7nJ2
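For use from Python, here is a hedged sketch of the same pattern. It assumes the text contains real line breaks (as the \r\n escapes in the question would produce), in which case `.` needs the re.DOTALL flag to cross them. The sample string is a shortened stand-in for the real document.

```python
import re

# Shortened stand-in for the document text (assumption: real \r\n line breaks).
udoc = ("SUBJECT: ENTREPRENEURSHIP (93%); STUDENTS\r\n"
        "& STUDENT LIFE (90%)\r\n\r\nCOMPANY: X")

pattern = r'(SUBJECT:.*:)'

# Without DOTALL, "." stops at "\n", so no second colon is reached here.
without_flag = re.search(pattern, udoc)

# With DOTALL, the greedy ".*" runs to the LAST colon in the text.
with_flag = re.search(pattern, udoc, re.DOTALL)
captured = with_flag.group(1) if with_flag else None

print(without_flag)
print(repr(captured))
```

Since `.*` is greedy, the match extends to the last colon in the whole text, so on a longer document it may capture more than one section; a non-greedy `.*?` would stop at the first colon after SUBJECT: instead.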
Upvotes: 0
Reputation: 1904
You don't have to use regex for this. In this case it seems like there is a much easier solution.
Why not get the index of "COMPANY:]" and then get everything up to that?
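A minimal sketch of that idea using str.find. The function name extract_section and its parameters are hypothetical, and the sample text is a shortened stand-in for the real document. The end marker is passed in as an argument because the question's update says it varies between documents.

```python
def extract_section(text, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker.

    Hypothetical helper: returns None if either marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Shortened stand-in for the document in the question.
udoc = "[SUBJECT: ENTREPRENEURSHIP (93%); PRESS RELEASES (91%)\r\n\r\nCOMPANY:]"

section = extract_section(udoc, "SUBJECT:", "COMPANY:]")
missing = extract_section(udoc, "SUBJECT:", "ORGANIZATION:]")
print(section)
print(missing)
```

Returning None for a missing marker leaves it to the caller to decide what to do with documents that have no COMPANY line.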
Upvotes: 0