I am looking for matches in the following string of text:
'<html xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mso="urn:schemas-microsoft-com:office:office">\n <head>\n <meta charset="utf-8"/>\n <title>\n SN G2250-010\n </title>\n <!--[if gte mso 9]><xml>\n<mso:CustomDocumentProperties>\r\n<mso:Service_x0020_Note msdt:dt="string">SN</mso:Service_x0020_Note>\r\n<mso:Order msdt:dt="string">1493700.00000000</mso:Order>\r\n<mso:ContentType msdt:dt="string">Document</mso:ContentType>\r\n</mso:CustomDocumentProperties>\n</xml><![endif]-->\n </head>\n <link href="..\\..\\_format.css" rel="stylesheet" type="text/css"/>\n <body>\n <table>\n <tr>\n <td>\n <img border="0" src="SN_G2250_010//r1_logo1.gif"/>\n </td>\n <td align="left" width="178">\n <img border="0" src="SN_G2250_010//r1_logo2.gif"/>\n </td>\n <td>\n <div class="subtitle2">\n <b>\n <font color="red">\n Life Sciences and Chemical Analysis Service Note\n </font>\n </b>\n </div>\n </td>\n </tr>\n </table>\n <h2>\n SERVICE NOTE G2250-010\n </h2>\n <pre>Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNone\r\n\r\nSituation:\r\nChanges of operating software to Windows 2000 and implementation\r\nof ChemStation Rev. A.9.01 required some testing of the CCMode \r\n\r\nsoftware INB22000 / INB22002 / INB22003 and INB22004 Rev. A.03.02.\r\n\r\nSolution/Action:\r\nBefore using the Micro-plate Sampling Software INB22000 / INB22002 \r\n/ INB22003 or INB22004 Rev. A.03.02 (CCMode) on a PC with \r\nWindows 2000 a minor change in the "Control panel" must be made. \r\nIf this change is not made some icons in the user interface will \r\nnot be represented correctly. 
The functionality itself is not \r\ninfluenced:\r\n\r\nOpen "Settings", "Control Panel", "Display", "Appearance".\r\n\r\nGo to the "Scheme" and select the choice "Windows Classic". \r\nPress "OK" and close the "Control Panel" window.Required "Regional \r\nSettings" for both WIN NT and WIN2000\r\n\r\nIn order to run and edit parameters within CC-Mode your \r\nPC must be setup in this way:\r\n\r\n- Regional settings: English (United States)\r\n- Number format (default for English (United States)) \r\n Decimal symbol \'.\'\r\n- Number format (default for English (United States)) \r\n Digit grouping symbol \',\'\r\n\r\nNotes about using WIN2000:\r\n\r\n1. The installation and operation of CCMode (A.03.0x) and \r\nPurify SW (A.01.01) on the same PC is not recommended and \r\nnot supported.\r\n\r\n2. CCMode A.03.01 has not been tested. Customers owning \r\nthis version must upgrade to A.03.02 even if the additional \r\nfeatures for preparative analysis are not needed.\r\n\r\n3. The combination CCmode A.03.0x, ChemStation A.08.0x and \r\nWindows 2000 has not been tested and is not supported.\r\n\r\n\r\n\r\nDate:\r\n3/11/02\r\n******************************************************************************\r\n\r\n* Information Only
*\r\n******************************************************************************\r\n* Author/Entity: AG/B404 *\r\n* Additional Information: None
*\r\n******************************************************************************\r\n</pre>\n </body>\n</html>\n'
I define a raw string in Python 3.6.4:
r = r'Supersedes:?[\\r\\n ]+[\w\-\s]+[\\r\\n ]+(.*)[\\r\\n ]+Serial Numbers?:?[ \\r\\n]+.*?[ \\n\\r]\*+[\\n\\r ]+\*([A-Za-z ]+)[ \\n\\r]\*+[\\n\\r]+.*?\*+[ \\n\\r]+.*?\*\s+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'
I then use it to search:
a = re.search(r, raw_string, re.M|re.S)
This returns no matches:
a[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not subscriptable
However, the exact same string and regex do match on regex101:
https://regex101.com/r/qgJMbO/1
Can anyone tell me what the problem could be?
Edit:
The expected outcome is:
a[1] `INB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\
a[2] ' Information Only '
a[3] 'AG/B404 '
Upvotes: 0
Views: 99
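Here is a solution that uses both BeautifulSoup and re. BeautifulSoup extracts the text of the <pre> element, and the regex is then run against that text, which contains real carriage-return and line-feed characters: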
from bs4 import BeautifulSoup as bs4
import re
docstring = '<html xmlns:msdt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mso="urn:schemas-microsoft-com:office:office">\n <head>\n <meta charset="utf-8"/>\n <title>\n SN G2250-010\n </title>\n <!--[if gte mso 9]><xml>\n<mso:CustomDocumentProperties>\r\n<mso:Service_x0020_Note msdt:dt="string">SN</mso:Service_x0020_Note>\r\n<mso:Order msdt:dt="string">1493700.00000000</mso:Order>\r\n<mso:ContentType msdt:dt="string">Document</mso:ContentType>\r\n</mso:CustomDocumentProperties>\n</xml><![endif]-->\n </head>\n <link href="..\\..\\_format.css" rel="stylesheet" type="text/css"/>\n <body>\n <table>\n <tr>\n <td>\n <img border="0" src="SN_G2250_010//r1_logo1.gif"/>\n </td>\n <td align="left" width="178">\n <img border="0" src="SN_G2250_010//r1_logo2.gif"/>\n </td>\n <td>\n <div class="subtitle2">\n <b>\n <font color="red">\n Life Sciences and Chemical Analysis Service Note\n </font>\n </b>\n </div>\n </td>\n </tr>\n </table>\n <h2>\n SERVICE NOTE G2250-010\n </h2>\n <pre>Supersedes: None\r\n \r\nINB22000 compatibility with Windows 2000 and ChemStation A.9.01\r\n\r\nSerial Numbers:\r\nUS00000000 - US99999999\r\n\r\nThe CCMode software is in general compatible with Windows 2000 and \r\nChemStation Revision A.9.01. Please see required settings!\r\n\r\nTo Be Performed By:\r\nAgilent-Qualified Personnel\r\n\r\nParts Required:\r\n\r\nNone\r\n\r\nSituation:\r\nChanges of operating software to Windows 2000 and implementation\r\nof ChemStation Rev. A.9.01 required some testing of the CCMode \r\n\r\nsoftware INB22000 / INB22002 / INB22003 and INB22004 Rev. A.03.02.\r\n\r\nSolution/Action:\r\nBefore using the Micro-plate Sampling Software INB22000 / INB22002 \r\n/ INB22003 or INB22004 Rev. A.03.02 (CCMode) on a PC with \r\nWindows 2000 a minor change in the "Control panel" must be made. \r\nIf this change is not made some icons in the user interface will \r\nnot be represented correctly. 
The functionality itself is not \r\ninfluenced:\r\n\r\nOpen "Settings", "Control Panel", "Display", "Appearance".\r\n\r\nGo to the "Scheme" and select the choice "Windows Classic". \r\nPress "OK" and close the "Control Panel" window.Required "Regional \r\nSettings" for both WIN NT and WIN2000\r\n\r\nIn order to run and edit parameters within CC-Mode your \r\nPC must be setup in this way:\r\n\r\n- Regional settings: English (United States)\r\n- Number format (default for English (United States)) \r\n Decimal symbol \'.\'\r\n- Number format (default for English (United States)) \r\n Digit grouping symbol \',\'\r\n\r\nNotes about using WIN2000:\r\n\r\n1. The installation and operation of CCMode (A.03.0x) and \r\nPurify SW (A.01.01) on the same PC is not recommended and \r\nnot supported.\r\n\r\n2. CCMode A.03.01 has not been tested. Customers owning \r\nthis version must upgrade to A.03.02 even if the additional \r\nfeatures for preparative analysis are not needed.\r\n\r\n3. The combination CCmode A.03.0x, ChemStation A.08.0x and \r\nWindows 2000 has not been tested and is not supported.\r\n\r\n\r\n\r\nDate:\r\n3/11/02\r\n******************************************************************************\r\n\r\n* Information Only *\r\n******************************************************************************\r\n* Author/Entity: AG/B404 *\r\n* Additional Information: None *\r\n******************************************************************************\r\n</pre>\n </body>\n</html>\n'
soup = bs4(docstring, 'lxml')
description_source = soup.find('pre')
s = description_source.text
# Raw string with single backslashes: the regex engine reads \r and \n as CR and LF.
r = r'Supersedes:?[\r\n ]+[\w\-\s]+[\r\n ]+(.*)[\r\n ]+Serial Numbers?:?[ \r\n]+.*?[ \n\r]\*+[\n\r ]+\*([A-Za-z ]+)[ \n\r]\*+[\n\r]+.*?\*+[ \n\r]+.*?\*\s+(?:Author[:\w\/]+ ([\.\w\/\s�]+))'
a = re.search(r, s, re.M|re.S)
s = s.split('\r\n')
print(s[2])
print(a[2])
print(a[3])
Returns:
INB22000 compatibility with Windows 2000 and ChemStation A.9.01
Information Only
AG/B404
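As for why the pattern from the question finds nothing: inside a raw string, \\r stays as two backslashes followed by r, and the regex engine reads \\ as an escaped literal backslash. The class [\\r\\n ] therefore means "backslash, r, n or space" and cannot match a real carriage return or line feed. Below is a minimal sketch of the difference. The sample strings are made up for illustration, and my guess (not verified) is that the text pasted into regex101 contained the escape sequences as literal characters, which would explain the match there.

```python
import re

# Character class as written in the question's raw-string pattern.
# The regex engine sees \\ (a literal backslash), 'r', 'n', and a space.
raw_class = r'[\\r\\n ]+'

# Same class with single backslashes: the regex engine sees \r and \n,
# i.e. carriage return and line feed.
crlf_class = r'[\r\n ]+'

# A non-raw literal with doubled backslashes is the same pattern text,
# because Python collapses each \\ to one backslash before re sees it.
assert '[\\r\\n ]+' == crlf_class

real_text = "a\r\nb"       # contains actual CR and LF characters
literal_text = r"a\r\nb"   # contains backslash, 'r', backslash, 'n' as text

print(re.findall(raw_class, real_text))      # []
print(re.findall(crlf_class, real_text))     # ['\r\n']
print(re.findall(raw_class, literal_text))   # ['\\r\\n']
print(re.findall(crlf_class, literal_text))  # []
```

So the doubled backslashes only make sense in a non-raw string, or when the text being searched really contains backslash characters.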
Upvotes: 4