Reputation: 5049
I'm trying to extract text from specific parts of an MS Word document (link); a sample is below. Essentially, I need to write all text between the tags -- ASN1START and -- ASN1STOP to a file, excluding the tags themselves.
sample text
-- ASN1START
CounterCheck ::= SEQUENCE {
    rrc-TransactionIdentifier RRC-TransactionIdentifier,
    criticalExtensions CHOICE {
        c1 CHOICE {
            counterCheck-r8 CounterCheck-r8-IEs,
            spare3 NULL, spare2 NULL, spare1 NULL
        },
        criticalExtensionsFuture SEQUENCE {}
    }
}
CounterCheck-r8-IEs ::= SEQUENCE {
    drb-CountMSB-InfoList DRB-CountMSB-InfoList,
    nonCriticalExtension CounterCheck-v8a0-IEs OPTIONAL
}
CounterCheck-v8a0-IEs ::= SEQUENCE {
    lateNonCriticalExtension OCTET STRING OPTIONAL,
    nonCriticalExtension CounterCheck-v1530-IEs OPTIONAL
}
CounterCheck-v1530-IEs ::= SEQUENCE {
    drb-CountMSB-InfoListExt-r15 DRB-CountMSB-InfoListExt-r15 OPTIONAL, -- Need ON
    nonCriticalExtension SEQUENCE {} OPTIONAL
}
DRB-CountMSB-InfoList ::= SEQUENCE (SIZE (1..maxDRB)) OF DRB-CountMSB-Info
DRB-CountMSB-InfoListExt-r15 ::= SEQUENCE (SIZE (1..maxDRBExt-r15)) OF DRB-CountMSB-Info
DRB-CountMSB-Info ::= SEQUENCE {
    drb-Identity DRB-Identity,
    countMSB-Uplink INTEGER(0..33554431),
    countMSB-Downlink INTEGER(0..33554431)
}
-- ASN1STOP
I have tried using docx:
from docx import Document
import re
import json

fileName = './data/36331-f80.docx'
document = Document(fileName)
startText = re.compile(r'-- ASN1START')
for para in document.paragraphs:
    # look at each paragraph in turn
    text = para.text
    print(text)
    # if startText.match(para.text):
    #     print(text)
It seems that every line, including the tag lines mentioned above, is its own paragraph. I need help extracting just the text between the tags.
Upvotes: 1
Views: 1170
Reputation: 520878
You could first read all of the document's paragraph text into a single string, and then use re.findall to find all text in between the target tags:
text = ""
for para in document.paragraphs:
    text += para.text + "\n"
matches = re.findall(r'-- ASN1START\s*(.*?)\s*-- ASN1STOP', text, flags=re.DOTALL)
Note that we use re.DOTALL mode so that .*? can match content between the tags that spans multiple lines; without it, the dot does not match newline characters.
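Since the question also asks to write the result to a file, here is a minimal end-to-end sketch of the same idea. The helper names extract_asn1_blocks and write_blocks, the output file name, and the short sample string are made up for illustration; the sample stands in for the text you would build from document.paragraphs.

```python
import re

# Same pattern as above, compiled once; the tags come from the question.
ASN1_BLOCK = re.compile(r'-- ASN1START\s*(.*?)\s*-- ASN1STOP', flags=re.DOTALL)


def extract_asn1_blocks(text):
    """Return the text between each ASN1START/ASN1STOP pair, tags excluded."""
    return ASN1_BLOCK.findall(text)


def write_blocks(blocks, out_path):
    """Write the extracted blocks to out_path, separated by a blank line."""
    with open(out_path, 'w', encoding='utf-8') as fh:
        fh.write("\n\n".join(blocks) + "\n")


# Hypothetical stand-in for: "\n".join(p.text for p in document.paragraphs)
sample = "\n".join([
    "Some prose before the first block.",
    "-- ASN1START",
    "A ::= INTEGER",
    "B ::= BOOLEAN",
    "-- ASN1STOP",
    "Some prose in between.",
    "-- ASN1START",
    "C ::= NULL",
    "-- ASN1STOP",
])

blocks = extract_asn1_blocks(sample)
print(blocks)
```

With a real document you would pass the joined paragraph text instead of sample and then call, for example, write_blocks(blocks, 'asn1_blocks.txt'). One caveat of this pattern: the \s* right after the start tag also strips any leading whitespace on the first line of each block.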
Upvotes: 1