Reputation: 77
I need to extract a specific portion of text from a Word (.docx) document. The document has the following structure:
Question 1:
How many ítems…
two
four
five
ten
Explanation:
There are four ítems in the bag.
Question 2:
How many books…
two
four
five
Explanation:
There are four books in the bag.
With this information I have to create a DataFrame like this one:
I'm able to open the document, extract the text and print the lines starting with , but I'm not able to extract the rest of the strings of interest and create the DataFrame.
My code is:
import re
import docx

def getText(filename):
    # Collect the text of every paragraph, one paragraph per line
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

text = getText('document.docx')
text
strings = re.findall(r" (.+)\n", text)
Any help? Thanks in advance
Upvotes: 0
Views: 180
Reputation: 1282
I would suggest you expand your regular expression to include all of the information you need. In this case I think you'll need two passes - one to get each question, and a second to parse the possible answers.
Take a look at your source text and break it down into the parts you need. Each item starts with "Question n:", then a line for the actual question, then one line for each possible response, followed by "Explanation:" and a line for the explanation. We'll use the grouping operator to extract the parts of interest.
The Question line can be described by the following pattern:
"Question ([0-9]+):\n"
The line that represents the actual question is just text:
"(.+)\n"
The collection of possible responses is a series of lines, each beginning with a special character followed by optional whitespace. I've replaced that character with '*' because I can't tell from the post what it is:
\*\s*.+\n
but we can get the whole list of them using a combination of grouping including the non-capturing group:
((?:\*\s*.+\n)+)
That causes any number of matching lines to be captured as a single group.
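As a quick check of that idea, here is a minimal sketch on a made-up string. The '*' marker is the same stand-in used above, not necessarily the character in your document, and the sample text is hypothetical:

```python
import re

# Hypothetical sample; '*' stands in for the unknown bullet character.
sample = "* two\n* four\n* five\nExplanation:\n"

# The inner (?:...) group is repeated but not captured on its own;
# the outer (...) captures the whole run of bullet lines as one string.
match = re.search(r"((?:\*\s*.+\n)+)", sample)
answers_block = match.group(1)
print(repr(answers_block))
```

The "Explanation:" line is left out of the group because it does not start with the marker.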
Finally you have "Explanation" possibly preceded by some whitespace, and followed by a line of text:
\s*Explanation:\n(.+)\n
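If we put these all together (here with \s+ after "Question" and .* for the question line, which are slightly more permissive than the pieces above), our regex pattern is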
r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
Parsing this:
patt = r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
matches = re.findall(patt, text)
yields:
[('1',
'How many ítems…',
'* two\n* four\n* five\n* ten\n',
'There are four ítems in the bag.'),
('2',
'How many books…',
'* two\n* four\n* five\n',
'There are four books in the bag.')]
Each entry is a tuple. The third item in each tuple is a single string containing all of the answers, which you'll need to break down further.
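One thing to watch for: the pattern ends in \n, while getText builds the text with '\n'.join, which puts no newline after the last paragraph. If the final explanation happens to be the last paragraph in the document, that question will probably not match. A minimal sketch of one possible workaround, using a made-up text and the '*' stand-in, is to append a newline before matching:

```python
import re

patt = r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"

# Made-up text shaped like the output of '\n'.join(...): no trailing newline.
text = "\n".join([
    "Question 1:", "How many books…", "* two", "* four",
    "Explanation:", "There are four books in the bag.",
])

# The last explanation has no '\n' after it, so nothing matches.
without_newline = re.findall(patt, text)
# Appending a newline lets the final (.+)\n succeed.
with_newline = re.findall(patt, text + "\n")
print(len(without_newline), len(with_newline))
```

If your document ends with an empty paragraph, the joined text already ends in a newline and this step is unnecessary.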
The regex to match your answers (using the character '*') is:
\*\s*(.+)\n
Grouping it to eliminate the character, we can use:
r"(?:\*\s*(.+)\n)"
Finally, using a list comprehension we can replace the string value for the answers with a list:
matches = [(num, question, re.findall(r"(?:\*\s*(.+)\n)", answers), explanation)
           for num, question, answers, explanation in matches]
Yielding the result:
[('1',
'How many ítems…',
['two', 'four', 'five', 'ten'],
'There are four ítems in the bag.'),
('2',
'How many books…',
['two', 'four', 'five'],
'There are four books in the bag.')]
Now you should be prepared to massage that into your DataFrame.
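The target layout is not visible in the post, so the following is only a sketch of that last step under assumptions: the column names (question_number, question, answers, explanation) are hypothetical, and the answers are kept as a list in a single column because the questions have different numbers of answers.

```python
# Result of the two regex passes above.
matches = [
    ('1', 'How many ítems…', ['two', 'four', 'five', 'ten'],
     'There are four ítems in the bag.'),
    ('2', 'How many books…', ['two', 'four', 'five'],
     'There are four books in the bag.'),
]

# Hypothetical column names; rename them to match the layout you need.
records = [
    {
        "question_number": int(number),
        "question": question,
        "answers": answers,          # kept as a list, since the counts differ
        "explanation": explanation,
    }
    for number, question, answers, explanation in matches
]

try:
    import pandas as pd  # third-party package
    df = pd.DataFrame(records)  # one row per question
    print(df)
except ImportError:
    # Fall back to the plain records if pandas is not installed.
    print(records)
```

If you need one column per answer instead, you could expand the lists yourself before building the DataFrame; which shape is right depends on the table you are trying to reproduce.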
Upvotes: 1