Ley

Reputation: 77

Extract text from Word and convert into DataFrame

I need to extract a specific portion of the text in a Word (.docx) document. The document has the following structure:

Question 1:
How many ítems…
 two
 four
 five
 ten
Explanation:
There are four ítems in the bag.
Question 2:
How many books…
 two
 four
 five

Explanation:
There are four books in the bag.

With this information I have to create a DataFrame like this one: [image not available]

I'm able to open the document, extract the text, and print the lines that start with the special leading character, but I'm not able to extract the rest of the strings of interest and create the DataFrame.

My code is:

import re

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

text=getText('document.docx')
text
strings = re.findall(r" (.+)\n", text)

Any help? Thanks in advance

Upvotes: 0

Views: 180

Answers (1)

RichGoldMD

Reputation: 1282

I would suggest you expand your regular expression to include all of the information you need. In this case I think you'll need two passes - one to get each question, and a second to parse the possible answers.

Take a look at your source text and break it down into the parts you need. Each item starts with Question n:, then a line for the actual question, then one line for each possible response, followed by Explanation: and a line for the explanation. We'll use the grouping operator to extract the parts of interest.

The Question line can be described by the following pattern:

"Question ([0-9]+):\n" 

The line that represents the actual question is just text:

"(.+)\n"

The collection of possible responses is a series of lines, each beginning with a special character followed by optional whitespace. (I've replaced the character with '*' because I can't tell what it is from the post.)

\*\s*.+\n

but we can capture the whole list at once by wrapping a repeated non-capturing group in a capturing group:

((?:\*\s*.+\n)+)

That causes any number of matching lines to be captured as a single group.
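
To see why the non-capturing inner group matters, here is a minimal sketch comparing the two forms on a made-up three-line block (the sample string is an assumption, using '*' as the marker):

```python
import re

# Hypothetical sample of the answer lines, with '*' standing in for the marker.
block = "* two\n* four\n* five\n"

# Non-capturing inner group inside a capturing group: the whole run is kept.
whole = re.search(r"((?:\*\s*.+\n)+)", block).group(1)

# A capturing group repeated directly: only the last repetition is kept.
last = re.search(r"(\*\s*.+\n)+", block).group(1)

print(repr(whole))  # '* two\n* four\n* five\n'
print(repr(last))   # '* five\n'
```

A repeated capturing group only remembers its final iteration, so the outer group is what preserves every answer line.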

Finally you have "Explanation" possibly preceded by some whitespace, and followed by a line of text:

\s*Explanation:\n(.+)\n

If we put these all together, our regex pattern is

r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"

Parsing this:

patt = r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
matches = re.findall(patt, text)

yields:

[('1',
  'How many ítems…',
  '* two\n* four\n* five\n* ten\n',
  'There are four ítems in the bag.'),
 ('2',
  'How many books…',
  '* two\n* four\n* five\n',
  'There are four books in the bag.')]

Each entry is a tuple. The third item in each tuple is a single string containing all of the answers, which you'll need to break down further.
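
If keeping track of tuple positions gets awkward, one variation is to give each group a name and collect dictionaries instead. This is only a sketch: the group names (number, question, answers, explanation) are my own choice, and the sample text assumes the marker has been replaced with '*' and that the text ends with a newline.

```python
import re

# Sample text mirroring the post, with '*' standing in for the unknown marker.
text = (
    "Question 1:\nHow many ítems…\n* two\n* four\n* five\n* ten\n"
    "Explanation:\nThere are four ítems in the bag.\n"
    "Question 2:\nHow many books…\n* two\n* four\n* five\n\n"
    "Explanation:\nThere are four books in the bag.\n"
)

# Same pattern as above, split across lines and with named groups.
patt = re.compile(
    r"Question\s+(?P<number>[0-9]+):\n"
    r"(?P<question>.*)\n"
    r"(?P<answers>(?:\*\s*.+\n)+)"
    r"\s*Explanation:\n"
    r"(?P<explanation>.+)\n"
)

# One dict per question, keyed by group name.
records = [m.groupdict() for m in patt.finditer(text)]
print(records[0]["question"])
```

Each record's "answers" value is still one string at this point, so the second pass described next is needed either way.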

The regex to match your answers (using the character '*') is:

\*\s*(.+)\n

The capturing group (.+) keeps only the answer text and drops the leading character. Wrapped in an (optional) non-capturing group, the pattern is:

r"(?:\*\s*(.+)\n)"

Finally, using a list comprehension we can replace the string value for the answers with a list:

matches = [
    (number, question, re.findall(r"(?:\*\s*(.+)\n)", answers), explanation)
    for number, question, answers, explanation in matches
]

Yielding the result:

[('1',
  'How many ítems…',
  ['two', 'four', 'five', 'ten'],
  'There are four ítems in the bag.'),
 ('2',
  'How many books…',
  ['two', 'four', 'five'],
  'There are four books in the bag.')]

Now you should be prepared to massage that into your dataframe.
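
I can't see the target layout in the post, so the following is only a guess at one possible shape: one row per question, with each answer in its own column. The column names (number, question, answer_1, ..., explanation) are hypothetical.

```python
# The parsed result from above, repeated here so the snippet runs on its own.
matches = [
    ("1", "How many ítems…", ["two", "four", "five", "ten"],
     "There are four ítems in the bag."),
    ("2", "How many books…", ["two", "four", "five"],
     "There are four books in the bag."),
]

rows = []
for number, question, answers, explanation in matches:
    # Hypothetical column names; adjust them to match the layout you need.
    row = {"number": int(number), "question": question, "explanation": explanation}
    # Spread the answers over answer_1, answer_2, ... columns.
    for i, answer in enumerate(answers, start=1):
        row[f"answer_{i}"] = answer
    rows.append(row)

print(rows[1])
```

Passing rows to pandas.DataFrame(rows) should then give one row per question; questions with fewer answers (such as question 2 here, which has no answer_4) end up with NaN in the missing columns.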

Upvotes: 1
