Reputation: 63

Extracting text from docx as key value pair giving attribute error

I have a docx file which looks like this:

Requisition No: VOI9053459-
 
Job location: Melbourn
 
Exp : 2 – 4 Years
 
Notice period :-15day or less

along with other details. I want to extract certain key-value pairs from the document and save them as a dictionary. The document text has been extracted and assigned to text like this:

for child in parent_elm.iterchildren():
    if isinstance(child, CT_P):
        yield Paragraph(child, parent)
text = block.text

My progress so far is

job_location = re.compile(r'(^Job?.*\S+?)')
notice_period = re.compile(r'(^Notice?.*\d\w*.+\S+?)')
experience = re.compile(r'(^Exp.*\S+?)')

job_location = job_location.search(text)
key_value1 = job_location.group()
split1 = re.split(': |-', key_value1)
keys.append(split1[0])
data.append(split1[1])

notice_period = notice_period.search(text)
key_value2 = notice_period.group()
split2 = re.split(': |-', key_value2)
keys.append(split2[0])
data.append(split2[1])

experience = experience.search(text)
key_value3 = experience.group()
split3 = re.split(': |-', key_value3)
keys.append(split3[0])
data.append(split3[1])

for key in keys:
    col.append((key, []))
i = 0
for j in range(len(data)):
    T = data[j]

    col[i][1].append(T)
    i += 1
Dict = {keys: data for (keys, data) in col}

print(Dict)

I am getting the attribute error

> AttributeError                            Traceback (most recent call last)
> <ipython-input-261-84c60112ddb2> in <module>
>      82 
>      83 
> ---> 84 convert_docx_to_text(file_path=(r'data_extraction.docx'))
> 
> <ipython-input-261-84c60112ddb2> in convert_docx_to_text(file_path)
>      51 
>      52             job_location=job_location.search(text)
> ---> 53             key_value1=job_location.group()
>      54             split1=re.split(': |-',key_value1)
>      55             keys.append(split1[0])
> 
> AttributeError: 'NoneType' object has no attribute 'group'

Why is it not working? Any help is appreciated. Thanks

Upvotes: 0

Answers (3)

0dminnimda

Reputation: 1463

Pay attention to the line job_location = job_location.search(text). The search found no match, so job_location is None, and calling .group() on None raises the AttributeError. Either change the text or the pattern so that it matches, or check job_location for None before using it.
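A minimal sketch of that None check, assuming text holds a single paragraph that has no "Job" line (the variable name job_location_line is only illustrative):

```python
import re

# Assumption: this paragraph is one of those that does not start with "Job".
text = "Requisition No: VOI9053459-"

match = re.search(r"^Job.*", text)
if match is None:
    # No match: there is nothing to call .group() on, so skip this paragraph.
    job_location_line = None
else:
    job_location_line = match.group()

print(job_location_line)
```

If the question's code runs once per paragraph, every paragraph other than the job location one would take the None branch here instead of raising.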

Upvotes: 0

Booboo

Reputation: 44148

There are several problems with your regular expressions. Let's take the regex for searching for the job location. You have:

r'(^Job?.*\S+?)'

First, without using flags=re.MULTILINE, the ^ character will only match the start of string rather than the start of a line.
Job? matches Jo optionally followed by a b.
In the absence of flags=re.DOTALL, .* will greedily match any non-newline character 0 or more times.
\S+? will lazily match 1 or more non-white space characters, that is, as few as possible.

For example, your regex would match the line: Joabcdefg with .* matching abcdef and \S+? matching g.
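These points can be checked with a short sketch. The two-line text is a shortened stand-in for the document, and the variable names are made up for illustration:

```python
import re

text = "Requisition No: VOI9053459-\nJob location: Melbourn"
pattern = r'(^Job?.*\S+?)'

# Without re.MULTILINE, ^ can only match at the very start of the string.
no_flag = re.search(pattern, text)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.search(pattern, text, re.MULTILINE)

# The pattern is loose enough to accept an unrelated word starting with "Jo".
loose = re.search(pattern, "Joabcdefg")

print(no_flag)
print(with_flag.group())
print(loose.group())
```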

You also call the group method of the match object without an argument, which returns the entire match (group 0), key included. To get only the value, put it in a capturing group and pass that group's number to the method, for example group(1).

The regex you need to search for a job location is given in the following example:

import re

text = """Requisition No: VOI9053459-

Job location: Melbourn

Exp : 2 – 4 Years

Notice period :-15day or less"""

job_location_re = re.compile(r'(?:^Job\s+location:\s+)(.*)$', re.MULTILINE)
m = job_location_re.search(text)
if m: # there is a match
    job_location = m.group(1)
    print(job_location)

Prints:

Melbourn

Notice that I have called the compiled regex job_location_re rather than using the name job_location for both the regex and the name of the location.

(?:^Job\s+location:\s+) matches Job location: at the start of a line, allowing one or more whitespace characters between Job and location: and after location:. This is done in a non-capturing group.
(.*) matches greedily any non-newline character up to the end of line. This will be group 1.
$ matches the end of line.
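The same idea can be extended to the other fields and collected into a dictionary. This is only a sketch: the patterns dictionary, the details name and the separator handling (optional spaces, a colon and an optional hyphen) are assumptions based on the four sample lines, not something the document format guarantees.

```python
import re

text = """Requisition No: VOI9053459-

Job location: Melbourn

Exp : 2 – 4 Years

Notice period :-15day or less"""

# One compiled pattern per field. [ \t]* is used instead of \s* so that the
# separator can never run across a line break.
patterns = {
    "Job location": re.compile(r'^Job[ \t]+location[ \t]*:[ \t]*-?[ \t]*(.*)$', re.MULTILINE),
    "Exp": re.compile(r'^Exp[ \t]*:[ \t]*-?[ \t]*(.*)$', re.MULTILINE),
    "Notice period": re.compile(r'^Notice[ \t]+period[ \t]*:[ \t]*-?[ \t]*(.*)$', re.MULTILINE),
}

details = {}
for key, pattern in patterns.items():
    m = pattern.search(text)
    if m:  # a missing field is skipped instead of raising AttributeError
        details[key] = m.group(1).strip()

print(details)
```

Keeping the key names in the dictionary rather than parsing them out of the text avoids having to clean up stray spaces and colons afterwards.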

Upvotes: 1

Wonka

Reputation: 1886

Once the search actually finds a match, this should work.

Edit 1: improved the code to split only at the first occurrence of the separator.

your_dict = {}
# maxsplit=1 splits only at the first ': ' or '-'
split3 = re.split(': |-', key_value3, maxsplit=1)
#keys.append(split3[0])
#data.append(split3[1])
k, v = split3
your_dict[k] = v
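For reference, a self-contained sketch of this splitting approach applied to all three lines. The lines list stands in for the matched strings from the question, and the extra strip calls are an addition to tidy the leftover spaces and colon in the keys:

```python
import re

# Stand-ins for key_value1, key_value2 and key_value3 from the question.
lines = [
    "Job location: Melbourn",
    "Exp : 2 – 4 Years",
    "Notice period :-15day or less",
]

your_dict = {}
for line in lines:
    # maxsplit=1 keeps any later ': ' or '-' inside the value.
    k, v = re.split(': |-', line, maxsplit=1)
    # Remove the trailing space or colon that the split leaves on the key.
    your_dict[k.strip(' :')] = v.strip()

print(your_dict)
```

Note that the two-name unpacking raises a ValueError for a line that contains neither separator, so such lines would need to be filtered out first.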

Upvotes: 0
