Reputation: 63
I have a Docx file which looks like this:
Requisition No: VOI9053459-
Job location: Melbourn
Exp : 2 – 4 Years
Notice period :-15day or less
along with other details. I want to extract certain key-value pairs from the document and save them as a dictionary. The document text has been extracted like this:
for child in parent_elm.iterchildren():
    if isinstance(child, CT_P):
        yield Paragraph(child, parent)
text = block.text
My progress so far is
job_location = re.compile(r'(^Job?.*\S+?)')
notice_period = re.compile(r'(^Notice?.*\d\w*.+\S+?)')
experience = re.compile(r'(^Exp.*\S+?)')
job_location = job_location.search(text)
key_value1 = job_location.group()
split1 = re.split(': |-', key_value1)
keys.append(split1[0])
data.append(split1[1])
notice_period = notice_period.search(text)
key_value2 = notice_period.group()
split2 = re.split(': |-', key_value2)
keys.append(split2[0])
data.append(split2[1])
experience = experience.search(text)
key_value3 = experience.group()
split3 = re.split(': |-', key_value3)
keys.append(split3[0])
data.append(split3[1])
for key in keys:
    col.append((key, []))
i = 0
for j in range(len(data)):
    T = data[j]
    col[i][1].append(T)
    i += 1
Dict = {keys: data for (keys, data) in col}
print(Dict)
I am getting the attribute error
> AttributeError Traceback (most recent call last)
> <ipython-input-261-84c60112ddb2> in <module>
> 82
> 83
> ---> 84 convert_docx_to_text(file_path=(r'data_extraction.docx'))
>
> <ipython-input-261-84c60112ddb2> in convert_docx_to_text(file_path)
> 51
> 52 job_location=job_location.search(text)
> ---> 53 key_value1=job_location.group()
> 54 split1=re.split(': |-',key_value1)
> 55 keys.append(split1[0])
>
> AttributeError: 'NoneType' object has no attribute 'group'
Why is it not working? Any help is appreciated. Thanks
Upvotes: 0
Views: 271
Reputation: 1463
Pay attention to this line: job_location = job_location.search(text). The search was unsuccessful, so job_location is None, and calling group on None raises the error. You either need to change the text being searched or check job_location before using it.
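A minimal sketch of that check, assuming a made-up sample string in place of the real paragraph text; the helper name find_line and the simplified pattern are hypothetical and only serve the illustration:

```python
import re

# Sample text standing in for one paragraph of the document (assumption).
text = "Requisition No: VOI9053459-"

# Simplified pattern for illustration, not the one from the question.
job_location_re = re.compile(r'^Job.*')

def find_line(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    if m is None:  # search() returns None when nothing matches
        return None
    return m.group()

print(find_line(job_location_re, text))                      # None
print(find_line(job_location_re, "Job location: Melbourn"))  # Job location: Melbourn
```

Keeping the compiled pattern and the match result in separate names also avoids overwriting the regex with None.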
Upvotes: 0
Reputation: 44148
There are several problems with your regular expressions. Let's take the regex for searching for the job location. You have:
r'(^Job?.*\S+?)'
1. Without flags=re.MULTILINE, the ^ character will only match the start of the string rather than the start of a line.
2. Job? matches Jo optionally followed by a b.
3. Without flags=re.DOTALL, .* will greedily match any non-newline character 0 or more times.
4. \S+? will lazily match 1 or more non-whitespace characters (at least one is required).
For example, your regex would match the line Joabcdefg, with .* matching abcdef and \S+? matching g.
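You also call the group method of the match object without an argument. That returns the entire match (group 0); to get just a captured group, pass its number, for example group(1). The AttributeError itself arises because search returned None, so there is no match object to call group on.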
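The effect of re.MULTILINE on ^ can be checked directly. The following is a small sketch, assuming the sample lines from the question are held in one multi-line string (the real document text may be split differently):

```python
import re

# Sample lines from the question joined into one string (assumption).
text = """Requisition No: VOI9053459-
Job location: Melbourn
Exp : 2 – 4 Years
Notice period :-15day or less"""

pattern = r'(^Job?.*\S+?)'  # the pattern from the question

# Without re.MULTILINE, ^ anchors only at the very start of the string,
# which begins with "Requisition", so there is no match.
no_flag = re.search(pattern, text)
print(no_flag)  # None

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.search(pattern, text, re.MULTILINE)
print(with_flag.group())   # entire match: Job location: Melbourn
print(with_flag.group(1))  # group 1 spans the whole pattern here, so the same text
```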
The regex you need to search for a job location is given in the following example:
import re
text = """Requisition No: VOI9053459-
Job location: Melbourn
Exp : 2 – 4 Years
Notice period :-15day or less"""
job_location_re = re.compile(r'(?:^Job\s+location:\s+)(.*)$', re.MULTILINE)
m = job_location_re.search(text)
if m:  # there is a match
    job_location = m.group(1)  # group 1 holds the text after "Job location:"
    print(job_location)
Prints:
Melbourn
Notice that I have called the compiled regex job_location_re rather than using the name job_location for both the regex and the name of the location.
1. (?:^Job\s+location:\s+) matches Job location: at the start of a line, allowing one or more whitespace characters between Job and location: and after location:. This is done in a non-capturing group.
2. (.*) matches greedily any non-newline character up to the end of the line. This will be group 1.
3. $ matches the end of the line.
Upvotes: 1
Reputation: 1886
Once the regex matches and you have the matched line, this should work.
Edit 1: to improve the code, split only at the first occurrence of the separator:
split3=re.split(': |-',key_value3, 1)
#keys.append(split3[0])
#data.append(split3[1])
k,v = split3
your_dict[k] = v
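Put together, the idea might look like the sketch below. It assumes the whole document text is available as one multi-line string and that the wanted lines start with Job, Exp, or Notice; the name wanted and the stripping of stray spaces and colons from the key are additions for illustration, not part of the snippet above.

```python
import re

# Sample lines from the question joined into one string (assumption).
text = """Requisition No: VOI9053459-
Job location: Melbourn
Exp : 2 – 4 Years
Notice period :-15day or less"""

wanted = ("Job", "Exp", "Notice")  # hypothetical prefixes of the lines to keep
your_dict = {}

for line in text.splitlines():
    if not line.startswith(wanted):
        continue
    # Split only at the first ": " or "-" so later separators stay in the value.
    parts = re.split(r': |-', line, maxsplit=1)
    if len(parts) != 2:  # no separator found, skip the line
        continue
    k, v = parts
    # Remove leftover spaces and colons such as the one in "Notice period :".
    your_dict[k.strip(" :")] = v.strip()

print(your_dict)
```

Because the range in the Exp line uses an en dash rather than a hyphen, it is not treated as a separator and stays inside the value.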
Upvotes: 0