Reputation: 41
I am trying to extract the comments from a fdf (PDF comment file). In practice, this is to extract a string between other two. I did the following:
I open the fdf file with the following command:
import re
import os
os.chdir("currentworkingdirectory")
archcom =open("comentarios.fdf", "r")
cadena = archcom.read()
With the opened file, I create a string called cadena with all the info I need. For example:
cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
I try to extract the needed info with the following line:
a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
Trying to get:
a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"
But I got:
a = []
The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
but I don't realize where. I have tried many combinations with no success.
Upvotes: 0
Views: 63
Reputation: 59207
It seems to me that there are 2 problems:
a) you are looking for nendobj
, but the N is actually part of the line break \n
. Thus you'll also not get a leading N in the output, because there is no N.
b) Since the text you're looking for crosses some newlines, you need the re.DOTALL
flag
Final code:
a = re.findall("endobj(.*?)W 3\.0",cadena, re.DOTALL)
Also note, that there will be a second result, confirmed by Regex101.
Upvotes: 1