El-VeRdUgO

Reputation: 41

Extract a string between two other strings in Python

I am trying to extract the comments from an FDF file (a PDF comment file). In practice, this means extracting the text that lies between two other strings. I did the following:

  1. I open the FDF file with the following code:

    import re
    import os
    
    os.chdir("currentworkingdirectory")
    # Read the whole FDF file into one string
    with open("comentarios.fdf", "r") as archcom:
        cadena = archcom.read()
    
  2. With the opened file, I create a string called cadena with all the info I need. For example:

    cadena = "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<</W 3.0>>\nendobj\n219 0 obj\n<</W 3.0>>\nendobj\ntrailer\n<</Root 1 0 R>>\n%%EOF\n"
    
  3. I try to extract the needed info with the following line:

    a = re.findall(r"nendobj(.*?)W 3\.0",cadena)
    

Trying to get:

a = "n216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n218 0 obj\n<<"

But I got:

a = []

The problem is in the line a = re.findall(r"nendobj(.*?)W 3\.0",cadena), but I can't see where. I have tried many combinations with no success.

Upvotes: 0

Views: 63

Answers (1)

Thomas Weller

Reputation: 59207

It seems to me that there are two problems:

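a) You are searching for nendobj, but that n is actually part of the line break \n, which is a single newline character. The literal text nendobj never occurs in the string, so the pattern cannot match. For the same reason, the output will not start with an n.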

b) The text you're looking for spans several lines, and by default . does not match a newline, so you need the re.DOTALL flag.
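Both points can be checked directly. The following is a minimal sketch; the sample variable is a shortened, hypothetical stand-in for cadena, not the real file contents.

```python
import re

# Shortened, hypothetical stand-in for the question's cadena string.
sample = "1 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n2 0 obj\n<</W 3.0>>\nendobj\n"

# a) "\n" is one newline character, so the literal text "nendobj" never occurs.
has_nendobj = "nendobj" in sample
has_newline_endobj = "\nendobj" in sample

# b) Without re.DOTALL, "." stops at newlines, so the match cannot cross lines.
without_dotall = re.findall(r"endobj(.*?)W 3\.0", sample)
with_dotall = re.findall(r"endobj(.*?)W 3\.0", sample, re.DOTALL)

print(has_nendobj, has_newline_endobj)
print(without_dotall)
print(with_dotall)
```

Here the search without the flag finds nothing, while the search with re.DOTALL returns the text between endobj and W 3.0, newlines included.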

Final code:

    a = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)

Also note that there will be a second match, because W 3.0 appears twice in the string. This is confirmed by Regex101.
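To see both results, here is a small sketch that runs the pattern against the sample string from the question. The names matches, first and first_span are only illustrative, and using re.search to keep just the first span is one possible option, not something the question asked for.

```python
import re

# The sample string from the question, split across lines for readability.
cadena = (
    "\n215 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n"
    "216 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n"
    "217 0 obj\n<</D[2.0 2.0]/S/D>>\nendobj\n"
    "218 0 obj\n<</W 3.0>>\nendobj\n"
    "219 0 obj\n<</W 3.0>>\nendobj\n"
    "trailer\n<</Root 1 0 R>>\n%%EOF\n"
)

# findall returns one captured group per non-overlapping match.
matches = re.findall(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)

print(len(matches))
for m in matches:
    print(repr(m))

# If only the first span is wanted, re.search stops at the first match.
first = re.search(r"endobj(.*?)W 3\.0", cadena, re.DOTALL)
first_span = first.group(1) if first else None
```

The first result starts with a newline rather than an n, and ends with <</ because the slash sits directly before W 3.0 in the input.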

Upvotes: 1
