Vinicius Tavares

Reputation: 653

PHP reverse regex match

I am having real trouble reading a large text file (around 12 MB) with PHP. I have to match a regex, then search backwards from that match for the first occurrence of another regex, and then extract the string between the two matches. Here is a real example:

PROCESSO:583.00.2012.105981
No ORDEM:01.19.2012/000154
CLASSE:PROCEDIMENTO SUMÁRIO (EM GERAL)
REQUERENTE:ASSETJ ASSOCIAÇÃO DOS SERVIDORES DO TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO
ADVOGADO:273919/SP - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:583.00.2012.105970
No ORDEM:01.07.2012/000134
CLASSE:PROCEDIMENTO ORDINÁRIO (EM GERAL)
REQUERENTE:CARLOS NEUMANN
ADVOGADO:79117/SP - ROSANA CHIAVASSA
Requerido:SUL AMÉRICA SEGURO SAÚDE S/A
VARA:7a. VARA CÍVEL

The script should find this code: 273919/SP (regex: [0-9]{6}/SP), then search backwards for the code 583.00.2012.105981 (regex: [0-9]{3}\.[0-9]{2}\.[0-9]{4}\.[0-9]{6})

And then get all the text between them.

I can't do a single preg_match with both regexes in the same pattern, because some blocks in the file contain more than one code of the 273919/SP type, and that would mess everything up.

What can I do? Do you have any ideas?

Sorry if my regex is crappy, I am new to it and it is very difficult to learn :P

EDIT:

Please check another form in which the code appears:

583.00.2012.100905-6/000000-000 - no ordem 82/2012 - Procedimento Sumário (em geral) - JOSE APARECIDO DOS
SANTOS X SEGURADORA LIDER DOS CONSORCIOS DO SEGUROS DPVAT S/A - Fls. 79 - Demonstre o autor, por meio
de documento idôneo (declaração de bens e renda e comprovante de pagamento), a necessidade de obtenção do benefício
da justiça gratuita, a fim de ser cumprido o disposto no artigo 5o, LXXIV da CF. Após, tornem os autos conclusos. Int. - ADV
GUILHERME DIAS GONÇALVES OAB/SP 302632 - ADV TIAGO RAFAEL OLIVEIRA ALEGRE OAB/SP 302811

That is my problem. Now I have two occurrences, OAB/SP 302632 and OAB/SP 302811, and I need to get the last one and extract the text between the id 583.00.2012.100905-6/000000-000 and OAB/SP 302811.

Those numbers aren't fixed, so I can't search for the literal OAB/SP 302811; I have to search for OAB\/SP\s\d{6}.

Upvotes: 0

Views: 2388

Answers (6)

ghoti

Reputation: 46846

You're trying to extract the lines between PROCESSO and ADVOGADO for each record, where records are identified by a new PROCESSO line?

For a very large consistently formatted text file like this, I wouldn't use regexp this way at all. I'd use standard file handling and do my own record keeping.

<?php

$fh = fopen("/path/to/file.txt", "r");

$keep = 0;
$buffer = "";

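// Read whole lines; a fixed length such as 80 would split longer lines,
// and comparing against false keeps a line containing only "0" from ending the loop.
while (($line = fgets($fh)) !== false) {
  if (strpos($line, "PROCESSO:") !== FALSE) {
    $keep = 1; // a new record starts: begin buffering
    continue;
  }
  if (strpos($line, "ADVOGADO:") !== FALSE) {
    print $buffer; // or do whatever you want with it
    $keep = 0;
    $buffer = "";
    continue;
  }
  if ($keep == 1) {
    $buffer .= $line;
  }
}

fclose($fh);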

?>
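
If you also need the process number, and only for one particular lawyer code, the same record keeping can be wrapped in a function. This is only a sketch under assumptions: collectRecords and its parameters are names made up for illustration, and it assumes every record starts with a PROCESSO: line and that the lawyer code sits right after ADVOGADO:, as in the question's first sample.

```php
<?php
// Hypothetical helper (not from the original answer): maps each process id
// to the lines between its PROCESSO: line and the matching ADVOGADO: line.
function collectRecords($fh, string $lawyerCode): array
{
    $found = [];
    $currentId = null;
    $buffer = '';
    while (($line = fgets($fh)) !== false) {
        if (preg_match('~^PROCESSO:(\d{3}\.\d{2}\.\d{4}\.\d{6})~', $line, $m)) {
            $currentId = $m[1]; // a new record starts here
            $buffer = '';
            continue;
        }
        if ($currentId !== null && strpos($line, 'ADVOGADO:' . $lawyerCode) === 0) {
            $found[$currentId] = $buffer; // everything seen since the id line
            continue;
        }
        if ($currentId !== null) {
            $buffer .= $line;
        }
    }
    return $found;
}

// Demo on an in-memory stream; a real file handle from fopen() works the same way.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "PROCESSO:583.00.2012.105981\nNo ORDEM:01.19.2012/000154\nADVOGADO:273919/SP - THIAGO PUGINA\n");
rewind($fh);
print_r(collectRecords($fh, '273919/SP'));
fclose($fh);
```

Because the file is read one line at a time, memory use stays small even for a 12 MB input.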

Upvotes: 1

Qtax

Reputation: 33908

You have two expressions, re1 and re2, and you want to match re1 and then find the first re2 match before it, and get the content between them.

Assuming that there's always a re2 match before a re1 match, then this is equivalent to: Match re2, followed by a string not containing any re2 matches and capturing it, followed by a re1 match.

This can be written as:

(?s)re2((?:(?!re2).)*?)re1

If re1 is \d{6}/SP and re2 is \d{3}\.\d{2}\.\d{4}\.\d{6} you get:

(?s)(\d{3}\.\d{2}\.\d{4}\.\d{6})((?:(?!\d{3}\.\d{2}\.\d{4}\.\d{6}).)*?)(\d{6}/SP)

I've put the re1 and re2 matches in capturing groups here in case you'd want their values as well.
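
In PHP, that pattern could be applied roughly as follows. This is a sketch, not a definitive implementation: extractBetween and the $lastRe1 switch are made-up names, and the greedy variant is my assumption about how to handle the "last OAB/SP code in the block" case from the question's edit. It also assumes neither expression contains a ~ or its own capturing groups.

```php
<?php
// Hypothetical helper: builds "re2, then text containing no re2, then re1"
// and returns every match as [full, re2, between, re1].
function extractBetween(string $re2, string $re1, string $text, bool $lastRe1 = false): array
{
    // Lazy (*?) stops at the first re1 after re2; greedy (*) runs on to the
    // last re1 that still comes before the next re2.
    $quant = $lastRe1 ? '*' : '*?';
    $pattern = '~(' . $re2 . ')((?:(?!' . $re2 . ').)' . $quant . ')(' . $re1 . ')~s';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    return $matches;
}

// First format from the question (shortened sample).
$sample = "PROCESSO:583.00.2012.105981\nNo ORDEM:01.19.2012/000154\nADVOGADO:273919/SP - THIAGO PUGINA\n";
foreach (extractBetween('\d{3}\.\d{2}\.\d{4}\.\d{6}', '\d{6}/SP', $sample) as $m) {
    echo $m[1], ' => ', $m[3], "\n";
}
```

For the second format, calling it with '\d{3}\.\d{2}\.\d{4}\.\d{6}-\d/\d{6}-\d{3}' as re2, 'OAB/SP\s\d{6}' as re1 and $lastRe1 set to true should pick the final OAB/SP code of each block.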

Upvotes: 2

mario

Reputation: 145482

I would assume it is actually as simple as just looking for the two keys/id tokens and fetching the text block in between with an .*? substitute:

 preg_match_all('~

     (?: ^  PROCESSO:  \d+(?:\.\d+){3}  \s* )
   ( (?: ^  [\w\s]+:   .*               \s* )+ )  # multiple lines in between
     (?: ^  ADVOGADO:  273919/SP            )

     ~mx',
     $input, $matches
 )
 and print_r($matches);

This looks for your data block and returns the middle part in $matches[1], so you could use end($matches[1]) to get the last entry for the 273919/SP id. You probably don't need such a strict pattern for the inner text; it is there mainly as an illustration, and to avoid matching across the empty lines.

But in essence, you don't "match in reverse", but simply make it more specific for the inner part. Then you can just list the two things you want to search for in the very order they would occur in your file.

Upvotes: 1

FtDRbwLXw6

Reputation: 28889

I don't see why you have to do some weird backwards search. Just do something like this:

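$search = '273919'; // assume this would come from user input of some sort?
// (?!PROCESSO:) keeps the match inside one record; a plain .+? would start at the first PROCESSO in the file
preg_match('#PROCESSO:(\d{3}\.\d{2}\.\d{4}\.\d{6})(?:(?!PROCESSO:).)+?ADVOGADO:' . preg_quote($search, '#') . '/SP#ms', $fileContents, $matches);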
echo $matches[1]; // 583.00.2012.105981

Upvotes: 1

Susam Pal

Reputation: 34204

<?php

$txt = <<<TEXT
PROCESSO:583.00.2012.105981
No ORDEM:01.19.2012/000154
CLASSE:PROCEDIMENTO SUMÁRIO (EM GERAL)
REQUERENTE:ASSETJ ASSOCIAÇÃO DOS SERVIDORES DO TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO
ADVOGADO:273919/SP - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:583.00.2012.105970
No ORDEM:01.07.2012/000134
CLASSE:PROCEDIMENTO ORDINÁRIO (EM GERAL)
REQUERENTE:CARLOS NEUMANN
ADVOGADO:79117/SP - ROSANA CHIAVASSA
Requerido:SUL AMÉRICA SEGURO SAÚDE S/A
VARA:7a. VARA CÍVEL
TEXT;

$matches = array();
preg_match('/[0-9]{6}\/SP(.*?)[0-9]{3}\.[0-9]{2}\.[0-9]{4}\.[0-9]{6}/s', $txt, $matches);
echo $matches[1];
?>

Output:

 - THIAGO PUGINA
Requerido:TIM CELULAR S/A E OUTRO
VARA:19a. VARA CÍVEL

PROCESSO:

Upvotes: 0

Jeremy Harris

Reputation: 24549

It seems your data has a repeating pattern. If so, you could explode() it into an array and process each array element individually which effectively limits the scope of your regex calls.

// Get data
$file_data = file_get_contents('/path/to/my/file.txt');

// Explode data into chunks using repeated delimiter
$data = explode("PROCESSO:", $file_data);

// Process array
foreach($data as $chunk)
{
    // Perform regex functions on $chunk here
}
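
As a rough illustration of what the loop body might do, here is a sketch that pulls the process number, the lawyer code, and the text between them out of each chunk. The function name chunksWithLawyer and the returned array keys are made up for this example, and the patterns are taken from the question's first sample format.

```php
<?php
// Hypothetical helper: splits on the repeated delimiter, then runs one small
// regex per chunk, so a match can never spill over into the next record.
function chunksWithLawyer(string $fileData): array
{
    $results = [];
    foreach (explode('PROCESSO:', $fileData) as $chunk) {
        // Skip chunks with no process number at the start (such as whatever
        // precedes the first delimiter) or with no six-digit lawyer code.
        if (!preg_match('~^(\d{3}\.\d{2}\.\d{4}\.\d{6})(.*?)(\d{6}/SP)~s', $chunk, $m)) {
            continue;
        }
        $results[] = ['id' => $m[1], 'between' => $m[2], 'code' => $m[3]];
    }
    return $results;
}

// Shortened sample data standing in for file_get_contents() output.
$data = "PROCESSO:583.00.2012.105981\nNo ORDEM:01.19.2012/000154\nADVOGADO:273919/SP - THIAGO PUGINA\n";
print_r(chunksWithLawyer($data));
```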

Upvotes: -1
