BPS

Reputation: 1231

How to tokenize source code of another programming language?

I want to parse some PHP code. I've written a regex that should split PHP code into atoms ( https://regex101.com/r/P074q8/1 ), but when I run it, Python does not split the source code the way the regex101 website does.

Why does my regex work on regex101.com but not in an actual Python script?

main.py

import re


class PHPParser:
    def __init__(self, filename):
        # read php file
        with open(filename, 'r') as f:
            self._source = f.read()

        syntax = [
            r'/\*.*?\*/',
            r'".*?"',
            r'\'.*?\'',
            r'\$[\w\d_]+', # variable name
            r'\w+', # function name
            r'return',
            r'<\?php',
            r'=>',
            r'\?>',
            r'\[',
            r'\]',
            r',',
            r';',
            r'\(',
            r'\)',
            r'\.',
            r'\n',
            r'\s',
            r'=',
            r'\W',
        ]

        s = r'(' + r'|'.join(syntax) + r')'
        print(s)
        tokens = re.split(s, self._source, re.DOTALL | re.M | re.I | re.UNICODE)

        print(tokens)


if __name__ == '__main__':
    p = PHPParser('./vendor/yiisoft/yii2/base/Widget.php')

Upvotes: 0

Views: 627

Answers (1)

Thm Lee

Reputation: 1236

You can try this:

tokens = re.findall(s, self._source, re.DOTALL | re.M | re.I | re.UNICODE)

Here I simply replaced the split() function with findall(). On regex101.com you were looking at the strings the regex matches, but in your Python script you were splitting the source at those matches instead.
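To make the difference concrete, here is a minimal sketch using a shortened, hypothetical pattern (`TOKEN_PATTERN`) and a made-up one-line PHP snippet rather than the full syntax list and file from the question. It also passes the flags by keyword, because the third positional argument of re.split() is maxsplit, not flags, so flags passed positionally to split() are not applied as flags.

```python
import re

# Hypothetical, shortened token pattern for illustration only;
# it is not the full alternation built in the question.
TOKEN_PATTERN = r'(<\?php|\?>|\$\w+|\w+|".*?"|=>|=|;|\s|\W)'

# Made-up sample input standing in for the PHP file.
source = '<?php $a = "hi"; ?>'

# findall() returns the text of every match, i.e. the tokens themselves.
tokens = re.findall(TOKEN_PATTERN, source, flags=re.DOTALL)

# split() cuts the string at every match. Because the pattern is wrapped in a
# capturing group, the matched text is kept too, interleaved with whatever
# lies between matches (here always the empty string).
pieces = re.split(TOKEN_PATTERN, source, flags=re.DOTALL)

# Dropping whitespace-only tokens leaves the meaningful atoms.
atoms = [t for t in tokens if not t.isspace()]

print(tokens)
print(pieces)
print(atoms)
```

With this input, every character belongs to some match, so the split() result is the findall() result with an empty string before, between, and after each token. That is why the output of split() looks so different from the match list shown on regex101.com.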

Upvotes: 1
