Allen Qin

Reputation: 19957

Convert a string representation of a list of tuples to a list when elements are not quoted

I have a string like the one below:

s = '[(a,b),(c,d),(e,f)]'

How can I convert it to the list below?

[('a','b'),('c','d'),('e','f')]

Please note that the elements in the string are not quoted.

I know it can be done with a lot of splits or a regex, but is there another way to evaluate it as a list?

Upvotes: 2

Views: 824

Answers (3)

jakevdp

Reputation: 86433

A general solution to this would require implementing a parser, but your simple example can be solved with a regex and a list comprehension:

>>> import re
>>> [tuple(x.split(',')) for x in re.findall(r"\((.*?)\)", s)]
[('a', 'b'), ('c', 'd'), ('e', 'f')]

If you want to use Python's parser to do the parsing for you, you could do something like this:

>>> import ast
>>> parsed = ast.parse(s)
>>> [tuple(el.id for el in t.elts) for t in parsed.body[0].value.elts]
[('a', 'b'), ('c', 'd'), ('e', 'f')]

Though keep in mind both these approaches assume your input has a very particular structure.
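
To make that assumption explicit, the ast approach can be wrapped in a function that checks the shape of the tree before reading from it. This is only a minimal sketch: the function name parse_unquoted_tuples is made up for illustration, and it accepts nothing but a flat list of tuples of bare names.

```python
import ast


def parse_unquoted_tuples(text):
    """Turn '[(a,b),(c,d)]' into [('a', 'b'), ('c', 'd')] via Python's parser.

    Hypothetical helper: only a list of tuples of bare names is accepted.
    """
    # mode="eval" parses a single expression; tree.body is that expression.
    tree = ast.parse(text, mode="eval")
    node = tree.body
    if not isinstance(node, ast.List):
        raise ValueError("expected a list literal")

    result = []
    for item in node.elts:
        if not isinstance(item, ast.Tuple):
            raise ValueError("expected every list element to be a tuple")
        names = []
        for el in item.elts:
            if not isinstance(el, ast.Name):
                raise ValueError("expected bare names inside each tuple")
            names.append(el.id)
        result.append(tuple(names))
    return result


print(parse_unquoted_tuples('[(a,b),(c,d),(e,f)]'))
# [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

Anything nested, such as '[(a,(b,c))]', raises ValueError here instead of failing with an AttributeError on a missing .id.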


The most complete approach would be to implement a parser specific to the form of the input that you expect, using a tool like PLY (https://www.dabeaz.com/ply/).

Here is an example: you can put this parsing code in a module named parser.py:

# parser.py
import os

import ply.lex as lex
import ply.yacc as yacc

class ParserBase:
    """
    Base class for a lexer/parser that has the rules defined as methods
    """
    def __init__(self, **kw):
        self.debug = kw.get("debug", 0)
        modname = (
            os.path.split(os.path.splitext(__file__)[0])[1]
            + "_"
            + self.__class__.__name__
        )
        self.debugfile = modname + ".dbg"
        self.tabmodule = modname + "_" + "parsetab"

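        # Build the lexer and parser, keeping references on the instance
        # rather than relying on PLY's module-level globals
        self.lexer = lex.lex(module=self, debug=self.debug)
        self.parser = yacc.yacc(
            module=self,
            debug=self.debug,
            debugfile=self.debugfile,
            tabmodule=self.tabmodule,
        )

    def parse(self, expression):
        return self.parser.parse(expression, lexer=self.lexer)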


class Parser(ParserBase):

    tokens = (
        "NAME",
        "COMMA",
        "LPAREN",
        "RPAREN",
        "LBRACKET",
        "RBRACKET",
    )

    # Tokens

    t_COMMA = r","
    t_LPAREN = r"\("
    t_RPAREN = r"\)"
    t_LBRACKET = r"\["
    t_RBRACKET = r"\]"
    t_NAME = r"[a-zA-Z_][a-zA-Z0-9_]*"

    def t_error(self, t):
        raise ValueError("Illegal character '%s'" % t.value[0])

    def p_expression(self, p):
        """
        expression : name
                   | list
                   | tuple
        """
        p[0] = p[1]

    def p_name(self, p):
        "name : NAME"
        p[0] = str(p[1])

    def p_list(self, p):
        """
        list : LBRACKET RBRACKET
             | LBRACKET arglist RBRACKET
        """
        if len(p) == 3:
            p[0] = []
        elif len(p) == 4:
            p[0] = list(p[2])

    def p_tuple(self, p):
        """
        tuple : LPAREN RPAREN
              | LPAREN arglist RPAREN
        """
        if len(p) == 3:
            p[0] = tuple()
        elif len(p) == 4:
            p[0] = tuple(p[2])

    def p_arglist(self, p):
        """
        arglist : arglist COMMA expression
                | expression
        """
        if len(p) == 4:
            p[0] = p[1] + [p[3]]
        else:
            p[0] = [p[1]]

    def p_error(self, p):
        if p:
            raise ValueError(f"Syntax error at '{p.value}'")
        else:
            raise ValueError("Syntax error at EOF")

Then use it this way:

>>> from parser import Parser
>>> p = Parser()
>>> p.parse('[(a,b),(c,d),(e,f)]')
[('a', 'b'), ('c', 'd'), ('e', 'f')]

This should work for arbitrarily-nested inputs:

>>> p.parse('[(a,b),(c,d),([(e,f,g),h,i],j)]')
[('a', 'b'), ('c', 'd'), ([('e', 'f', 'g'), 'h', 'i'], 'j')]

And will give you a nice error if your string doesn't match the parsing rules:

>>> p.parse('[a,b,c)')
...
ValueError: Syntax error at ')'
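
If you would rather not depend on PLY, the same grammar is small enough to parse by hand. The following is a rough recursive-descent sketch using only the standard library, not a drop-in replacement for the class above; the names tokenize, parse and _parse_expression are made up for illustration. Unlike the PLY lexer above, it skips whitespace between tokens.

```python
import re

# A token is either a name or one of the punctuation characters [ ] ( ) ,
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[\[\](),]")
CLOSERS = {"[": "]", "(": ")"}


def tokenize(text):
    """Split text into a list of token strings, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"Illegal character {text[pos]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def _parse_expression(tokens, pos):
    """Parse one name, list or tuple starting at pos; return (value, new_pos)."""
    if pos >= len(tokens):
        raise ValueError("Syntax error at EOF")
    tok = tokens[pos]
    if tok in {",", "]", ")"}:
        raise ValueError(f"Syntax error at {tok!r}")
    if tok not in CLOSERS:
        return tok, pos + 1  # a bare name becomes a string

    closer = CLOSERS[tok]
    build = list if tok == "[" else tuple
    items = []
    pos += 1
    if pos < len(tokens) and tokens[pos] == closer:
        return build(items), pos + 1  # empty [] or ()
    while True:
        item, pos = _parse_expression(tokens, pos)
        items.append(item)
        if pos >= len(tokens):
            raise ValueError("Syntax error at EOF")
        if tokens[pos] == ",":
            pos += 1
        elif tokens[pos] == closer:
            return build(items), pos + 1
        else:
            raise ValueError(f"Syntax error at {tokens[pos]!r}")


def parse(text):
    """Parse the whole string; anything left over is a syntax error."""
    tokens = tokenize(text)
    value, pos = _parse_expression(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"Syntax error at {tokens[pos]!r}")
    return value


print(parse('[(a,b),(c,d),([(e,f,g),h,i],j)]'))
# [('a', 'b'), ('c', 'd'), ([('e', 'f', 'g'), 'h', 'i'], 'j')]
```

As with the PLY grammar, trailing commas are rejected, and a mismatched bracket such as '[a,b,c)' raises ValueError.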

Upvotes: 5

blhsing

Reputation: 106881

Since the input is actually valid Python code, you can properly parse it with tokenize.generate_tokens, and enclose each token in single quotes if it is a NAME token:

from tokenize import generate_tokens, NAME
from io import StringIO

file = StringIO('[(a,b),(c,d),(e,f)]')
output = ''.join(f"'{token}'" if token_type == NAME else token
                 for token_type, token, *_ in generate_tokens(file.readline))

output becomes the string:

[('a','b'),('c','d'),('e','f')]

Demo: https://repl.it/@blhsing/SecondAdmirableNormalform
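
Since the quoted result is still a string, one more step is needed to get an actual list. A minimal sketch, assuming the quoted text is a plain literal so that ast.literal_eval can evaluate it safely; the helper name quote_names is made up for illustration:

```python
import ast
from io import StringIO
from tokenize import generate_tokens, NAME


def quote_names(text):
    """Return text with every NAME token wrapped in single quotes."""
    readline = StringIO(text).readline
    return ''.join(
        f"'{token}'" if token_type == NAME else token
        for token_type, token, *_ in generate_tokens(readline)
    )


quoted = quote_names('[(a,b),(c,d),(e,f)]')
# literal_eval only evaluates literals, so no arbitrary code is run
result = ast.literal_eval(quoted)
print(result)
# [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

Note that keywords such as None or True are also NAME tokens, so they would come back as the strings 'None' and 'True'.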

Upvotes: 4

Swetank Poddar

Reputation: 1291

import re

s = '[(a,b),(c,d),(e,f)]'

listOfElements = []

for element in re.findall(r'\(.*?\)', s):
    element = element[1:-1].split(',')
    listOfElements.append((element[0],element[1]))

That's not a lot of splits/regex :D

Upvotes: 1
