Reputation: 19957
I have a string like the one below:
s = '[(a,b),(c,d),(e,f)]'
How can I convert it to the following list?
[('a','b'),('c','d'),('e','f')]
Please note that the elements in the string are not quoted.
I know it can be done with a lot of splits or a regex, but is there another way to evaluate it as a list?
Upvotes: 2
Views: 824
Reputation: 86433
A general solution to this would require implementing a parser, but your simple example can be solved with a regex and a list comprehension:
>>> import re
>>> [tuple(x.split(',')) for x in re.findall(r"\((.*?)\)", s)]
[('a', 'b'), ('c', 'd'), ('e', 'f')]
If you want to use Python's parser to do the parsing for you, you could do something like this:
>>> import ast
>>> parsed = ast.parse(s)
>>> [tuple(el.id for el in t.elts) for t in parsed.body[0].value.elts]
[('a', 'b'), ('c', 'd'), ('e', 'f')]
Though keep in mind both these approaches assume your input has a very particular structure.
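To loosen that assumption while still relying on Python's own parser, one option is to walk the parsed tree recursively instead of indexing into a fixed shape. The following is a minimal sketch, not part of the original answer; the helper names `names_to_data` and `parse_unquoted` are made up for illustration, and it assumes every element is a valid Python identifier (a keyword such as `if` would make `ast.parse` raise a `SyntaxError`).

```python
import ast


def names_to_data(node):
    """Recursively turn an AST made of bare names, lists and tuples into data."""
    if isinstance(node, ast.Name):
        # A bare identifier such as `a` becomes the string 'a'
        return node.id
    if isinstance(node, ast.List):
        return [names_to_data(el) for el in node.elts]
    if isinstance(node, ast.Tuple):
        return tuple(names_to_data(el) for el in node.elts)
    # Anything else (numbers, calls, operators, ...) is rejected
    raise ValueError(f"Unsupported syntax: {type(node).__name__}")


def parse_unquoted(s):
    # mode="eval" parses a single expression; the expression node is tree.body
    tree = ast.parse(s, mode="eval")
    return names_to_data(tree.body)


print(parse_unquoted('[(a,b),(c,d),(e,f)]'))
```

With this, nested input such as '[(a,b),([c,d],e)]' should come back as [('a', 'b'), (['c', 'd'], 'e')], and anything other than names, lists and tuples raises a ValueError.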
The most complete approach would be to implement a parser specific to the form of the input that you expect, using a tool like https://www.dabeaz.com/ply/
Here is an example: you can put this parsing code in a module named parser.py:
# parser.py
import os

import ply.lex as lex
import ply.yacc as yacc


class ParserBase:
    """
    Base class for a lexer/parser that has the rules defined as methods
    """

    def __init__(self, **kw):
        self.debug = kw.get("debug", 0)
        modname = (
            os.path.split(os.path.splitext(__file__)[0])[1]
            + "_"
            + self.__class__.__name__
        )
        self.debugfile = modname + ".dbg"
        self.tabmodule = modname + "_" + "parsetab"

        # Build the lexer and parser
        lex.lex(module=self, debug=self.debug)
        yacc.yacc(
            module=self,
            debug=self.debug,
            debugfile=self.debugfile,
            tabmodule=self.tabmodule,
        )

    def parse(self, expression):
        return yacc.parse(expression)


class Parser(ParserBase):
    tokens = (
        "NAME",
        "COMMA",
        "LPAREN",
        "RPAREN",
        "LBRACKET",
        "RBRACKET",
    )

    # Tokens
    t_COMMA = r","
    t_LPAREN = r"\("
    t_RPAREN = r"\)"
    t_LBRACKET = r"\["
    t_RBRACKET = r"\]"
    t_NAME = r"[a-zA-Z_][a-zA-Z0-9_]*"

    def t_error(self, t):
        raise ValueError("Illegal character '%s'" % t.value[0])

    def p_expression(self, p):
        """
        expression : name
                   | list
                   | tuple
        """
        p[0] = p[1]

    def p_name(self, p):
        "name : NAME"
        p[0] = str(p[1])

    def p_list(self, p):
        """
        list : LBRACKET RBRACKET
             | LBRACKET arglist RBRACKET
        """
        if len(p) == 3:
            p[0] = []
        elif len(p) == 4:
            p[0] = list(p[2])

    def p_tuple(self, p):
        """
        tuple : LPAREN RPAREN
              | LPAREN arglist RPAREN
        """
        if len(p) == 3:
            p[0] = tuple()
        elif len(p) == 4:
            p[0] = tuple(p[2])

    def p_arglist(self, p):
        """
        arglist : arglist COMMA expression
                | expression
        """
        if len(p) == 4:
            p[0] = p[1] + [p[3]]
        else:
            p[0] = [p[1]]

    def p_error(self, p):
        if p:
            raise ValueError(f"Syntax error at '{p.value}'")
        else:
            raise ValueError("Syntax error at EOF")
Then use it this way:
>>> from parser import Parser
>>> p = Parser()
>>> p.parse('[(a,b),(c,d),(e,f)]')
[('a', 'b'), ('c', 'd'), ('e', 'f')]
This should work for arbitrarily nested inputs:
>>> p.parse('[(a,b),(c,d),([(e,f,g),h,i],j)]')
[('a', 'b'), ('c', 'd'), ([('e', 'f', 'g'), 'h', 'i'], 'j')]
And will give you a nice error if your string doesn't match the parsing rules:
>>> p.parse('[a,b,c)')
...
ValueError: Syntax error at ')'
Upvotes: 5
Reputation: 106881
Since the input is actually valid Python code, you can properly parse it with tokenize.generate_tokens, and enclose each token in single quotes if it is a NAME token:
from tokenize import generate_tokens, NAME
from io import StringIO
file = StringIO('[(a,b),(c,d),(e,f)]')
output = ''.join(f"'{token}'" if token_type == NAME else token
                 for token_type, token, *_ in generate_tokens(file.readline))
output becomes:
[('a','b'),('c','d'),('e','f')]
Demo: https://repl.it/@blhsing/SecondAdmirableNormalform
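Note that `output` above is still a string. To get an actual list, the quoted text can be handed to `ast.literal_eval`. A minimal sketch of the full round trip, with the hypothetical helper name `quote_names` introduced only for this example:

```python
import ast
from io import StringIO
from tokenize import generate_tokens, NAME


def quote_names(source):
    """Return source with every NAME token wrapped in single quotes."""
    readline = StringIO(source).readline
    return ''.join(
        f"'{token}'" if token_type == NAME else token
        for token_type, token, *_ in generate_tokens(readline)
    )


quoted = quote_names('[(a,b),(c,d),(e,f)]')
# literal_eval only accepts Python literals, so it is safe on untrusted text
result = ast.literal_eval(quoted)
print(result)
```

Because keywords such as `None` or `True` are also NAME tokens, this sketch would turn them into the strings 'None' and 'True' rather than the corresponding constants.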
Upvotes: 4
Reputation: 1291
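import re

s = '[(a,b),(c,d),(e,f)]'
listOfElements = []
# Each match is one parenthesised group, e.g. '(a,b)'
for element in re.findall(r'\(.*?\)', s):
    # Drop the surrounding parentheses, then split on the comma
    element = element[1:-1].split(',')
    listOfElements.append((element[0], element[1]))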
That's not a lot of splits/regex :D
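If the groups may contain spaces or more than two elements, the same idea can be sketched as below. This is only an illustration under the assumption that the parentheses are not nested; the function name `parse_groups` is made up here:

```python
import re


def parse_groups(s):
    """Return one tuple per non-nested parenthesised group found in s."""
    # Capture the text between each '(' and the next ')'
    groups = re.findall(r'\(([^()]*)\)', s)
    # Split each group on commas and trim surrounding whitespace
    return [tuple(part.strip() for part in group.split(',')) for group in groups]


print(parse_groups('[(a, b), (c,d,e)]'))
```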
Upvotes: 1