Naya Keto

Reputation: 123

Extracting List from Within a String in Python

I am trying to work with the Cornell movie dataset to create a chatbot. Here is the format of the list of strings that I want to extract from, saved as conv_lines:

["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']"] 

I am trying to create the following list from the above list of strings by extracting the list inside each string.

[['L194', 'L195', 'L196', 'L197'],
 ['L198', 'L199'],
 ['L200', 'L201', 'L202', 'L203']]

I found this code but don't understand how it works. Could someone please explain it?

convs = [ ]
for line in conv_lines[:-1]:
    _line = line.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(" ","")
    convs.append(_line.split(','))

I don't understand why [:-1] is used in the for statement, or what the [-1][1:-1] after the split does.

Upvotes: 1

Views: 145

Answers (3)

reka18

Reputation: 8417

To answer your question, it helps to know the context. Fortunately I know exactly what the context is, because I took the same Udemy course you are taking. ;)

convs = []
for line in conv_lines[:-1]:
    _line = line.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(" ","")
    convs.append(_line.split(','))

for items in some_list[:-1] means you are iterating through the list up to, but excluding, the last item in that list.

For example:

l = [1,2,3,4]
for i in l[:-1]:
    print(i)
Out[ ]:
1
2
3

Now for what that means for the code you posted. In the for statement you are looping over every line in conv_lines except the last one, so the last item is presumably junk that the rest of the code cannot use. Don't take my word for it. Check it. What does print(conv_lines[-1]) show you?
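One likely reason the last item is unusable, assuming conv_lines was built by reading the file and splitting it on newlines (the question does not show that step): a file that ends with a newline leaves an empty string at the end of the list. A minimal sketch, with a made-up raw string standing in for the file contents:

```python
# Hypothetical raw text standing in for the dataset file; note the trailing newline.
raw = (
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195']\n"
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']\n"
)

conv_lines = raw.split("\n")

print(len(conv_lines))       # 3: two real lines plus one empty string
print(repr(conv_lines[-1]))  # '' (the leftover after the final newline)

# Slicing with [:-1] drops that empty leftover.
usable = conv_lines[:-1]
print(len(usable))           # 2
```

If your conv_lines[-1] turns out to be a real line instead, the [:-1] would be silently discarding data, which is why it is worth checking.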

Now for the indexing after the split, [-1] and [1:-1]. Try breaking it down first by working with only one line from your raw data.

line = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"
convs = []
_line = line.split(' +++$+++ ')[-1] # notice I truncated after this.
convs.append(_line.split(','))

What does this return?

convs
Out[ ]:
[["['L194'", " 'L195'", " 'L196'", " 'L197']"]]

And how about now?

convs = []
_line = line.split(' +++$+++ ')[-1][1:-1] # truncated again, but after adding back a bit.
convs.append(_line.split(','))

And what does this return?

convs
Out[ ]:
[["'L194'", " 'L195'", " 'L196'", " 'L197'"]]

Keep going.

convs = []
_line = line.split(' +++$+++ ')[-1][1:-1].replace("'","") # truncated less
convs.append(_line.split(','))

Returns:

convs
Out[ ]:
[['L194', ' L195', ' L196', ' L197']]

And finally:

convs = []
_line = line.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(" ","")
convs.append(_line.split(','))

Returns what you need for the rest of the code provided by the superdatascience guys:

convs
Out[ ]:
[['L194', 'L195', 'L196', 'L197']]

Keep in mind that this example works with only one line. With the for loop you will be populating the convs list with many such lists of line IDs, one per conversation. Does that help?
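Putting the steps back together, here is a sketch of the full loop over the three sample lines from the question. The intermediate names (last_field, inner, cleaned) are only for illustration, and the empty string at the end of conv_lines is an assumption about what the real data looks like:

```python
conv_lines = [
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
    "",  # assumed empty trailing entry, which [:-1] skips
]

convs = []
for line in conv_lines[:-1]:
    last_field = line.split(" +++$+++ ")[-1]           # "['L194', 'L195', ...]"
    inner = last_field[1:-1]                           # drop the surrounding [ and ]
    cleaned = inner.replace("'", "").replace(" ", "")  # "L194,L195,..."
    convs.append(cleaned.split(","))                   # ['L194', 'L195', ...]

print(convs)
# [['L194', 'L195', 'L196', 'L197'], ['L198', 'L199'], ['L200', 'L201', 'L202', 'L203']]
```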

Upvotes: 1

Rakesh

Reputation: 82785

  • Using re to find content between []
  • Using ast.literal_eval to get list object

Demo:

import re
import ast
data = ["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']"]

res = []
for i in data:
    val = re.findall(r"\[.*?\]", i)[0]
    res.append(ast.literal_eval(val))
print(res)

Output:

[['L194', 'L195', 'L196', 'L197'], ['L198', 'L199'], ['L200', 'L201', 'L202', 'L203']]

Upvotes: 1

Ajax1234

Reputation: 71461

You can use ast.literal_eval and re:

import re, ast
d = ["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']","u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']", "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']"]
new_d = [ast.literal_eval(re.findall(r'\[[\w\W]+\]', i)[0]) for i in d]  # raw string avoids invalid-escape warnings

Output:

[['L194', 'L195', 'L196', 'L197'], ['L198', 'L199'], ['L200', 'L201', 'L202', 'L203']]
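If the list is always the last ' +++$+++ '-separated field, as it is in the sample lines, the regex may not be needed at all. A possible variant, assuming that layout holds for every line, is to split once from the right and hand the last field to ast.literal_eval:

```python
import ast

d = [
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
]

# rsplit(sep, 1) splits only at the last separator, so [-1] is the list text.
new_d = [ast.literal_eval(line.rsplit(" +++$+++ ", 1)[-1]) for line in d]

print(new_d)
# [['L194', 'L195', 'L196', 'L197'], ['L198', 'L199'], ['L200', 'L201', 'L202', 'L203']]
```

Note that ast.literal_eval raises an error on an empty or malformed field, so any junk lines would need to be filtered out first.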

Upvotes: 1
