Reputation: 123
I am trying to work with the Cornell movie dataset to create a chatbot. Here is the format of the list of strings that I want to extract from, saved as conv_lines:
["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
"u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
"u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']"]
I am trying to create the following list from the above list of strings by extracting the list inside each string.
[['L194', 'L195', 'L196', 'L197'],
['L198', 'L199'],
['L200', 'L201', 'L202', 'L203']]
I found this code but don't understand how it works. Would someone please explain?
convs = [ ]
for line in conv_lines[:-1]:
    _line = line.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(" ","")
    convs.append(_line.split(','))
I don't understand why [:-1] was used in the for statement, or what the [-1][1:-1] after the split does.
Upvotes: 1
Views: 145
Reputation: 8417
To understand your question, it helps to know the context. Fortunately I know the context exactly, because I took the same Udemy course you are taking. ;)
convs = []
for line in conv_lines[:-1]:
    _line = line.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(" ","")
    convs.append(_line.split(','))
for items in some_list[:-1] generally means you are iterating through the list up to, but not including, its last item.
For example:
l = [1,2,3,4]
for i in l[:-1]:
    print(i)
Out[ ]:
1
2
3
Now for what that means for the code you posted. In the for statement you are looping over every line in conv_lines except the last one, so the last item must be trash of no use. Don't take my word for it. Check it. What does print(conv_lines[-1]) show you?
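Here is a minimal sketch of one likely reason the last item is junk. It assumes conv_lines was built by reading the file and splitting on '\n', and that the file ends with a newline. The variable raw is a made-up stand-in for the file contents, not the real data.

```python
# Hypothetical stand-in for the text of the conversations file,
# assumed here to end with a trailing newline.
raw = (
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195']\n"
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']\n"
)

conv_lines = raw.split('\n')

# Splitting on '\n' leaves an empty string after the final newline,
# so the last element carries no conversation data.
print(repr(conv_lines[-1]))
print(len(conv_lines), len(conv_lines[:-1]))
```

If your own conv_lines[-1] turns out to be a real conversation line, you would not want the [:-1] slice.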
Now for the [-1][1:-1] after the split. Try breaking it down by working with only one line from your raw data.
line = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"
convs = []
_line = line.split(' +++$+++ ')[-1] # notice I truncated after this.
convs.append(_line.split(','))
What does this return?
convs
Out[ ]:
[["['L194'", " 'L195'", " 'L196'", " 'L197']"]]
And how about now?
convs = []
_line = line.split(' +++$+++ ')[-1][1:-1] # truncated again, but after adding back a bit.
convs.append(_line.split(','))
And what does this return?
convs
Out[ ]:
[["'L194'", " 'L195'", " 'L196'", " 'L197'"]]
Keep going.
convs = []
_line = line.split(' +++$+++ ')[-1][1:-1].replace("'","") # truncated less
convs.append(_line.split(','))
Returns:
convs
Out[ ]:
[['L194', ' L195', ' L196', ' L197']]
And finally:
convs = []
_line = line.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(" ","")
convs.append(_line.split(','))
Returns what you need for the rest of the code provided by the superdatascience guys:
convs
Out[ ]:
[['L194', 'L195', 'L196', 'L197']]
Keep in mind that this example works with only one line. With the for loop you will be populating the convs list with many more lists of line-ID strings such as 'L194'. Does that help?
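Putting the steps together, the loop can be sketched on the three sample lines from the question. The intermediate names last_field, inner and cleaned are only there for illustration. The [:-1] slice is left out because this small sample has no junk item at the end.

```python
conv_lines = [
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
]

convs = []
for line in conv_lines:
    # [-1] keeps the last field after splitting on the separator.
    last_field = line.split(' +++$+++ ')[-1]
    # [1:-1] drops the first and last characters, i.e. the [ and ].
    inner = last_field[1:-1]
    # Remove the quotes and spaces, leaving only IDs and commas.
    cleaned = inner.replace("'", "").replace(" ", "")
    convs.append(cleaned.split(','))

print(convs)
```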
Upvotes: 1
Reputation: 82785
Use re to find the content between [] and ast.literal_eval to get a list object.
Demo:
import re
import ast
data = ["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
"u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
"u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']"]
res = []
for i in data:
    val = re.findall(r"\[.*?\]", i)[0]
    res.append(ast.literal_eval(val))
print(res)
Output:
[['L194', 'L195', 'L196', 'L197'], ['L198', 'L199'], ['L200', 'L201', 'L202', 'L203']]
Upvotes: 1
Reputation: 71461
You can use ast.literal_eval and re:
import re, ast
d = ["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']","u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']", "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']"]
new_d = [ast.literal_eval(re.findall(r'\[[\w\W]+\]', i)[0]) for i in d]  # raw string avoids invalid-escape warnings
Output:
[['L194', 'L195', 'L196', 'L197'], ['L198', 'L199'], ['L200', 'L201', 'L202', 'L203']]
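As a variation, the regex can be skipped when the list is always the last ' +++$+++ '-separated field, as it is in the sample lines. This is only a sketch under that assumption. parse_conv_line is a hypothetical helper name, and the if i.strip() filter is added to skip blank lines.

```python
import ast

def parse_conv_line(line):
    """Return the list of line IDs stored in the last field of a line."""
    last_field = line.split(' +++$+++ ')[-1]
    # literal_eval safely parses the text "['L194', ...]" into a real list.
    return ast.literal_eval(last_field)

d = [
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
    "",  # blank line, like the junk item the [:-1] slice was skipping
]

new_d = [parse_conv_line(i) for i in d if i.strip()]
print(new_d)
```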
Upvotes: 1