Reputation: 647
I have a Python list like the one below:
A = ['"','<bos>', 'What', 'colour', 'is', 'the', 'sky','<spec_token>' ,'(A)', 'red','<spec_token>', '(B)', 'blue', '<spec_token>','(C)', 'yellow','<eos>', '"']
For list A, what is the easiest way to do the following?
1. Remove the double quotation marks from the list, i.e. get:
A_new = ['<bos>', 'What', 'colour', 'is', 'the', 'sky','<spec_token>' ,'(A)', 'red','<spec_token>', '(B)', 'blue', '<spec_token>','(C)', 'yellow','<eos>']
2. Split A into 3 lists, one for each multiple-choice option, i.e. the output should be like below:
A_new_1 = ['<bos>', 'What', 'colour', 'is', 'the', 'sky','<spec_token>' ,'(A)', 'red']
A_new_2 = ['<bos>', 'What', 'colour', 'is', 'the', 'sky','<spec_token>' ,'(B)', 'blue']
A_new_3 = ['<bos>', 'What', 'colour', 'is', 'the', 'sky','<spec_token>' ,'(C)', 'yellow']
In my example, the ultimate goal is to get the lists A_new_1, A_new_2 and A_new_3.
I am currently writing a Python function to achieve this, and my code so far is the following:
# 2. for GPT2MCHeadModel (ARC, openbookQA)
def GPT2MCHeadModel_data_manipulator(file_path):
    # read the first line; the context manager closes the file afterwards
    with open(file_path, "r") as f:
        ln = f.readline()
    ln = ln.replace('"', '')  # remove unnecessary quotation marks from the raw text file.
    ln_split = ln.split()
    # insert appropriate tokens into the raw text files before processing them in GPT2MCHeads model.
    ln_split.insert(0, "<bos>")
    ln_split.append("<eos>")  # append, so that <eos> lands after the last token
    # put an <mcOption> marker in front of every option label that is present
    for label in ("(A)", "(B)", "(C)", "(D)"):
        if label in ln_split:
            ln_split.insert(ln_split.index(label), "<mcOption>")
    return ln_split
However, I am not sure how to split the contents into 3 separate lists, one for each multiple-choice option.
Thank you,
Upvotes: 0
Views: 153
Reputation: 14546
Try the following:
A = ['"','<bos>', 'What', 'colour', 'is', 'the', 'sky','<spec_token>' ,'(A)', 'red','<spec_token>', '(B)', 'blue', '<spec_token>','(C)', 'yellow','<eos>', '"']
# Problem 1
A = [x for x in A if x != '"']
i = A.index("<spec_token>")
c = A.count("<spec_token>")
# Problem 2
output = [A[:i] + A[i+j*3:i+j*3+3] for j in range(c)]
Output
>>> A
['<bos>', 'What', 'colour', 'is', 'the', 'sky', '<spec_token>', '(A)', 'red', '<spec_token>', '(B)', 'blue', '<spec_token>', '(C)', 'yellow', '<eos>']
>>> output
[['<bos>', 'What', 'colour', 'is', 'the', 'sky', '<spec_token>', '(A)', 'red'],
['<bos>', 'What', 'colour', 'is', 'the', 'sky', '<spec_token>', '(B)', 'blue'],
['<bos>', 'What', 'colour', 'is', 'the', 'sky', '<spec_token>', '(C)', 'yellow']]
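Note that the slicing above assumes every option is exactly three tokens long (the separator, the label and a one-word answer). If some answers can span several words, a loop that starts a new list at each separator may be more robust. Below is a minimal sketch of that idea; the function name `split_options` and its `sep` and `eos` parameters are my own, not part of the question, and it assumes `<eos>` should be dropped as in the expected output.

```python
def split_options(tokens, sep="<spec_token>", eos="<eos>"):
    """Return one list per option: the shared question prefix plus that option."""
    # Drop stray quotation-mark tokens and the end-of-sequence marker.
    cleaned = [t for t in tokens if t not in ('"', eos)]
    if sep not in cleaned:
        # No options found: return the cleaned tokens as a single list.
        return [cleaned]
    first = cleaned.index(sep)
    prefix = cleaned[:first]  # everything before the first separator
    results = []
    for tok in cleaned[first:]:
        if tok == sep:
            # A separator starts a new option list that begins with the prefix.
            results.append(prefix + [sep])
        else:
            # Any other token belongs to the most recently started option.
            results[-1].append(tok)
    return results


A = ['"', '<bos>', 'What', 'colour', 'is', 'the', 'sky', '<spec_token>', '(A)', 'red',
     '<spec_token>', '(B)', 'blue', '<spec_token>', '(C)', 'yellow', '<eos>', '"']
A_new_1, A_new_2, A_new_3 = split_options(A)
```

For the list in the question this should give the same three lists as the slicing version, while an option such as `'(A)', 'dark', 'red'` would stay together in one list.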
Upvotes: 1