Reputation: 976
I am new to python and I want to convert a text file into json file. Here's how it looks like:
#Q Three of these animals hibernate. Which one does not?
^ Sloth
A Mouse
B Sloth
C Frog
D Snake
#Q What is the literal translation of the Greek word Embioptera, which denotes an order of insects, also known as webspinners?
^ Lively wings
A Small wings
B None of these
C Yarn knitter
D Lively wings
#Q There is a separate species of scorpions which have two tails, with a venomous sting on each tail.
^ False
A True
B False
Contd
.
.
.
.
^
means Answer.
I want it in json format as shown below. Example:
{
"questionBank": [
{
"question": "Grand Central Terminal, Park Avenue, New York is the worlds",
"a": "largest railway station",
"b": "Longest railway station",
"c": "highest railway station",
"d": "busiest railway station",
"answer": "largest railway station"
},
{
"question": "Eritrea, which became the 182nd member of the UN in 1993, is in the continent of",
"a": "Asia",
"b": "Africa",
"c": "Europe",
"d": "Oceania",
"answer": "Africa"
}, Contd.....
]
}
I came across a few similar posts and here's what I have tried:
dataset = "file.txt"
data = []
with open(dataset) as ds:
for line in ds:
line = line.strip().split(",")
print(line)
To which the output is:
['']
['#Q What part of their body do the insects from order Archaeognatha use to spring up into the air?']
['^ Tail']
['A Antennae']
['B Front legs']
['C Hind legs']
['D Tail']
['']
['#Q What is the literal translation of the Greek word Embioptera', ' which denotes an order of insects', ' also known as webspinners?']
['^ Lively wings']
['A Small wings']
['B None of these']
['C Yarn knitter']
['D Lively wings']
['']
Contd....
The sentences containing commas are separated by python lists. I tried to use .join but didn't get the results I was expecting.
Please let me know how to approach this.
Upvotes: 3
Views: 936
Reputation: 680
Rather than handling the lines one at a time, I went with using a regex pattern approach.
This also more reliable as it will error out if the input data is in a bad format - rather than silently ignoring a grouping which is missing a field.
PATTERN = r"""[#]Q (?P<question>.+)\n\^ (?P<answer>.+)\nA (?P<option_a>.+)\nB (?P<option_b>.+)\n(?:C (?P<option_c>.+)\n)?(?:D (?P<option_d>.+))?"""
def parse_qa_group(qa_group):
"""
Extact question, answer and 2 to 4 options from input string and return as a dict.
"""
# "group" here is a set of question, answer and options.
matches = PATTERN.search(qa_group)
# "group" here is a regex group.
question = matches.group('question')
answer = matches.group('answer')
try:
c = matches.group('option_c')
except IndexError:
c = None
try:
d = matches.group('option_d')
except IndexError:
d = None
results = {
"question": question,
"answer": answer,
"a": matches.group('option_a'),
"b": matches.group('option_b')
}
if c:
results['c'] = c
if d:
results['d'] = d
return results
# Split into groups using the blank line.
qa_groups = question_answer_str.split('\n\n')
# Process each group, building up a list of all results.
all_results = [parse_qa_group(qa_group) for qa_group in qa_groups]
print(json.dumps(all_results, indent=4))
Further details in my gist. Read more on regex Grouping
I left out reading the text and writing a JSON file.
Upvotes: 1
Reputation: 61
dataset = "text.txt"
question_bank = []
with open(dataset) as ds:
for i, line in enumerate(ds):
line = line.strip("\n")
if len(line) == 0:
question_bank.append(question)
question = {}
elif line.startswith("#Q"):
question = {"question": line}
elif line.startswith("^"):
question['answer'] = line.split(" ")[1]
else:
key, val = line.split(" ", 1)
question[key] = val
question_bank.append(question)
print({"questionBank":question_bank})
#for storing json file to local directory
final_output = {"questionBank":question_bank}
with open("output.json", "w") as outfile:
outfile.write(json.dumps(final_output, indent=4))
Upvotes: 4