Reputation: 182
I have multiple documents, approximately 400 GB in total, that I want to convert to JSON format in order to load them into Elasticsearch for analysis.
Each file is approximately 200 MB.
The original files look like this:
IUGJHHGF@BERLIN:lhfrjy
0t7yfudf@WARSAW:qweokm246
0t7yfudf@CRACOW:Er747474
0t7yfudf@cracow:kui666666
000t7yf@Vienna:1йй2ц2й2цй2цц3у
The files contain characters that are not only English. key1 is always separated by @, and the city is separated from the description by either ; or :.
I have parsed it with this code:
#!/usr/bin/env python
# coding: utf8
import json
with open('2') as f:
    for line in f:
        s1 = line.find("@")
        rest = line[s1+1:]
        if rest.find(";") != -1:
            if rest.find(":") != -1:
                print "FOUND BOTH : ; "
                s2 = -0
            else:
                s2 = s1+1+rest.find(";")
        elif rest.find(":") != -1:
            s2 = s1+1+rest.find(":")
        else:
            print "FOUND NO : ; "
            s2 = -0
        key1 = line[:s1]
        city = line[s1+1:s2]
        description = line[s2+1:len(line)-1]
After that parsing, the whole file looks like:
RRS12345 Cracow Sunflowers
RRD12345 Berin Data
From that, I want the output to be:
{
    "location_data": [
        {
            "key1": "RRS12345",
            "city": "Cracow",
            "description": "Sunflowers"
        },
        {
            "key1": "RRD123dsd45",
            "city": "Berlin",
            "description": "Data"
        },
        {
            "key1": "RRD123dsds45",
            "city": "Berlin",
            "description": "1йй2ц2й2цй2цц3у"
        }
    ]
}
How can I quickly convert it to the required JSON format, given that the data is not limited to English characters?
Upvotes: 0
Views: 8314
Reputation: 76
import json

def process_text_to_json():
    location_data = []
    with open("file.txt") as f:
        for line in f:
            line = line.split()
            location_data.append({"key1": line[0], "city": line[1], "description": line[2]})
    location_data = {"location_data": location_data}
    return json.dumps(location_data)
Output sample:
{"location_data": [{"city": "Cracow", "key1": "RRS12345", "description": "Sunflowers"}, {"city": "Berin", "key1": "RRD12345", "description": "Data"}, {"city": "Cracow2", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin2", "key1": "RRD12346", "description": "Data"}, {"city": "Cracow3", "key1": "RRS12346", "description": "Sunflowers"}, {"city": "Berin3", "key1": "RRD12346", "description": "Data"}]}
Upvotes: 3
Reputation: 82765
Iterate over each line and form your dict.
Ex:
d = {"location_data": []}
with open(filename, "r") as infile:
    for line in infile:
        val = line.split()
        d["location_data"].append({"key1": val[0], "city": val[1], "description": val[2]})
print(d)
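Note that this (like the other answer) assumes the input has already been converted to the whitespace-separated form. To parse the original key1@CITY:description / key1@CITY;description lines directly, a minimal sketch (the file names here are placeholders) could be:
import json
import re

records = []
with open("original.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue
        # key1 is everything before the first '@'.
        key1, rest = line.split("@", 1)
        # The city and the description are separated by either ':' or ';'.
        city, description = re.split("[:;]", rest, maxsplit=1)
        records.append({"key1": key1, "city": city, "description": description})

# ensure_ascii=False keeps non-English (e.g. Cyrillic) characters readable.
print(json.dumps({"location_data": records}, ensure_ascii=False, indent=2))
Malformed lines (missing @ or a separator) would need their own handling.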
Upvotes: 0