Reputation: 331
I am looking for an approach in Python or a Unix command to shuffle a large text data set while keeping lines grouped by the value of their first field, like below:
Input Text:
"ABC", 21, 15, 45
"DEF", 35, 3, 35
"DEF", 124, 33, 5
"QQQ" , 43, 54, 35
"XZZ", 43, 35 , 32
"XZZ", 45 , 35, 32
The output should be randomly shuffled, but lines that belong to the same group should stay together, like below:
Output Sample-
"QQQ" , 43, 54, 35
"XZZ", 43, 35 , 32
"XZZ", 45 , 35, 32
"ABC", 21, 15, 45
"DEF", 35, 3, 35
"DEF", 124, 33, 5
I found solutions for ordinary shuffling, but I can't figure out how to keep the groups together while shuffling.
Upvotes: 1
Views: 328
Reputation: 26315
You could also store each line from the file into a nested list:
lines = []
with open('input_text.txt') as in_file:
    for line in in_file.readlines():
        line = [x.strip() for x in line.strip().split(',')]
        lines.append(line)
Which gives:
[['"ABC"', '21', '15', '45'], ['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5'], ['"QQQ"', '43', '54', '35'], ['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']]
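Then you could group these lists by their first item with itertools.groupby(). Note that groupby() only groups consecutive items, so this relies on lines with the same first item already being adjacent, as they are in your input: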
import itertools
from operator import itemgetter
grouped = [list(g) for _, g in itertools.groupby(lines, key = itemgetter(0))]
Which gives a list of your grouped items:
[[['"ABC"', '21', '15', '45']], [['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5']], [['"QQQ"', '43', '54', '35']], [['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']]]
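One caveat with this step: itertools.groupby() starts a new group every time the key changes, so it only works as intended when equal first items sit next to each other. A minimal sketch of what happens otherwise, and how sorting by the same key first avoids it (the rows below are made up for illustration and are not from the question):

```python
import itertools
from operator import itemgetter

# Hypothetical rows where the two "DEF" rows are NOT adjacent.
rows = [['"DEF"', '35'], ['"ABC"', '21'], ['"DEF"', '124']]

# groupby() starts a new group whenever the key changes,
# so the two "DEF" rows land in separate groups.
unsorted_groups = [list(g) for _, g in itertools.groupby(rows, key=itemgetter(0))]
print(len(unsorted_groups))  # 3

# Sorting by the same key first brings equal keys together.
rows_sorted = sorted(rows, key=itemgetter(0))
sorted_groups = [list(g) for _, g in itertools.groupby(rows_sorted, key=itemgetter(0))]
print(len(sorted_groups))  # 2
```

If the input is already grouped, as in the question, the sort can be skipped.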
Then you could shuffle this with random.shuffle():
import random
random.shuffle(grouped)
Which gives a randomized list of your grouped items intact:
[[['"QQQ"', '43', '54', '35']], [['"ABC"', '21', '15', '45']], [['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']], [['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5']]]
And now all you have to do is flatten the final list and write it to a new file, which you can do with itertools.chain.from_iterable():
with open('output_text.txt', 'w') as out_file:
    for line in itertools.chain.from_iterable(grouped):
        out_file.write(', '.join(line) + '\n')
print(open('output_text.txt').read())
Which gives a new shuffled version of your file:
"QQQ", 43, 54, 35
"ABC", 21, 15, 45
"XZZ", 43, 35, 32
"XZZ", 45, 35, 32
"DEF", 35, 3, 35
"DEF", 124, 33, 5
Upvotes: 2
Reputation: 553
It is possible to do it using collections.defaultdict. By keying each line on its first field you can collect the lines into groups and then shuffle only the dictionary's keys, like so:
import random
from collections import defaultdict
# Read all the lines from the file, grouped by their first field
lines = defaultdict(list)
with open("/path/to/file", "r") as in_file:
    for line in in_file:
        s_line = line.split(",")
        # strip() so that '"QQQ" ' and '"QQQ"' count as the same key
        lines[s_line[0].strip()].append(line)
# Randomize the order of the keys
# (random.sample() needs a sequence, so convert the keys to a list)
rnd_keys = random.sample(list(lines), len(lines))
# Write back to the file?
with open("/path/to/file", "w") as out_file:
    for k in rnd_keys:
        for line in lines[k]:
            out_file.write(line)
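The same idea can be wrapped in a function that takes any iterable of lines, which makes it easy to try out without touching a file. This is only a sketch: the name shuffle_groups and the seed parameter are my own additions for illustration, and the seed is there just so a run can be reproduced.

```python
import random
from collections import defaultdict


def shuffle_groups(lines, seed=None):
    """Return the lines with whole groups in random order.

    Lines are grouped by their first comma-separated field.
    Hypothetical helper for illustration, not part of any library.
    """
    groups = defaultdict(list)
    for line in lines:
        key = line.split(",", 1)[0].strip()
        groups[key].append(line)

    keys = list(groups)                # one entry per distinct first field
    random.Random(seed).shuffle(keys)  # seed=None gives a fresh random order

    # Emit each group's lines together, in their original relative order.
    return [line for key in keys for line in groups[key]]


sample = [
    '"ABC", 21, 15, 45\n',
    '"DEF", 35, 3, 35\n',
    '"DEF", 124, 33, 5\n',
    '"QQQ" , 43, 54, 35\n',
]
shuffled = shuffle_groups(sample, seed=0)
print("".join(shuffled), end="")
```

Because the groups are collected in a dictionary, lines with the same first field end up together even if they were not adjacent in the input.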
Hope this helps in your endeavor.
Upvotes: 3