Reputation: 4808
I'm using the nltk library's movie_reviews corpus, which contains a large number of documents. My task is to measure predictive performance on these reviews with and without pre-processing of the data. The problem is that the lists documents and documents2 contain the same documents, and I need to shuffle them so that both lists end up in the same order. I cannot shuffle them separately, because each shuffle produces a different order. I need to shuffle both at once, with the same order, because I compare them at the end (the comparison depends on order). I'm using Python 2.7.
Example (in reality the strings are tokenized, but that is not relevant here):
documents = [(['plot : two teen couples go to a church party , '], 'neg'),
(['drink and then drive . '], 'pos'),
(['they get into an accident . '], 'neg'),
(['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
(['drink then drive . '], 'pos'),
(['they get accident . '], 'neg'),
(['one guys dies'], 'neg')]
And I need to get a result like this after shuffling both lists:
documents = [(['one of the guys dies'], 'neg'),
(['they get into an accident . '], 'neg'),
(['drink and then drive . '], 'pos'),
(['plot : two teen couples go to a church party , '], 'neg')]
documents2 = [(['one guys dies'], 'neg'),
(['they get accident . '], 'neg'),
(['drink then drive . '], 'pos'),
(['plot two teen couples church party'], 'neg')]
I have this code:
def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    clean = [token.lower() for token in doc if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
documents2 = [(list(cleanDoc(movie_reviews.words(fileid))), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
random.shuffle( and here shuffle documents and documents2 with same order) # or somehow
Upvotes: 131
Views: 128134
Reputation: 32189
You can do it as:
import random
a = ['a', 'b', 'c']
b = [1, 2, 3]
c = list(zip(a, b))
random.shuffle(c)
a, b = zip(*c)
print(a)
print(b)
[OUTPUT]
('a', 'c', 'b')
(1, 3, 2)
Note that zip(*c) returns tuples; wrap a and b in list() if you need lists again. Of course, this is an example with simpler lists, but the adaptation is the same for your case.
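Adapted to the question's lists, the same idea might look like the sketch below. The two short lists stand in for the real tokenized reviews, and the names pairs, shuffled and shuffled2 are only illustrative. Each element of pairs holds the raw and the cleaned version of the same review, so the two always move together.

```python
import random

documents = [(['plot : two teen couples go to a church party , '], 'neg'),
             (['drink and then drive . '], 'pos'),
             (['they get into an accident . '], 'neg'),
             (['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
              (['drink then drive . '], 'pos'),
              (['they get accident . '], 'neg'),
              (['one guys dies'], 'neg')]

# Pair each raw review with its cleaned counterpart, then shuffle the pairs.
pairs = list(zip(documents, documents2))
random.shuffle(pairs)

# Unzip again; zip() yields tuples, so convert back to lists.
shuffled, shuffled2 = (list(t) for t in zip(*pairs))
```

The original documents and documents2 are left untouched, which can be handy if the unshuffled order is still needed elsewhere.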
Upvotes: 319
Reputation: 153
This works as well:
import numpy as np
a = ['a', 'b', 'c']
b = [1, 2, 3]
rng = np.random.default_rng()
state = rng.bit_generator.state
rng.shuffle(a)
# use same seeds for a & b!
rng.bit_generator.state = state # set state to same state as before
rng.shuffle(b)
print(a)
print(b)
Output:
['b', 'a', 'c']
[2, 1, 3]
Upvotes: 0
Reputation: 1497
You can store a shuffled order of the indices in a variable, then reorder both arrays with it:
import random
array1 = [1, 2, 3, 4, 5]
array2 = ["one", "two", "three", "four", "five"]
order = list(range(len(array1)))  # list() so that shuffle also works on Python 3
random.shuffle(order)
newarray1 = []
newarray2 = []
for i in order:
    # the same index is used for both lists, so pairs stay aligned
    newarray1.append(array1[i])
    newarray2.append(array2[i])
print(newarray1, newarray2)
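The same index-based idea can be wrapped in a small helper. This is only a sketch: shuffle_together is a made-up name, and it assumes all lists have the same length. random.sample(range(n), n) returns a randomly ordered list of the indices on both Python 2.7 and Python 3.

```python
import random

def shuffle_together(*lists):
    """Return shuffled copies of the lists, all reordered by one permutation."""
    n = len(lists[0])
    if any(len(l) != n for l in lists):
        raise ValueError("all lists must have the same length")
    order = random.sample(range(n), n)  # one random permutation of the indices
    return [[l[i] for i in order] for l in lists]

array1 = [1, 2, 3, 4, 5]
array2 = ["one", "two", "three", "four", "five"]
new1, new2 = shuffle_together(array1, array2)
```

Because it builds new lists, the inputs are not modified, and it returns lists rather than tuples.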
Upvotes: 0
Reputation: 2027
import numpy as np
from sklearn.utils import shuffle
a = ['a', 'b', 'c','d','e']
b = [1, 2, 3, 4, 5]
a_shuffled, b_shuffled = shuffle(np.array(a), np.array(b))
print(a_shuffled, b_shuffled)
#random output
#['e' 'c' 'b' 'd' 'a'] [5 3 2 4 1]
Upvotes: 21
Reputation: 59
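An easy and fast way to do this is to use random.seed() with random.shuffle(). Re-seeding with the same value reproduces the same random order as many times as you want, provided the lists have the same length. It will look like this:
import random
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]
seed = random.random()
random.seed(seed)
random.shuffle(a)  # lists have no shuffle() method; use random.shuffle
random.seed(seed)
random.shuffle(b)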
print(a)
print(b)
>>[3, 1, 4, 2, 5]
>>[8, 6, 9, 7, 10]
This also works when you can't hold both lists in memory at the same time.
Upvotes: 5
Reputation: 870
Here is an easy way to do this:
import numpy as np
a = np.array([0,1,2,3,4])
b = np.array([5,6,7,8,9])
indices = np.arange(a.shape[0])
np.random.shuffle(indices)
a = a[indices]
b = b[indices]
# a, array([3, 4, 1, 2, 0])
# b, array([8, 9, 6, 7, 5])
Upvotes: 77
Reputation: 2025
Shuffle an arbitrary number of lists simultaneously.
from random import shuffle
def shuffle_list(*ls):
    l = list(zip(*ls))
    shuffle(l)
    return zip(*l)
a = [0,1,2,3,4]
b = [5,6,7,8,9]
a1,b1 = shuffle_list(a,b)
print(a1,b1)
a = [0,1,2,3,4]
b = [5,6,7,8,9]
c = [10,11,12,13,14]
a1,b1,c1 = shuffle_list(a,b,c)
print(a1,b1,c1)
Output:
$ (0, 2, 4, 3, 1) (5, 7, 9, 8, 6)
$ (4, 3, 0, 2, 1) (9, 8, 5, 7, 6) (14, 13, 10, 12, 11)
Note: objects returned by shuffle_list() are tuples.
P.S. shuffle_list() can also be applied to numpy.array():
import numpy as np
a = np.array([1,2,3])
b = np.array([4,5,6])
a1,b1 = shuffle_list(a,b)
print(a1,b1)
Output:
$ (3, 1, 2) (6, 4, 5)
Upvotes: 7
Reputation: 41
You can use the second argument of the shuffle function to fix the order of shuffling.
Specifically, you can pass, as the second argument of shuffle, a zero-argument function that returns a value in [0, 1). The return value of this function fixes the order of shuffling. (By default, i.e. if you do not pass any function as the second argument, it uses random.random(). You can see it at line 277 here.) Note that this second argument was deprecated in Python 3.9 and removed in Python 3.11, so this only works on older versions, including Python 2.7.
This example illustrates what I described:
import random
a = ['a', 'b', 'c', 'd', 'e']
b = [1, 2, 3, 4, 5]
r = random.random() # randomly generating a real in [0,1)
random.shuffle(a, lambda : r) # lambda : r is a zero-argument function which returns r
random.shuffle(b, lambda : r) # using the same function as in the previous line so that the shuffling order is the same
print a
print b
Output:
['e', 'c', 'd', 'a', 'b']
[5, 3, 4, 1, 2]
Upvotes: -2