Reputation: 1917

checking intersection of items in a list

let's say I have 6 lists or arrays. Each list has any amount of words.

   0   |    1   |    2   |    3    |    4    |   5    | ... N
-----------------------------------------------------------
  cat     dog      pine    tree       light    fan
  cat     dog      pine    tree       light    fan
  cat     dog      pine    tree       light    fan
  cat     dog      pine    tree       light    fan
  cat     dog      pine    tree       light    fan

I didn't feel like typing out all those words but let's say that I would like to get intersection. Finding the intersection of all is pretty simple and that can be done in python with a function like this:

all = set(zer0).intersection(one).intersection(two).intersection(...N)

I would like to make sure that I am not missing a simpler solution and not over thinking this.

For the above example, to get a match for any two lists I would need to do.

0&1, 0&2, 0&3, 0&4, 0&5, 0&..N

for three

0&1&2, 0&1&3, 0&1&4, 0&1&5, 0&1&..N

The reason I ask is because looking at the example with only two lists, what if array zero and array one doesn't contain similar words but array zero and three does?

Is there a way to generalize this, I have a strong feeling it has been solved and I am over thinking this question.

I'd like to be able to find out of say the word cat appears in 0,1, 2, ..N lists.

[EDIT] Here is some sample data that I am working with.

data0 = unicode("Rainforests are forests characterized by high rainfall, with annual rainfall between 250 and 450 centimetres (98 and 177 in).[1] There are two types of rainforest: tropical rainforest and temperate rainforest. The monsoon trough, alternatively known as the intertropical convergence zone, plays a significant role in creating the climatic conditions necessary for the Earth's tropical rainforests. Around 40% to 75% of all biotic species are indigenous to the rainforests.[2] It has been estimated that there may be many millions of species of plants, insects and microorganisms still undiscovered in tropical rainforests. Tropical rainforests have been called the \"jewels of the Earth\" and the \"world's largest pharmacy\", because over one quarter of natural medicines have been discovered there.[3] Rainforests are also responsible for 28% of the world's oxygen turnover, sometimes misnamed oxygen production,[4] processing it through photosynthesis from carbon dioxide and consuming it through respiration. The undergrowth in some areas of a rainforest can be restricted by poor penetration of sunlight to ground level. If the leaf canopy is destroyed or thinned, the ground beneath is soon colonized by a dense, tangled growth of vines, shrubs and small trees, called a jungle. The term jungle is also sometimes applied to tropical rainforests generally.", "utf-8")

data1 = unicode("Tropical rainforests are characterized by a warm and wet climate with no substantial dry season: typically found within 10 degrees north and south of the equator. Mean monthly temperatures exceed 18 °C (64 °F) during all months of the year.[5] Average annual rainfall is no less than 168 cm (66 in) and can exceed 1,000 cm (390 in) although it typically lies between 175 cm (69 in) and 200 cm (79 in).[6] Many of the world's tropical forests are associated with the location of the monsoon trough, also known as the intertropical convergence zone.[7] The broader category of tropical moist forests are located in the equatorial zone between the Tropic of Cancer and Tropic of Capricorn. Tropical rainforests exist in Southeast Asia (from Myanmar (Burma) to the Philippines, Malaysia, Indonesia, Papua New Guinea, Sri Lanka, Sub-Saharan Africa from Cameroon to the Congo (Congo Rainforest), South America (e.g. the Amazon Rainforest), Central America (e.g. Bosawás, southern Yucatán Peninsula-El Peten-Belize-Calakmul), Australia, and on many of the Pacific Islands (such as Hawaiʻi). Tropical forests have been called the \"Earth's lungs\", although it is now known that rainforests contribute little net oxygen addition to the atmosphere through photosynthesis", "utf-8")

data2 = unicode("Tropical forests cover a large part of the globe, but temperate rainforests only occur in few regions around the world. Temperate rainforests are rainforests in temperate regions. They occur in North America (in the Pacific Northwest in Alaska, British Columbia, Washington, Oregon and California), in Europe (parts of the British Isles such as the coastal areas of Ireland and Scotland, southern Norway, parts of the western Balkans along the Adriatic coast, as well as in Galicia and coastal areas of the eastern Black Sea, including Georgia and coastal Turkey), in East Asia (in southern China, Highlands of Taiwan, much of Japan and Korea, and on Sakhalin Island and the adjacent Russian Far East coast), in South America (southern Chile) and also in Australia and New Zealand.[10]", "utf-8")

I clean up the text, tokenize it into three lists, data0_list, ... data2_list.

after that a function call like this output the data.

master_list.append(data_0)
master_list.append(data_1)
master_list.append(data_2)

for item in master_list:
    for index, item in enumerate(item):
        print(index, item)

That output looks like this:

    =========== start data_0 ==============
(0, ((u'the',), 13))
(1, ((u'of',), 10))
(2, ((u'rainforests',), 7))
(3, ((u'and',), 7))
(4, ((u'tropical',), 5))
(5, ((u'to',), 4))
(6, ((u'rainforest',), 4))
(7, ((u'in',), 4))
(8, ((u'are',), 4))
(9, ((u'a',), 4))
(10, ((u'it',), 3))
(11, ((u'by',), 3))
(12, ((u'been',), 3))
(13, ((u's',), 3))
(14, ((u'is',), 3))
(15, ((u'there',), 3))
(16, ((u'have',), 2))
(17, ((u'earth',), 2))
(18, ((u'sometimes',), 2))
(19, ((u'also',), 2))
(20, ((u'oxygen',), 2))
(21, ((u'jungle',), 2))
(22, ((u'rainfall',), 2))
(23, ((u'for',), 2))
(24, ((u'through',), 2))
(25, ((u'called',), 2))
(26, ((u'be',), 2))
(27, ((u'world',), 2))
(28, ((u'species',), 2))
(29, ((u'ground',), 2))
(30, ((u'shrubs',), 1))
(31, ((u'may',), 1))
(32, ((u'biotic',), 1))
(33, ((u'from',), 1))
(34, ((u'respiration',), 1))
(35, ((u'known',), 1))
(36, ((u'largest',), 1))
(37, ((u'discovered',), 1))
(38, ((u'two',), 1))
(39, ((u'plants',), 1))
(40, ((u'conditions',), 1))
(41, ((u'insects',), 1))
(42, ((u'necessary',), 1))
(43, ((u'1',), 1))
(44, ((u'convergence',), 1))
(45, ((u'jewels',), 1))
(46, ((u'poor',), 1))
(47, ((u'estimated',), 1))
(48, ((u'if',), 1))
(49, ((u'creating',), 1))
(50, ((u'that',), 1))
(51, ((u'75',), 1))
(52, ((u'growth',), 1))
(53, ((u'penetration',), 1))
(54, ((u'thinned',), 1))
(55, ((u'has',), 1))
(56, ((u'characterized',), 1))
(57, ((u'plays',), 1))
(58, ((u'temperate',), 1))
(59, ((u'production',), 1))
(60, ((u'because',), 1))
(61, ((u'high',), 1))
(62, ((u'98',), 1))
(63, ((u'trough',), 1))
(64, ((u'centimetres',), 1))
(65, ((u'over',), 1))
(66, ((u'some',), 1))
(67, ((u'undiscovered',), 1))
(68, ((u'natural',), 1))
(69, ((u'still',), 1))
(70, ((u'misnamed',), 1))
(71, ((u'all',), 1))
(72, ((u'many',), 1))
(73, ((u'sunlight',), 1))
(74, ((u'millions',), 1))
(75, ((u'dioxide',), 1))
(76, ((u'around',), 1))
(77, ((u'28',), 1))
(78, ((u'monsoon',), 1))
(79, ((u'canopy',), 1))
(80, ((u'photosynthesis',), 1))
(81, ((u'level',), 1))
(82, ((u'177',), 1))
(83, ((u'trees',), 1))
(84, ((u'carbon',), 1))
(85, ((u'one',), 1))
(86, ((u'4',), 1))
(87, ((u'between',), 1))
(88, ((u'areas',), 1))
(89, ((u'responsible',), 1))
(90, ((u'as',), 1))
(91, ((u'vines',), 1))
(92, ((u'450',), 1))
(93, ((u'turnover',), 1))
(94, ((u'leaf',), 1))
(95, ((u'role',), 1))
(96, ((u'indigenous',), 1))
(97, ((u'can',), 1))
(98, ((u'with',), 1))
(99, ((u'types',), 1))
(100, ((u'alternatively',), 1))
(101, ((u'annual',), 1))
(102, ((u'generally',), 1))
(103, ((u'zone',), 1))
(104, ((u'beneath',), 1))
(105, ((u'significant',), 1))
(106, ((u'consuming',), 1))
(107, ((u'microorganisms',), 1))
(108, ((u'applied',), 1))
(109, ((u'soon',), 1))
(110, ((u'2',), 1))
(111, ((u'tangled',), 1))
(112, ((u'250',), 1))
(113, ((u'restricted',), 1))
(114, ((u'undergrowth',), 1))
(115, ((u'medicines',), 1))
(116, ((u'climatic',), 1))
(117, ((u'colonized',), 1))
(118, ((u'forests',), 1))
(119, ((u'dense',), 1))
(120, ((u'pharmacy',), 1))
(121, ((u'quarter',), 1))
(122, ((u'intertropical',), 1))
(123, ((u'term',), 1))
(124, ((u'or',), 1))
(125, ((u'destroyed',), 1))
(126, ((u'processing',), 1))
(127, ((u'3',), 1))
(128, ((u'small',), 1))
(129, ((u'40',), 1))
    =========== start data_1 ==============
(0, ((u'the',), 15))
(1, ((u'of',), 8))
(2, ((u'in',), 6))
(3, ((u'and',), 6))
(4, ((u'tropical',), 5))
(5, ((u'cm',), 4))
(6, ((u'to',), 3))
(7, ((u'are',), 3))
(8, ((u'rainforests',), 3))
(9, ((u'forests',), 3))
(10, ((u'south',), 2))
(11, ((u'from',), 2))
(12, ((u'it',), 2))
(13, ((u'g',), 2))
(14, ((u'no',), 2))
(15, ((u'known',), 2))
(16, ((u'rainforest',), 2))
(17, ((u'exceed',), 2))
(18, ((u'although',), 2))
(19, ((u'typically',), 2))
(20, ((u'america',), 2))
(21, ((u'e',), 2))
(22, ((u'many',), 2))
(23, ((u's',), 2))
(24, ((u'between',), 2))
(25, ((u'as',), 2))
(26, ((u'is',), 2))
(27, ((u'with',), 2))
(28, ((u'zone',), 2))
(29, ((u'congo',), 2))
(30, ((u'tropic',), 2))
(31, ((u'equatorial',), 1))
(32, ((u'within',), 1))
(33, ((u'located',), 1))
(34, ((u'convergence',), 1))
(35, ((u'now',), 1))
(36, ((u'el',), 1))
(37, ((u'by',), 1))
(38, ((u'saharan',), 1))
(39, ((u'average',), 1))
(40, ((u'lungs',), 1))
(41, ((u'less',), 1))
(42, ((u'64',), 1))
(43, ((u'have',), 1))
(44, ((u'degreef',), 1))
(45, ((u'temperatures',), 1))
(46, ((u'1',), 1))
(47, ((u'africa',), 1))
(48, ((u'earth',), 1))
(49, ((u'200',), 1))
(50, ((u'australia',), 1))
(51, ((u'18',), 1))
(52, ((u'peninsula',), 1))
(53, ((u'indonesia',), 1))
(54, ((u'that',), 1))
(55, ((u'390',), 1))
(56, ((u'been',), 1))
(57, ((u'10',), 1))
(58, ((u'characterized',), 1))
(59, ((u'also',), 1))
(60, ((u'yucatan',), 1))
(61, ((u'6',), 1))
(62, ((u'such',), 1))
(63, ((u'months',), 1))
(64, ((u'000',), 1))
(65, ((u'islands',), 1))
(66, ((u'trough',), 1))
(67, ((u'dry',), 1))
(68, ((u'66',), 1))
(69, ((u'equator',), 1))
(70, ((u'season',), 1))
(71, ((u'mean',), 1))
(72, ((u'sub',), 1))
(73, ((u'oxygen',), 1))
(74, ((u'degrees',), 1))
(75, ((u'7',), 1))
(76, ((u'rainfall',), 1))
(77, ((u'lanka',), 1))
(78, ((u'all',), 1))
(79, ((u'monthly',), 1))
(80, ((u'cancer',), 1))
(81, ((u'monsoon',), 1))
(82, ((u'asia',), 1))
(83, ((u'on',), 1))
(84, ((u'photosynthesis',), 1))
(85, ((u'degreec',), 1))
(86, ((u'southern',), 1))
(87, ((u'location',), 1))
(88, ((u'addition',), 1))
(89, ((u'sri',), 1))
(90, ((u'capricorn',), 1))
(91, ((u'southeast',), 1))
(92, ((u'warm',), 1))
(93, ((u'found',), 1))
(94, ((u'through',), 1))
(95, ((u'cameroon',), 1))
(96, ((u'climate',), 1))
(97, ((u'called',), 1))
(98, ((u'bosawas',), 1))
(99, ((u'pacific',), 1))
(100, ((u'69',), 1))
(101, ((u'5',), 1))
(102, ((u'can',), 1))
(103, ((u'burma',), 1))
(104, ((u'79',), 1))
(105, ((u'papua',), 1))
(106, ((u'annual',), 1))
(107, ((u'lies',), 1))
(108, ((u'atmosphere',), 1))
(109, ((u'substantial',), 1))
(110, ((u'new',), 1))
(111, ((u'168',), 1))
(112, ((u'category',), 1))
(113, ((u'moist',), 1))
(114, ((u'year',), 1))
(115, ((u'little',), 1))
(116, ((u'contribute',), 1))
(117, ((u'during',), 1))
(118, ((u'175',), 1))
(119, ((u'belize',), 1))
(120, ((u'wet',), 1))
(121, ((u'than',), 1))
(122, ((u'guinea',), 1))
(123, ((u'north',), 1))
(124, ((u'philippines',), 1))
(125, ((u'hawai\u02bbi',), 1))
(126, ((u'myanmar',), 1))
(127, ((u'world',), 1))
(128, ((u'peten',), 1))
(129, ((u'exist',), 1))
(130, ((u'net',), 1))
(131, ((u'a',), 1))
(132, ((u'broader',), 1))
(133, ((u'intertropical',), 1))
(134, ((u'calakmul',), 1))
(135, ((u'central',), 1))
(136, ((u'associated',), 1))
(137, ((u'malaysia',), 1))
(138, ((u'amazon',), 1))
    =========== start data_2 ==============
(0, ((u'in',), 11))
(1, ((u'the',), 9))
(2, ((u'and',), 9))
(3, ((u'of',), 7))
(4, ((u'temperate',), 3))
(5, ((u'southern',), 3))
(6, ((u'as',), 3))
(7, ((u'coastal',), 3))
(8, ((u'rainforests',), 3))
(9, ((u'east',), 2))
(10, ((u'parts',), 2))
(11, ((u'america',), 2))
(12, ((u'areas',), 2))
(13, ((u'british',), 2))
(14, ((u'coast',), 2))
(15, ((u'occur',), 2))
(16, ((u'regions',), 2))
(17, ((u'are',), 1))
(18, ((u'turkey',), 1))
(19, ((u'they',), 1))
(20, ((u'on',), 1))
(21, ((u'australia',), 1))
(22, ((u'far',), 1))
(23, ((u'oregon',), 1))
(24, ((u'galicia',), 1))
(25, ((u'chile',), 1))
(26, ((u'island',), 1))
(27, ((u'few',), 1))
(28, ((u'zealand',), 1))
(29, ((u'columbia',), 1))
(30, ((u'but',), 1))
(31, ((u'world',), 1))
(32, ((u'sea',), 1))
(33, ((u'taiwan',), 1))
(34, ((u'northwest',), 1))
(35, ((u'europe',), 1))
(36, ((u'10',), 1))
(37, ((u'much',), 1))
(38, ((u'also',), 1))
(39, ((u'north',), 1))
(40, ((u'adriatic',), 1))
(41, ((u'such',), 1))
(42, ((u'cover',), 1))
(43, ((u'forests',), 1))
(44, ((u'part',), 1))
(45, ((u'including',), 1))
(46, ((u'western',), 1))
(47, ((u'a',), 1))
(48, ((u'norway',), 1))
(49, ((u'large',), 1))
(50, ((u'georgia',), 1))
(51, ((u'well',), 1))
(52, ((u'south',), 1))
(53, ((u'globe',), 1))
(54, ((u'tropical',), 1))
(55, ((u'adjacent',), 1))
(56, ((u'washington',), 1))
(57, ((u'only',), 1))
(58, ((u'russian',), 1))
(59, ((u'pacific',), 1))
(60, ((u'japan',), 1))
(61, ((u'black',), 1))
(62, ((u'along',), 1))
(63, ((u'highlands',), 1))
(64, ((u'ireland',), 1))
(65, ((u'sakhalin',), 1))
(66, ((u'balkans',), 1))
(67, ((u'korea',), 1))
(68, ((u'asia',), 1))
(69, ((u'around',), 1))
(70, ((u'scotland',), 1))
(71, ((u'eastern',), 1))
(72, ((u'alaska',), 1))
(73, ((u'china',), 1))
(74, ((u'isles',), 1))
(75, ((u'new',), 1))
(76, ((u'california',), 1))

in this example the word rainforest, world, forest and some other more common words are in all three data sets.

What I am trying to do now is find the words that are in more than one list.

So for instance the I would like to be able to say the word rainforest is in 3/3 lists.

the word oxygen on the other hand is in 2/3 lists, it's in data_0 & data_1.

Upvotes: 1

Answers (5)

schesis

Reputation: 59168

Although the bulk of your question talks about intersections of sets, what you actually want appears not to be directly related to that concept:

I'd like to be able to find out of say the word cat appears in 0,1, 2, ..N lists.

You can find this out without bothering with intersections, sets and so on:

one = ['cat', 'dog', 'pine']
two = ['cat', 'fan', 'pine']
three = ['cat', 'pine', 'tree']
four = ['dog', 'pine', 'tree']
five = ['fan', 'pine', 'tree']
six = ['light', 'pine', 'tree']

>>> sum(True for s in (one, two, three, four, five, six) if 'cat' in s)
3
>>> sum(True for s in (one, two, three, four, five, six) if 'tree' in s)
4

This works because True acts like the integer 1 when used in arithmetic (on which sum() is based).

If what you actually want is the intersection of all the "sets", that's also straightforward:

>>> set.intersection(*(set(s) for s in (one, two, three, four, five, six)))
{'pine'}

Update: Now that you've clarified your problem, it's clear that you actually need to count how many times a word occurs across your various lists. Apart from the method described above to count occurrences of a single word, and as I mentioned in my comment on Andrea Reina's answer (and Moinuddin Quadri subsequently added to his own answer), an idiomatic way to do that in Python is with collections.Counter and itertools.chain:

>>> from collections import Counter
>>> from itertools import chain
>>> counts = Counter(chain(one, two, three, four, five, six))
>>> counts
Counter({'pine': 6, 'tree': 4, 'cat': 3, 'dog': 2, 'fan': 2, 'light': 1})
>>> counts['cat']
3

Upvotes: 1

Moinuddin Quadri

Reputation: 48090

Simplest way to find the intersection of multiple list is to use list slicing feature along with set.intersection(). For example:

my_list =[
    ['cat', 'dog', 'fan'],
    ['cat', 'dog', 'pine'],
    ['cat', 'light', 'tree', 'dog'],
    ['dog', 'pine', 'cat', 'tree'],
    ['fan', 'pine', 'dog', 'tree', 'cat'],
    ['light', 'dog', 'pine', 'cat', 'tree']]

Then intersection of all the list can be calculated as:

#                              v  Unwrapped list from index '1'
set(my_list[0]).intersection(*my_list[1:])
#           ^ First element in list

which will return:

set(['dog', 'cat'])

Edit: Looks like you do not need intersection. You need to find count of item in all the list based on statement:

I'd like to be able to find out of say the word cat appears in 0,1, 2, ..N lists.

If you care about just the count of items, you may use collections.Counter() along with itertools.chain() as:

from itertools import chain
from collections import Counter

my_count = Counter(chain(*my_list))

where my_count will hold:

{'dog': 6, 
 'cat': 6, 
 'tree': 4, 
 'pine': 4, 
 'light': 2, 
 'fan': 2}

If you also want a mapping of item with it's list, you can create dict to map the items. But, firstly you need union of all the items as:

all_items = set(my_list[0]).union(*my_list[1:])
# which will hold: set(['light', 'tree', 'dog', 'pine', 'cat', 'fan'])

Then store it in dict. I am using collections.defaultdict() for ease:

from collections import defaultdict
my_dict = defaultdict(list)

for item in all_items:
    for sub_list in my_list:
        my_dict[item].append(item in sub_list)

Now my_dict will hold value:

{
     'light': [False, False, True, False, False, True], 
     #          ^              ^ Present in list 3
     #          ^  Not present in list 1
     'tree': [False, False, True, True, True, True], 
     'dog': [True, True, True, True, True, True], 
     'pine': [False, True, False, True, True, True], 
     'cat': [True, True, True, True, True, True], 
     'fan': [True, False, False, False, True, False]
}

You can find the occurrence count from this dict.

Upvotes: 1

Andrea Reina

Reputation: 1240

Refining Zero Piraeus' answer:

one = ['cat', 'dog', 'fan']
two = ['cat', 'dog', 'pine']
three = ['cat', 'light', 'tree']
four = ['dog', 'pine', 'tree']
five = ['fan', 'pine', 'tree']
six = ['light', 'pine', 'tree']
lists = one + two + three + four + five + six

[(e, lists.count(e)) for e in set(lists)]
# => [('light', 2), ('tree', 4), ('dog', 3), ('pine', 4), ('cat', 3), ('fan', 2)]

Upvotes: 0

Reut Sharabani

Reputation: 31339

If you can use frozenset instead of list, this is a generalization:

To create all combinations we uses itertools.combinations:

from itertools import combinations

We will use frozenset which is hashable (later used as dict keys):

sets = tuple(frozenset(s) for s in (set([1,2,3]), set([2,3,4]), set([3,4,5]), set([4,5,6])))

Create the mappings by applying frozenset.intersections across all combinations (picked combinations of size 3 as an example), store results in a dict:

intersections = {frozenset(k): frozenset.intersection(*k) for k in combinations(sets, 3)}

Result:

{frozenset({frozenset({2, 3, 4}), frozenset({3, 4, 5}), frozenset({4, 5, 6})}): frozenset({4}), frozenset({frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 5, 6})}): frozenset(), frozenset({frozenset({1, 2, 3}), frozenset({3, 4, 5}), frozenset({4, 5, 6})}): frozenset(), frozenset({frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({3, 4, 5})}): frozenset({3})}

Upvotes: 0

falsetru

Reputation: 369334

You can use reduce (functools.reduce if you are using Python 3.x)

>>> from functools import reduce  # for python 3.x
>>> animals_list = [
...     ['cat', 'dog', 'pine', 'tree', 'light', 'fan'],
...     ['cat', 'pine', 'tree', 'light', 'fan'],
...     ['cat', 'dog', 'pine', 'light', 'fan'],
...     ['cat', 'dog', 'pine', 'tree', 'fan'],
... ]
>>> reduce(lambda x, y: set(x).intersection(y), animals_list)
{'pine', 'fan', 'cat'}

Upvotes: 0

checking intersection of items in a list

Answers (5)

Related Questions