mattrweaver
mattrweaver

Reputation: 789

Python/Pandas: creating new dataframe, gets error "unalignable boolean Series provided as indexer"

I am trying to compare two dataframes and return different result sets based on whether a value from one dataframe is present in the other.

Here is my sample code:

pmdf = pd.DataFrame(
        {
        'Journal' : ['US Drug standards.','Acta veterinariae.','Bulletin of big toe science.','The UK journal of dermatology.','Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315','0007-4977','0007-0963','8675-309J'],
        }
        )

pmdf = pmdf[['Journal'] + pmdf.columns[:-1].tolist()]

jcrdf = pd.DataFrame(
        {
        'Full Journal Title': ['Drug standards.','Acta veterinaria.','Bulletin of marine science.','The British journal of dermatology.'],
        'Abbreviated Title': ['DStan','Avet','Marsci','BritSkin'],
        'Total Cites': ['223','444','324','166'],
        'ISSN': ['0096-0225','0567-8315','0007-4977','0007-0963'],   
        'All_ISSNs': ['0096-0225,0096-0225','0567-8315,1820-7448,0567-8315','0007-4977,0007-4977','0007-0963,0007-0963,0366-077X,1365-2133']                        
         })
jcrdf = jcrdf.set_index('Full Journal Title')

pmdf_issn = pmdf['ISSN'].values.tolist()

This line gets me the rows from dataframe jcrdf that contain the issn from dataframe pmdf

pmjcrmatch = jcrdf[jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]

I wanted the following line to create a new dataframe of values from pmdf where the ISSN is not in jcfdf so I negated the previous statement and chose the first dataframe.

pmjcrnomatch = pmdf[~jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]

I get an error: "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match"

I don't find a lot about this specific error, at least nothing that is helping me toward a solution.

Is "str.contains" not the best way of sorting items that are and aren't in the second dataframe?

Upvotes: 0

Views: 1662

Answers (1)

Jan Zeiseweis
Jan Zeiseweis

Reputation: 3738

You are trying to apply the boolean index of one dataframe to another. This is only possible if the length of both dataframes match. In your case you should use isin.

# get all rows from jcrdf where `ALL_ISSNs` contains any of the `ISSN` in `pmdf`.
pmjcrmatch = jcrdf[jcrdf.All_ISSNs.str.contains('|'.join(pmdf.ISSN))]
# assign all remaining rows from `jcrdf` to a new dataframe.
pmjcrnomatch = jcrdf[~jcrdf.ISSN.isin(pmjcrmatch.ISSN)]

EDIT Let's try another approach:

First i'd create a lookup for all you ISSNs and then create the diff by isolating the matches:

import pandas as pd

pmdf = pd.DataFrame(
        {
        'Journal' : ['US Drug standards.','Acta veterinariae.','Bulletin of big toe science.','The UK journal of dermatology.','Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315','0007-4977','0007-0963','8675-309J'],
        }
        )

pmdf = pmdf[['Journal'] + pmdf.columns[:-1].tolist()]

jcrdf = pd.DataFrame(
        {
        'Full Journal Title': ['Drug standards.','Acta veterinaria.','Bulletin of marine science.','The British journal of dermatology.'],
        'Abbreviated Title': ['DStan','Avet','Marsci','BritSkin'],
        'Total Cites': ['223','444','324','166'],
        'ISSN': ['0096-0225','0567-8315','0007-4977','0007-0963'],
        'All_ISSNs': ['0096-0225,0096-0225','0567-8315,1820-7448,0567-8315','0007-4977,0007-4977','0007-0963,0007-0963,0366-077X,1365-2133']
         })
jcrdf = jcrdf.set_index('Full Journal Title')

# create lookup from all issns to avoid expansice string matching
jcrdf_lookup = pd.DataFrame(jcrdf['All_ISSNs'].str.split(',').tolist(),
                            index=jcrdf.ISSN).stack(level=0).reset_index(level=0)

# compare extracted ISSNs from ALL_ISSNs with pmdf.ISSN
matches = jcrdf_lookup[jcrdf_lookup[0].isin(pmdf.ISSN)]
jcrdfmatch = jcrdf[jcrdf.ISSN.isin(matches.ISSN)]
jcrdfnomatch = pmdf[~pmdf.ISSN.isin(matches[0])]

Upvotes: 1

Related Questions