Python int comparison not working properly in pandas

Question

I am developing a function for calculating a number for a document based on evaluations on a dataset. I chose pandas since it seemed to be the most efficient way of using a big dataset. My columns are: citing (identifiers), cited (identifiers), creation (string YYYY-MM or YYYY).
I need to add to a set all the identifiers of citing objects that meet the criterion of being created in year-1 or year-2. I found this cool trick to subset a DataFrame through indexing: I save the indexed DataFrame to a local variable ('citing') and then use .loc[identifier]['creation'] to get the value of that row in the creation column. Thing is, this can return either a Series (when the identifier appears in more than one row) or a string (just one value, so directly the creation date).
Since the value can be in either str(YYYY-MM) or str(YYYY) format, I have to slice it with [:4] to make the actual comparison. I tried to do a conditional block based on datatype, but something must've gone wrong, because what I print with my DEBUG lines is this:

DEBUG: 2014 is == to either 2015 or 2014
DEBUG: 2016 is == to either 2015 or 2014
DEBUG: 2018 is == to either 2015 or 2014
DEBUG: 2015 is == to either 2015 or 2014

I also tried a string comparison, converting the dates with str() and then comparing strings; unfortunately I got the same result.

for identifier in ls:
    citing = data.set_index('citing')  # save data indexed by 'citing' column to local variable
    try:                               # handle KeyError exception

        creation = citing.loc[identifier]['creation']  # this can either be a str or a pandas series

        if type(creation) == pandas.core.series.Series:
            if int(creation.iloc[0][:4]) == (int(year))-1 or int(creation.iloc[0][:4]) == (int(year))-2:
                print('DEBUG: ', creation.iloc[0][:4], 'is == to either {} or {}'.format(str(int(year)-1), str(int(year)-2)))
                pub.add(identifier)

        elif type(creation) == str:
            if int(creation[:4]) == (int(year))-1 or (int(year))-2:
                print('DEBUG: ', creation[:4], 'is == to either {} or {}'.format(str(int(year)-1), str(int(year)-2)))
                pub.add(identifier)

    except KeyError:
        pass

This is really my first complex function in Python, so some things may be obviously wrong, slow, or inefficient. Please be so kind as to spell them out for me so that I can improve my function! Thank you!

EDIT: sample input as pandas dataframe:

   citing  cited  creation
0  1234    1235   2018-11
1  1237    1234   2017
2  1236    1237   2011-01
3  1234    1248   2018-11
4  1235    1236   2018-11

If the input were this DataFrame and the year 2018, the result set should only contain {1237}, since it is the only one created in y-1 or y-2.

tgrandje · Accepted Answer

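The loop in the question accepts every year because, in the str branch, the condition int(creation[:4]) == (int(year))-1 or (int(year))-2 is parsed as (int(creation[:4]) == int(year)-1) or (int(year)-2), and a non-zero integer is always truthy. Rather than patching the loop, you can locate all rows matching your criteria in (almost) a single shot. This is also more efficient, as the criteria are computed against all rows at once instead of looping over each value.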

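# df is the DataFrame called data in the question
year = int(year)  # the question casts year with int(), so it may arrive as a string
# index labels of the rows created in year-1 or year-2
ix = df[
    df.creation.astype(str).str[:4].astype(int).isin({year - 1, year - 2})
].index
identifiers = set(df.loc[ix, 'citing'])  # 'citing' identifiers of those rows
pub |= identifiers  # add them to the existing result set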

More explanations:

.astype(str) -> make sure every value is of type str, even for years (just in case)

.str -> the pandas string accessor, which lets you use string methods on every value (see the pandas documentation on working with text data)

[:4] -> slicing through the .str accessor, which captures the first 4 characters of each value

.astype(int) -> casts the whole result to int (note that this may fail if you have rows with missing values; see the workaround below)

.isin(...) -> checks, for each row, whether the value is inside (...)

You will get an "index", which can be used to filter the dataframe in one operation.
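To make the chain concrete, here is a minimal sketch applied to the sample rows from the question's EDIT, with year set to 2018. The names creation_year and mask are introduced only for illustration, and the boolean mask is passed to .loc directly instead of going through .index, which is a small variation on the code above.

```python
import pandas as pd

# Sample rows copied from the question's EDIT.
df = pd.DataFrame(
    {
        "citing": [1234, 1237, 1236, 1234, 1235],
        "cited": [1235, 1234, 1237, 1248, 1236],
        "creation": ["2018-11", "2017", "2011-01", "2018-11", "2018-11"],
    }
)
year = 2018

# First four characters of each creation date, cast to int.
creation_year = df["creation"].astype(str).str[:4].astype(int)

# True for the rows whose creation year is year-1 or year-2.
mask = creation_year.isin({year - 1, year - 2})

# Collect the matching 'citing' identifiers as plain Python ints.
pub = set(df.loc[mask, "citing"].tolist())
print(pub)  # {1237}
```

Only the row with creation 2017 matches, so the set contains 1237, which is the result the question expects.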

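If you have missing values, you could start by filling them with a placeholder year, for example df['creation'] = df['creation'].fillna("1000"). Assigning the result back avoids relying on inplace=True on a column selection, which recent pandas versions warn about.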
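An alternative that avoids a placeholder year is to let pandas turn unusable values into NaN. The sketch below swaps .astype(int) for pd.to_numeric with errors="coerce"; this is a different technique from the fillna suggestion, and the extra row with identifier 1240 and no creation date is hypothetical data added for illustration.

```python
import pandas as pd

# Hypothetical data: rows from the question plus one with no creation date.
df = pd.DataFrame(
    {
        "citing": [1234, 1237, 1236, 1240],
        "creation": ["2018-11", "2017", "2011-01", None],
    }
)
year = 2018

# .str[:4] leaves missing values missing; errors="coerce" additionally
# turns any non-numeric prefix into NaN instead of raising.
creation_year = pd.to_numeric(df["creation"].str[:4], errors="coerce")

# NaN never matches, so rows without a usable date are excluded.
mask = creation_year.isin([year - 1, year - 2])
pub = set(df.loc[mask, "citing"].tolist())
print(pub)  # {1237}
```

The trade-off is that the year column becomes float because of the NaN values, which does not matter for the membership test.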
