Reputation: 670
I am developing a function for calculating a number for a document based on evaluations on a dataset. I chose pandas since it seemed to be the most efficient way of using a big dataset.
My columns are: citing (identifiers), cited (identifiers), creation (string YYYY-MM or YYYY).
I need to add to a set all the identifiers of citing objects that meet the criterion of having been created in year-1 or year-2.
I found this cool trick to subset a DataFrame through indexing: I save the indexed DataFrame to a local variable ('citing') and then use .loc[identifier]['creation'] to get the value of that row in the creation column. The thing is, this can return either a Series (when the identifier appears more than once) or a string (just one value, so directly the creation date).
Since the value can be in either str(YYYY-MM) or str(YYYY) format, I have to slice it with [:4] to make the actual comparison.
I tried to do a conditional block based on datatype but something must've gone wrong, because what I print with my DEBUG lines is this:
DEBUG: 2014 is == to either 2015 or 2014
DEBUG: 2016 is == to either 2015 or 2014
DEBUG: 2018 is == to either 2015 or 2014
DEBUG: 2015 is == to either 2015 or 2014
I also tried a string comparison, converting the dates with str() and then comparing the strings. Unfortunately, I got the same result.
for identifier in ls:
    citing = data.set_index('citing')  # save data indexed by 'citing' column to local variable
    try:  # handle KeyError exception
        creation = citing.loc[identifier]['creation']  # this can either be a str or a pandas series
        if type(creation) == pandas.core.series.Series:
            if int(creation.iloc[0][:4]) == (int(year))-1 or int(creation.iloc[0][:4]) == (int(year))-2:
                print('DEBUG: ', creation.iloc[0][:4], 'is == to either {} or {}'.format(str(int(year)-1), str(int(year)-2)))
                pub.add(identifier)
        elif type(creation) == str:
            if int(creation[:4]) == (int(year))-1 or (int(year))-2:
                print('DEBUG: ', creation[:4], 'is == to either {} or {}'.format(str(int(year)-1), str(int(year)-2)))
                pub.add(identifier)
    except KeyError:
        pass
This is really my first complex function in Python, so some things may be obviously wrong, slow or inefficient. Please be so kind as to spell them out for me so that I can improve my function! Thank you!
EDIT: sample input as pandas dataframe:
   citing  cited creation
0    1234   1235  2018-11
1    1237   1234     2017
2    1236   1237  2011-01
3    1234   1248  2018-11
4    1235   1236  2018-11
If the input were this DataFrame and the year were 2018, the result set should contain only {1237}, since it is the only identifier created in y-1 or y-2.
Upvotes: 0
Views: 727
Reputation: 2534
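As an aside on the DEBUG output: the str branch of the loop tests `int(creation[:4]) == (int(year))-1 or (int(year))-2`. Python parses that as `(a == b) or c`, and `c` is a non-zero integer, so the condition is always truthy. A minimal sketch of the effect, where `creation_year` and `year` are stand-in values (the DEBUG lines suggest year was 2016), not the real data:

```python
year = "2016"          # stand-in value, inferred from the DEBUG output
creation_year = 2018   # a year that should NOT match

# What the loop tests: (creation_year == 2015) or 2014, which evaluates to 2014 (truthy)
buggy = creation_year == int(year) - 1 or int(year) - 2

# What was presumably intended: compare against both candidate years
fixed = creation_year in (int(year) - 1, int(year) - 2)

print(bool(buggy), fixed)  # True False
```

The Series branch of the loop repeats the full comparison on both sides of `or`, so it does not have this problem.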
You can locate all rows matching your criteria in (almost) a single shot. This is also more efficient, because the criteria are evaluated against all rows at once instead of looping over each value.
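# df is the DataFrame from the question (called data there)
year = int(year)  # the question calls int(year), so year may be a string; cast it once
ix = df[
    df.creation.astype(str).str[:4].astype(int).isin({year - 1, year - 2})
].index
identifiers = set(df.loc[ix, 'citing'])
pub |= identifiers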
More explanations:
.astype(str) -> makes sure every value is of type str, even for bare years (just in case)
.str -> the pandas string accessor, which lets you apply string operations to every value (see the pandas documentation on working with text data)
[:4] -> slices each string through the accessor, keeping the first 4 characters
.astype(int) -> casts the whole result to int (note that this may fail if some rows have missing values; see a workaround below)
.isin(...) -> checks, for each row, whether the value is inside (...)
You will get an "index", which can be used to filter the dataframe in one operation.
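To make this concrete, here is a minimal self-contained sketch that runs the same chain on the sample data from the question. It uses a boolean mask directly instead of going through `.index`, which is a small simplification and not a requirement; the name `mask` and the `.tolist()` call (to get plain Python ints in the set) are my own choices.

```python
import pandas as pd

# Sample input copied from the question
df = pd.DataFrame({
    "citing": [1234, 1237, 1236, 1234, 1235],
    "cited": [1235, 1234, 1237, 1248, 1236],
    "creation": ["2018-11", "2017", "2011-01", "2018-11", "2018-11"],
})
year = 2018
pub = set()

# First 4 characters of every creation date, as integers
creation_year = df["creation"].astype(str).str[:4].astype(int)

# True for rows created in year-1 or year-2
mask = creation_year.isin({year - 1, year - 2})

# Collect the matching identifiers as plain Python ints
pub |= set(df.loc[mask, "citing"].tolist())
print(pub)  # {1237}
```

Only the row created in 2017 matches, which is the result the question expects.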
If you have missing values, you could start by filling them with a placeholder, for example df['creation'] = df['creation'].fillna("1000").
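An alternative to a placeholder value, sketched here under the assumption that rows without a date should simply be skipped, is to convert the years with `pd.to_numeric(..., errors="coerce")`. Missing or unparseable values become NaN, and NaN never matches the set of years. The row with a missing date below is hypothetical and not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "citing": [1234, 1237, 1236],
    "creation": ["2018-11", "2017", None],  # hypothetical row with a missing date
})
year = 2018

# errors="coerce" turns anything that cannot be parsed as a number into NaN
creation_year = pd.to_numeric(df["creation"].str[:4], errors="coerce")

# NaN is not in the set, so rows without a date are left out
mask = creation_year.isin({year - 1, year - 2})
identifiers = set(df.loc[mask, "citing"].tolist())
print(identifiers)
```

This avoids having to pick a sentinel year such as "1000" that must never collide with a real one.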
Upvotes: 1