Lorcán
Lorcán

Reputation: 555

Using regex to obtain the row index of a partial match within a pandas dataframe column

I am trying to use regex to identify particular rows of a large pandas dataframe. Specifically, I intend to match the DOI of a paper to an xml ID that contains the DOI number.

# An example of the dataframe and a test doi:

                                    ScID.xml     journal  year    topic1
0    0009-3570(2017)050[0199:omfg]2.3.co.xml  Journal_1   2017  0.000007
1  0001-3568(2001)750[0199:smdhmf]2.3.co.xml  Journal_3   2001  0.000648
2  0002-3568(2004)450[0199:gissaf]2.3.co.xml  Journal_1   2004  0.000003
3  0003-3568(2011)150[0299:easayy]2.3.co.xml  Journal_1   2011  0.000003

# A dummy doi:

test_doi = '0002-3568(2004)450'

In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.

I have searched this site and applied the methods described for similar scenarios.

Including:

df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()

This returns:

AttributeError: 'str' object has no attribute 'contains'

and:

df.filter(regex=test_doi)

gives:

Empty DataFrame

Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[287459 rows x 0 columns]

and finally:

df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]

which also returns the Empty DataFrame as above.

All help is appreciated. Thank you.

Upvotes: 2

Views: 2394

Answers (1)

P.Tillmann
P.Tillmann

Reputation: 2110

There are two reasons why your first approach does not work:

First - If you use apply on a series the values in the lambda function will not be a series but a scalar. And because contains is a function from pandas and not from strings you get your error message.

Second - Brackets have a special meaning in a regex (the delimit a capture group). If you want the brackets as characters you have to escape them.

test_doi = '0002-3568\(2004\)450'
df.loc[df.iloc[:,0].str.contains(test_doi)]
                                    ScID.xml    journal  year    topic1
2  0002-3568(2004)450[0199:gissaf]2.3.co.xml  Journal_1  2004  0.000003

Bye the way, pandas filter function filters on the label of the index, not the values.

Upvotes: 5

Related Questions