Chuck

Reputation: 1293

How to get the row index of a Pandas DataFrame from a regex match

This question has been asked before, but I didn't find the answers complete. I have a DataFrame with unnecessary values in the first row, and I want to find the row index of the animals:

import re
import pandas as pd

df = pd.DataFrame({'a':['apple','rhino','gray','horn'],
                   'b':['honey','elephant', 'gray','trunk'],
                   'c':['cheese','lion', 'beige','mane']})

       a         b       c
0  apple     honey  cheese
1  rhino  elephant    lion
2   gray      gray   beige
3   horn     trunk    mane

ani_pat = r"rhino|zebra|lion"

That means I want to find "1", the index of the row that matches the pattern. One solution I saw here, applied to my problem, looks like this:

def findIdx(df, pattern):
    return df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).values.nonzero()

animal = findIdx(df, ani_pat)
print(animal)
(array([1, 1], dtype=int64), array([0, 2], dtype=int64))

That output is a tuple of NumPy arrays. I've got the basics of NumPy and Pandas, but I'm not sure what to do with this or how it relates to the df above.

I altered that lambda expression like this:

df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE))

       a      b      c
0  False  False  False
1   True  False   True
2  False  False  False
3  False  False  False

That makes a little more sense, but I'm still trying to get the row index of the True values. How can I do that?

Upvotes: 0

Views: 43

Answers (1)

Henry Ecker

Reputation: 35676

We can build a boolean filter and use it to select, from the DataFrame index, the rows that contain at least one True value:

idx = df.index[
    df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).any(axis=1)
]

idx:

Int64Index([1], dtype='int64')

any on axis 1 reduces the boolean DataFrame to a boolean Series with one value per row: True if any column in that row matched, False otherwise.

Before any:

       a      b      c
0  False  False  False
1   True  False   True
2  False  False  False
3  False  False  False

After any:

0    False
1     True
2    False
3    False
dtype: bool
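As a minimal sketch of that reduction, the boolean frame below is typed in by hand to mirror the "Before any" table (the names mask, row_hits and col_hits are only illustrative). It contrasts axis=1, which collapses across columns to give one value per row, with axis=0, which gives one value per column:

```python
import pandas as pd

# Hand-built boolean frame mirroring the "Before any" table above.
mask = pd.DataFrame({
    'a': [False, True, False, False],
    'b': [False, False, False, False],
    'c': [False, True, False, False],
})

# axis=1: look across the columns of each row -> one boolean per row.
row_hits = mask.any(axis=1)

# axis=0: look down the rows of each column -> one boolean per column.
col_hits = mask.any(axis=0)

print(row_hits.tolist())  # [False, True, False, False]
print(col_hits.tolist())  # [True, False, True]
```

Only the axis=1 result lines up with the row index, which is why it is the one used for selecting rows here.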

We can then use this boolean Series as a mask on df.index, keeping only the index labels where the value is True:

Int64Index([1], dtype='int64')

If needed, we can call tolist on the result to get a plain Python list instead:

idx = df.index[
    df.apply(lambda x: x.str.match(ani_pat, flags=re.IGNORECASE)).any(axis=1)
].tolist()

idx:

[1]
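For the tuple that the nonzero approach in the question returns, here is a hedged sketch of how it might be read. The first array holds the row positions of the True cells and the second holds their column positions, paired element by element. The variable names (rows, cols, row_positions, row_labels, matched_columns) are only illustrative. Note that nonzero returns positions, not labels, so they are mapped back through df.index and df.columns:

```python
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['apple', 'rhino', 'gray', 'horn'],
                   'b': ['honey', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})
ani_pat = r"rhino|zebra|lion"

# Column-wise regex match; str.match anchors the pattern at the start of each string.
mask = df.apply(lambda col: col.str.match(ani_pat, flags=re.IGNORECASE))

# nonzero() returns (row positions, column positions) of the True cells.
rows, cols = mask.to_numpy(dtype=bool).nonzero()
# rows -> [1, 1], cols -> [0, 2]: matches at (row 1, column 0) and (row 1, column 2)

# A row that matches in several columns appears several times, so deduplicate.
row_positions = np.unique(rows)

# Translate positions back to index and column labels.
row_labels = df.index[row_positions].tolist()
matched_columns = df.columns[cols].tolist()

print(row_labels)       # [1]
print(matched_columns)  # ['a', 'c']
```

With the default index, positions and labels coincide; with a custom index (dates or strings, say) only the df.index lookup gives the labels, which is one reason the any(axis=1) mask is the more direct route when just the rows are needed.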

Upvotes: 1
