Usage of str.contains() applied to pandas data frame

Question

I am new to Python and Jupyter Notebook and I am currently following this tutorial: https://www.dataquest.io/blog/jupyter-notebook-tutorial/. So far I've imported the pandas library and a couple other things, and I've made a data frame 'df' which is just a CSV file of company profit and revenue data. I'm having trouble understanding the following line of the tutorial:

non_numberic_profits = df.profit.str.contains('[^0-9.-]')

I understand the point of what the tutorial is doing: identifying all the companies whose profit variable contains a string instead of a number. But I don't understand the point of [^0-9.-] and how the above function actually works.

My full code is below. Thanks.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

df = pd.read_csv('fortune500.csv')
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()

Milo · Accepted Answer

The expression [^0-9.-] is a so-called regular expression, which is a special text string for describing a search pattern. With regular expressions (or in short 'RegEx') you can extract specific parts of a string. For example, you can extract foo from the string 123foo456.

In RegEx, [] defines a character class: a set of characters, any one of which may match at that position. For example, [bac] matches each of the characters a, b and c in the string abcdefg. [bac] could also be written as the range [a-c].

Using [^] you can negate a character class. Thus, the RegEx [^a-c] applied to the above example would match each of the characters d, e, f and g.
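To make the character class and its negation concrete, here is a minimal sketch using Python's built-in re module on the abcdefg example. The variable names are only illustrative.

```python
import re

text = "abcdefg"

# [bac] matches any single character that is a, b, or c.
listed = re.findall(r"[bac]", text)

# [a-c] is the same class written as a range.
ranged = re.findall(r"[a-c]", text)

# [^a-c] matches any single character that is NOT a, b, or c.
negated = re.findall(r"[^a-c]", text)

# Adding + groups consecutive matching characters into one substring.
negated_run = re.findall(r"[^a-c]+", text)

print(listed)       # ['a', 'b', 'c']
print(ranged)       # ['a', 'b', 'c']
print(negated)      # ['d', 'e', 'f', 'g']
print(negated_run)  # ['defg']
```

Note that a class matches one character at a time. A quantifier such as + is needed to get defg back as a single match.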

Now here is a catch:
Since ^ and - have a special meaning inside [], they have to be put in specific positions in order to be matched literally. ^ negates the class only when it is the first character after [, and - denotes a range only when it sits between two characters. So if you want to match - literally rather than have it define a range, put it at the rightmost end of [], for example [abc-]. (Putting it first, or escaping it as \-, also works.)
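A small sketch of how the position of - and ^ changes the meaning of a class, again with the re module. The sample strings are made up for illustration.

```python
import re

sample = "a-b-c-d"

# Hyphen between two characters: the range a through c; "-" itself is not matched.
range_hits = re.findall(r"[a-c]", sample)

# Hyphen at the rightmost end: a literal "-" alongside a, b and c.
literal_hits = re.findall(r"[abc-]", sample)

# Caret first: negation, so this matches anything except a, b, c and "-".
negated_hits = re.findall(r"[^abc-]", sample)

# Caret not first: it is just a literal "^".
caret_hits = re.findall(r"[a^]", "a^b")

print(range_hits)    # ['a', 'b', 'c']
print(literal_hits)  # ['a', '-', 'b', '-', 'c', '-']
print(negated_hits)  # ['d']
print(caret_hits)    # ['a', '^']
```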

Putting it all together
The RegEx '[^0-9.-]' means: 'Match any single character that is not one of the digits 0 through 9, a dot (.) or a dash (-)'. Inside [] the dot is a literal dot, not the usual "any character" wildcard.
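As a rough illustration, the pattern can be tried on a few strings with the re module. The sample values below are hypothetical and not taken from your CSV file.

```python
import re

pattern = re.compile(r"[^0-9.-]")

# Hypothetical sample values, chosen only to illustrate the pattern.
samples = ["1234.5", "-305", "N.A.", "12,5"]

# For each sample, list every character the pattern matches.
offending = {s: pattern.findall(s) for s in samples}

for value, chars in offending.items():
    print(value, chars)
# 1234.5 []
# -305 []
# N.A. ['N', 'A']
# 12,5 [',']
```

Strings made up only of digits, dots and dashes produce no match at all. Any other character, such as a letter or a comma, is matched.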

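The pandas call df.profit.str.contains('[^0-9.-]') treats the pattern as a RegEx and checks, for each string in the profit column of your DataFrame, whether the pattern matches anywhere in that string. In other words, it checks whether the string contains at least one character other than a digit, a dot or a dash, and returns True if it does and False if it doesn't. The result is a pandas Series of True/False values, which is why it can be passed to df.loc to select the matching rows.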
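Here is a minimal sketch of the same call on a small hand-made Series instead of the real CSV. The values are hypothetical stand-ins for the profit column.

```python
import pandas as pd

# Hypothetical stand-in for df.profit; the real column comes from fortune500.csv.
profit = pd.Series(["1234.5", "-305", "N.A.", "12,5"])

# True where the string contains at least one character
# that is not a digit, a dot or a dash.
mask = profit.str.contains("[^0-9.-]")

# The boolean Series can be used to select the offending entries,
# just like df.loc[non_numberic_profits] does for whole rows.
flagged = profit[mask]

print(mask.tolist())     # [False, False, True, True]
print(flagged.tolist())  # ['N.A.', '12,5']
```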

If you're ever stuck, the Pandas docs are your friend. Stack Overflow's What Does this Regex Mean? and Regex 101 are also good places to start.
