Reputation: 735
I am new to Python and Jupyter Notebook and I am currently following this tutorial: https://www.dataquest.io/blog/jupyter-notebook-tutorial/. So far I've imported the pandas library and a couple other things, and I've made a data frame 'df' which is just a CSV file of company profit and revenue data. I'm having trouble understanding the following line of the tutorial:
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
I understand the point of what the tutorial is doing: identifying all the companies whose profit variable contains a string instead of a number. But I don't understand the point of [^0-9.-] and how the above function actually works.
My full code is below. Thanks.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
df = pd.read_csv('fortune500.csv')
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()
Upvotes: 2
Views: 2490
Reputation: 3318
The expression [^0-9.-] is a so-called regular expression ('RegEx' for short), which is a special text string for describing a search pattern. With regular expressions you can find or extract specific parts of a string. For example, you can extract foo from the string 123foo456.
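As a minimal sketch using Python's built-in re module, extracting foo could look like the following. The pattern [a-z]+ is only an illustrative assumption, not something the tutorial uses.

```python
import re

# Illustrative pattern (an assumption): one or more lowercase letters.
match = re.search(r"[a-z]+", "123foo456")
extracted = match.group() if match else None
print(extracted)  # foo
```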
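In RegEx, [] defines a character class: a set of characters, any one of which may match at that position. For example, [bac] matches a single a, b, or c, so in the string abcdefg it matches each of the first three characters individually. [bac] could also be written as the range [a-c].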
Putting ^ first inside the brackets negates the character class. Thus, the RegEx [^a-c] matches any single character that is not a, b, or c; applied to the above example, it matches each of d, e, f, and g.
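To see character classes and negation in action, here is a small sketch with Python's re module. It uses re.findall, which returns every non-overlapping match, so each matched character shows up as its own list item.

```python
import re

text = "abcdefg"

# A character class matches one character at a time.
in_class = re.findall(r"[bac]", text)
# The same class written as a range.
in_range = re.findall(r"[a-c]", text)
# A leading ^ negates the class: everything except a, b, c.
not_in_class = re.findall(r"[^a-c]", text)

print(in_class)      # ['a', 'b', 'c']
print(in_range)      # ['a', 'b', 'c']
print(not_in_class)  # ['d', 'e', 'f', 'g']
```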
Now here is a catch:
Since ^ and - have a special meaning inside [], they have to be put in specific positions in order to be matched literally. Specifically, - defines a range only when it sits between two characters; to match a literal -, put it at the rightmost end of [] (the leftmost position, after any leading ^, also works), for example [abc-]. Likewise, ^ negates the class only when it comes first; anywhere else it is a literal ^.
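The positional rules can be checked with a short sketch. The sample string a-b^c is made up purely for illustration.

```python
import re

# Made-up sample string containing both special characters.
sample = "a-b^c"

# Hyphen at the end of the class: a literal '-'.
literal_hyphen = re.findall(r"[abc-]", sample)
# Hyphen between two characters: the range a through c, no '-'.
range_only = re.findall(r"[a-c]", sample)
# Caret not in first position: a literal '^'.
literal_caret = re.findall(r"[a^]", sample)

print(literal_hyphen)  # ['a', '-', 'b', 'c']
print(range_only)      # ['a', 'b', 'c']
print(literal_caret)   # ['a', '^']
```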
Putting it all together
The RegEx '[^0-9.-]' means: 'Match any single character that is not one of the digits 0 through 9, a dot (.), or a dash (-)'. Inside [] the dot is a literal dot rather than the usual any-character wildcard. You can see your regular expression applied to some example strings here.
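As a rough illustration, the pattern can be tried on a few strings with Python's re module. The sample values below are hypothetical and not taken from the tutorial's CSV; a string is flagged as soon as it contains at least one character outside digits, dot, and dash.

```python
import re

pattern = re.compile(r"[^0-9.-]")

# Hypothetical sample values, not taken from the tutorial's CSV.
samples = ["1234.5", "-87.3", "N.A.", "12,345"]

# re.search looks for the pattern anywhere in the string.
flagged = {s: bool(pattern.search(s)) for s in samples}

for value, has_other_chars in flagged.items():
    print(value, has_other_chars)
# 1234.5 False
# -87.3 False
# N.A. True
# 12,345 True
```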
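The pandas method df.profit.str.contains('[^0-9.-]') searches each string in the profit column of your DataFrame for this RegEx. It returns True if the string contains at least one such character anywhere (that is, the value is not purely numeric) and False if it doesn't. The result is a pandas Series of True/False values, which the last line of your code passes to df.loc as a mask to select the non-numeric rows.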
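Here is a small self-contained sketch of the same idea. The DataFrame below is a hypothetical stand-in for fortune500.csv, with made-up company names and profit values stored as strings.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's data; values are made up.
df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "profit": ["1234.5", "-87.3", "N.A.", "806"],
})

# True where the string contains any character other than digits, '.' or '-'.
mask = df.profit.str.contains("[^0-9.-]")
print(mask.tolist())  # [False, False, True, False]

# Use the boolean Series to select only the non-numeric rows.
non_numeric_rows = df.loc[mask]
print(non_numeric_rows)
```

If the column also held missing values, str.contains would by default return a missing value for those rows; passing na=False is one way to treat them as non-matches.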
If you're ever stuck, the Pandas docs are your friend. Stack Overflow's What Does this Regex Mean? and Regex 101 are also good places to start.
Upvotes: 3