Reputation: 99
I have created a pandas DataFrame containing one string column. I want to copy some of its rows into a second DataFrame: just the rows where the characters before the first space are an integer greater than or equal to 300, and where the characters after the first space are "Broadway". In the following example, only the first row should be copied.
I would prefer to solve this problem without simply writing a boolean expression in straight Python. Let's pretend I wanted to convince someone of the benefits of using pandas rather than Python without pandas. Thank you very much.
d = {
"address": [
"300 Broadway", #Ok.
"300 Wall Street", #Sorry, not "Broadway".
"100-10 Broadway", #Sorry, "100-10" is not an integer.
"299 Broadway", #Sorry, 299 is less than 300.
"Broadway" #Sorry, no space at all.
]
}
df = pd.DataFrame(d)
df2 = df[what goes here?] #Broadway addresses greater than or equal to 300
print(df2)
Upvotes: 0
Views: 6188
Reputation: 61910
I think is best if you clean up your data a little first, for example:
# prepare data
df[['number', 'street']] = df.address.str.split('\s+', n=1, expand=True)
df['number'] = pd.to_numeric(df.number, errors='coerce')
The first line splits the address into number and street, the second converts the number into an actual integer, note that those values that are not integers will be converted to NaN
. Then you could do:
# create mask to filter
mask = df.number.ge(300) & df.street.str.contains("Broadway")
print(df[mask])
Basically create a boolean mask where number is greater than or equals to 300 and the street is Broadway. Putting all together, you have:
# prepare data
df[['number', 'street']] = df.address.str.split('\s+', n=1, expand=True)
df['number'] = pd.to_numeric(df.number, errors='coerce')
# create mask to filter
mask = df.number.eq(300) & df.street.str.contains("Broadway")
print(df[mask])
Output
address number street
0 300 Broadway 300.0 Broadway
Note that this solution assumes that your data has the pattern: Number Street
.
Upvotes: 1
Reputation: 42896
You can use str.contains
, str.extract
and ge
:
# rows which contain broadway
m1 = df['address'].str.contains('(?i)broadway')
# extract the numbers from the string and check if they are greater of equal to 300
m2 = df['address'].str.extract('(\d+)')[0].astype(float).ge(300)
# get all the rows which have True for both conditions
df[m1&m2]
Output
address
0 300 Broadway
Upvotes: 1