Betafish
Betafish

Reputation: 1262

Getting a value out of pandas dataframe based on a set of conditions

I have a dataframe as shown below

     Token          Label  StartID  EndID  
0   Germany         Country   0      2     
1   Berlin          Capital   6      9       
2   Frankfurt       City      15     18        
3   four   million  Number    21     24        
4   Sweden          Country   26     27       
5   United Kingdom  Country   32     34     
6   ten million     Number    40     45
7   London          Capital   50     55

I am trying to get row based on certain condition, i.e. associate the label Number to closest capital i.e. Berlin

3   four   million  Number    21     24   - > 1   Berlin          Capital   6      9

or something like:

df[row3] -> df [row1]

A pseudo logic

First check, for the rows with label: Number then (assumption is that the city is always '2 rows' above or below) and has the label: Capital. But, label: 'capital' loc is always after the label: Country

What I have done until now,

columnsName =['Token', 'Label', 'StartID', 'EndID']
df = pd.read_csv('resources/testcsv.csv', index_col= 0, skip_blank_lines=True, header=0)
print(df)

key_number = 'Number'
df_with_number = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_number), regex=True, case=False)])
print(df_with_number)

key_capital = 'Capital'
df_with_capitals = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_capital), regex=True, case=False)])
print(df_with_capitals)

key_country = 'Country'
df_with_country = (df[df[1].str.lower().str.contains(r"\b{}\b".format(key_country), regex=True, case=False)])
print(df_with_country)

The logic is to compare the index's and then make possible relations

i.e.

df[row3] -> [ df [row1], df[row7]]

Upvotes: 0

Views: 42

Answers (2)

It_is_Chris
It_is_Chris

Reputation: 14113

# adjusted sample data
s = """Token,Label,StartID,EndID  
Germany,Country,0,2     
Berlin,Capital,6,9       
Frankfurt,City,15,18        
four million,Number,21,24        
Sweden,Country,26,27       
United Kingdom,Country,32,34
ten million,Number,40,45
London,Capital,50,55
ten million,Number,40,45
ten million,Number,40,45"""
df = pd.read_csv(StringIO(s))

# create a mask for number where capital is 2 above or below
# and where country is three above number or one below number
mask = (df['Label'] == 'Number') & (((df['Label'].shift(2) == 'Capital') | 
                                     (df['Label'].shift(-2) == 'Capital')) & 
                                    (df['Label'].shift(3) == 'Country') | 
                                    (df['Label'].shift(-1) == 'Country'))

# create a mask for capital where number is 2 above or below
# and where country is one above capital
mask2 = (df['Label'] == 'Capital') & (((df['Label'].shift(2) == 'Number') | 
                                       (df['Label'].shift(-2) == 'Number')) & 
                                      (df['Label'].shift(1) == 'Country'))

# hstack your two masks and create a frame
new_df = pd.DataFrame(np.hstack([df[mask].to_numpy(), df[mask2].to_numpy()]))
print(new_df)

              0       1   2   3       4        5  6  7
0  four million  Number  21  24  Berlin  Capital  6  9

Upvotes: 1

Ben.T
Ben.T

Reputation: 29635

you could use merge_asof with the parameter direction=nearest for example:

df_nb_cap = pd.merge_asof(df_with_number.reset_index(), 
                          df_with_capitals.reset_index(), 
                          on='index', 
                          suffixes=('_nb', '_cap'), direction='nearest')
print (df_nb_cap)
   index      Token_nb Label_nb  StartID_nb  EndID_nb Token_cap Label_cap  \
0      3  four_million   Number          21        24    Berlin   Capital   
1      6   ten_million   Number          40        45    London   Capital   

   StartID_cap  EndID_cap  
0            6          9  
1           50         55 

Upvotes: 1

Related Questions