Reputation: 55
I have an IOB-tagged corpus such as:
mention Tag
170
171 467 O
172
173 Vincennes B-LOCATION
174 . O
175
176 Confirmation O
177 des O
178 privilèges O
179 de O
180 la O
181 ville B-ORGANISATION
182 de I-ORGANISATION
183 Tournai I-ORGANISATION
184 1 O
185 ( O
186 cf O
187 . O
188 infra O
189 , O
I am trying to compute simple statistics, such as the total number of annotated mentions and the totals per label.
After loading my dataset with pandas, I got this:
df = pd.Series(data['Tag'].value_counts(), name="Total").to_frame().reset_index()
df.columns = ['Label', 'Total']
df
Output:
Label Total
0 O 438528
1 36235
2 B-LOCATION 378
3 I-LOCATION 259
4 I-PERSON 234
5 I-INSTALLATION 156
6 I-ORGANISATION 150
7 B-PERSON 144
8 B-TITLE 94
9 I-TITLE 89
10 B-ORGANISATION 68
11 B-INSTALLATION 62
12 I-EVENT 8
13 B-EVENT 2
First, how can I get a representation similar to the one above, but with the IOB prefixes merged into a single label, for example:
Label, Total
PERSON, 300
LOCATION, 154
ORGANISATION, 67
etc.
Second, how can I exclude the "O" and empty-string labels from my output? I tried .mask() and .where() on my Series, but neither gave the result I wanted.
Thank you for your leads.
Upvotes: 0
Views: 115
Reputation: 765
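Remove the B- and I- prefixes, then group by the label and sum the totals (the column names are Label and Total, as in your df):
# str[2:] drops the first two characters, so "O" becomes an empty string as well
df['Label'] = df['Label'].str[2:]
df.groupby('Label')['Total'].sum()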
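As a minimal sketch of the same idea, the prefix can also be removed with a regular expression, which only touches labels that really start with B- or I- and leaves "O" and the empty label as they are. The toy counts and the Entity column name below are assumptions made for illustration, not values from your corpus.

```python
import pandas as pd

# Toy counts shaped like the question's output (made-up values).
df = pd.DataFrame({
    "Label": ["O", "", "B-LOCATION", "I-LOCATION", "B-PERSON", "I-PERSON"],
    "Total": [10, 4, 3, 2, 5, 1],
})

# Strip a leading "B-" or "I-" only when it is present.
# "Entity" is a hypothetical column name used for this example.
df["Entity"] = df["Label"].str.replace(r"^[BI]-", "", regex=True)

# Sum the token counts of the B- and I- rows for each entity type.
grouped = df.groupby("Entity", as_index=False)["Total"].sum()
print(grouped)
```

Note that this sums tagged tokens. If the corpus follows the usual convention where every mention starts with a B- tag (as your sample suggests), the B- rows alone are probably closer to the number of mentions.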
For the second part, keep only the rows whose label is longer than two characters, which drops both "O" and the empty label:
df.loc[df['Label'].str.len() > 2]
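A likely reason .mask() and .where() looked like they failed is that both keep the Series at its original length and put NaN in the masked positions instead of dropping rows. The sketch below contrasts that with boolean indexing; it assumes the blank label is a true empty string (if it is whitespace, calling .str.strip() first would be needed), and the sample tags are made up.

```python
import pandas as pd

# Small stand-in for data['Tag'] (made-up values).
tags = pd.Series(["O", "", "B-LOCATION", "O", "B-PERSON", "I-PERSON"], name="Tag")

# True for the labels we want to exclude.
unwanted = tags.isin(["O", ""])

# .where() keeps the original length and inserts NaN where the condition is False.
masked = tags.where(~unwanted)

# Boolean indexing actually removes the unwanted rows.
kept = tags[~unwanted]

print(kept.value_counts())
```

Filtering the Tag column before calling value_counts() means the "O" and empty rows never reach the summary table.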
Upvotes: 1