Reputation: 31
I'm working with patent data using pandas and numpy. Below are the steps I've taken so far and the data I've extracted from the raw data.
code
title = df['title'].tolist()
cpc = df_cpu['cpc'].tolist()
z = zip(title, cpc)
result
(('(real-time information transmission system)',
  'A61B-0005/0002, A61B-0005/0001, A61B-0005/0021'),
 ('(skincare counselling system)',
  'G06Q-0050/0010'),
 ('(apparatus for monitoring posture)',
  'A61B-0005/1116, A61B-0005/0002'),
 ...
)
It's basically a list (or tuple) of patent titles paired with their CPC codes, which identify the sub-technology each patent belongs to. I'd like to split (or rather reshape) this data so that each title is paired with exactly one CPC code, as written below. I suspect this is not a plain split but a reshape that follows a specific rule.
(('(real-time information transmission system)',
  'A61B-0005/0002'),
 ('(real-time information transmission system)',
  'A61B-0005/0001'),
 ('(real-time information transmission system)',
  'A61B-0005/0021'),
 ('(skincare counselling system)',
  'G06Q-0050/0010'),
 ('(apparatus for monitoring posture)',
  'A61B-0005/1116'),
 ('(apparatus for monitoring posture)',
  'A61B-0005/0002'),
 ...
)
I thought about counting the commas in each string and duplicating the title that many times, but I suspect there is an easier way, and I don't know how to implement that idea anyway.
Upvotes: 0
Views: 71
Reputation: 86
If I understood the end goal correctly, you want to use split() to split the CPC codes string, using ',' as the separator. This produces a list, which you can then iterate over to build a new list/tuple.
Here is a snippet that I think accomplishes what you want:
from pprint import pprint

z = (('(real-time information transmission system)', 'A61B-0005/0002, A61B-0005/0001, A61B-0005/0021'),
     ('(skincare counselling system)', 'G06Q-0050/0010'),
     ('(apparatus for monitoring posture)', 'A61B-0005/1116, A61B-0005/0002'))

new_z = []
for title, cpc_codes_str in z:
    # Break the comma-separated string into individual codes
    cpc_codes = cpc_codes_str.split(',')
    for code in cpc_codes:
        # Emit one (title, code) pair per code
        new_z.append((title, code))

pprint(tuple(new_z))
and this is what is printed:
(('(real-time information transmission system)', 'A61B-0005/0002'),
 ('(real-time information transmission system)', ' A61B-0005/0001'),
 ('(real-time information transmission system)', ' A61B-0005/0021'),
 ('(skincare counselling system)', 'G06Q-0050/0010'),
 ('(apparatus for monitoring posture)', 'A61B-0005/1116'),
 ('(apparatus for monitoring posture)', ' A61B-0005/0002'))
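One thing to note about the output above: because the separator is ',' alone, every code after the first keeps its leading space. Here is a minimal sketch that removes it with strip(), assuming whitespace around a code is never meaningful. It uses the same sample data, written as a nested generator expression instead of an explicit loop:

```python
z = (
    ('(real-time information transmission system)',
     'A61B-0005/0002, A61B-0005/0001, A61B-0005/0021'),
    ('(skincare counselling system)', 'G06Q-0050/0010'),
    ('(apparatus for monitoring posture)', 'A61B-0005/1116, A61B-0005/0002'),
)

# One (title, code) pair per code; strip() drops the space left after each comma
new_z = tuple(
    (title, code.strip())
    for title, cpc_codes_str in z
    for code in cpc_codes_str.split(',')
)
```

Splitting on ', ' (comma plus space) would also work for this sample, but strip() should be more forgiving if the spacing in the raw data is inconsistent.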
Hope this helps.
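Since the data already lives in pandas, the same reshape can probably be done without converting to lists first. The sketch below is only an illustration under some assumptions: that title and cpc sit in the same DataFrame (the question pulls them from df and df_cpu, so this may need adjusting), and that pandas 0.25 or later is installed, since it relies on DataFrame.explode. The names long_df and pairs are made up for the example.

```python
import pandas as pd

# Sample data; assumes both columns live in one DataFrame
df = pd.DataFrame({
    'title': [
        '(real-time information transmission system)',
        '(skincare counselling system)',
        '(apparatus for monitoring posture)',
    ],
    'cpc': [
        'A61B-0005/0002, A61B-0005/0001, A61B-0005/0021',
        'G06Q-0050/0010',
        'A61B-0005/1116, A61B-0005/0002',
    ],
})

# Turn each cpc string into a list, then give every list element its own row
long_df = (
    df.assign(cpc=df['cpc'].str.split(','))
      .explode('cpc')
      .reset_index(drop=True)
)
# Remove the space left over after each comma
long_df['cpc'] = long_df['cpc'].str.strip()

# Optional: back to a list of (title, code) tuples
pairs = list(long_df.itertuples(index=False, name=None))
```

Keeping the result as a DataFrame (long_df) is likely more convenient if the next step is grouping or counting by CPC code.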
Upvotes: 1