Reputation: 149
I am working with fairly messy data: a tariff table with the following form:
import pandas as pd
import numpy as np
data1 = np.array([u'Free (A, B, KR, FR), 5% (JP)', u'Free (A, B, FR), 5% (JP, KR))'])
data2 = np.array(['10101010', '10101020'])
data = {'hscode': data2, 'tariff' : data1}
df = pd.DataFrame(data, columns=['hscode', 'tariff'])
The first row shows that the tariff is zero for countries (A, B, KR, FR) and 5% for JP, and the second row shows that it is zero for A, B, FR while 5 % for JP and KR.
I want to find the tariff rate of country 'KR' for each row, so that I could have the following table:
'hscode' 'tariff'
10101010 0%
10101020 5%
So, I want to find the tariff rate for the county code 'KR' in each cell.
Upvotes: 4
Views: 687
Reputation: 20553
You may use apply with regex:
## -- End pasted text --
In [133]: import re
In [134]: df
Out[134]:
hscode tariff
0 10101010 Free (A, B, KR, FR), 5% (JP)
1 10101020 Free (A, B, FR), 5% (JP, KR))
In [135]: df['tariff'].apply(lambda x: ''.join(re.findall(r'.*(Free|\d+%).*\bKR\b', x)))
Out[135]:
0 Free
1 5%
Name: tariff, dtype: object
Explain: within tariff, capture either "Free" or "x%" if string contains "KR".
You may create a function to dynamically set "KR" as a lookup variable.
Upvotes: 2
Reputation: 2414
import pandas as pd
import numpy as np
data1 = np.array([u'Free (A, B, KR, FR), 5% (JP)', u'Free (A, B, FR), 5% (JP, KR))'])
data2 = np.array(['10101010', '10101020'])
df = []
for i, element in enumerate(data1):
free, five = element.lstrip('Free (').rstrip(')').split('), 5% (')
for country in free.split(', '):
row = [data2[i], 'Free', country]
df.append(row)
for country in five.split(', '):
row = [data2[i], '5%', country]
df.append(row)
df = pd.DataFrame(df, columns = ['hscode', 'tariff', 'country'])
print df.query('country == "KR"')
gives
hscode tariff country
2 10101010 Free KR
9 10101020 5% KR
Upvotes: 0