Reputation: 1323
I have a dataframe
and the column
of interest in quite messy. I have tried to perform some matching strings
but it's not working.
The data comes from the website so the column is quite message. I am interested in a list structure of the numbers. Here is what the frame looks like:
169508 [3]
169509 [3, 4]
169510 [3]
169511 [3]
169512 [3]
169513 [6]
169514 [3, 4]
169515 [3, 4]
169516 [3, 4]
169517 [6]
169518 [6) AND ORD(MID((SELECT IFNULL(CAST(COUNT(DIST...
169519 [6) AND ORD(MID((SELECT IFNULL(CAST(COUNT(DIST...
169520 [6) AND ORD(MID((SELECT IFNULL(CAST(COUNT(DIST...
169521 [6) AND ORD(MID((SELECT IFNULL(CAST(COUNT(DIST...
169522 [6) AND 2831=IF((ORD(MID((SELECT IFNULL(CAST(C...
169523 [6) AND 2831=IF((ORD(MID((SELECT IFNULL(CAST(C...
169524 [6) AND 2831=IF((ORD(MID((SELECT IFNULL(CAST(C...
169525 [6]
169526 [6) AND 2831=IF((ORD(MID((SELECT IFNULL(CAST(C...
169527 [3]
169528 [2]
169529 [3, 4]
169530 [3]
169531 [6]
169532 [3, 4]
Here is the kind of ouput I am expecting:
type_prod
0 ['3']
1 ['6']
2 ['3','4']
3 ['3','4','6']
4 ['3','4']
5 ['6']
6 ['6']
7 ['5']
8 ['5']
9 ['3', '4', '1']
10 ['6', '2', '5', '1']
11 ['3', '4']
12 ['3', '4']
13 ['6', '2', '5', '1']
14 ['3', '4']
15 ['3', '4']
16 ['3', '4']
17 ['3', '4']
18 ['3', '4']
19 ['3', '4']
20 ['3', '4']
21 ['3', '4']
22 ['3', '4']
23 ['3', '4']
24 ['6']
25 ['6']
26 ['6']
Upvotes: 1
Views: 73
Reputation: 14103
you can use regex and apply but it is hard to tell when you only want one number returned in index 24 through 26. I am assuming that if the string contains 'AND' then just return the first number, which based on the example is character two.
import re
messydf['type_prod'].apply(lambda x: list(re.sub("[^0-9]", "", x)) if 'AND' not in x else list(x[2]))
from io import StringIO
import pandas as pd
import ast
import re
# prepare sample data
s = """index;type_prod
169508;[3]
169509;[3, 4]
169510;[3]
169511;[3]
169512;[3]
169513;[6]
169514;[3, 4]
169515;[3, 4]
169516;[3, 4]
169517;[6]
169518;['6) AND ORD(MID((SELECT IFNULL(CAST(COUNT(DIST...']
169519;['6) AND ORD(MID((SELECT IFNULL(CAST(COUNT(DIST...']
169520;['6) AND ORD(MID((SELECT IFNULL(CAST(COUNT(DIST...']
169521;['6) AND ORD(MID((SELECT IFNULL(CAST(COUNT(DIST...']
169522;['6) AND 2831=IF((ORD(MID((SELECT IFNULL(CAST(C...']
169523;['6) AND 2831=IF((ORD(MID((SELECT IFNULL(CAST(C...']
169524;['6) AND 2831=IF((ORD(MID((SELECT IFNULL(CAST(C...']
169525;[6]
169526;['6) AND 2831=IF((ORD(MID((SELECT IFNULL(CAST(C...']
169527;[3]
169528;[2]
169529;[3, 4]
169530;[3]
169531;[6]
169532;[3, 4]"""
messydf = pd.read_csv(StringIO(s), sep=';')
messydf = messydf.set_index('index')
messydf['type_prod'] = messydf['type_prod'].apply(ast.literal_eval)
# lambda function with re
messydf['type_prod'] = messydf['type_prod'].apply(lambda x: list(re.sub("[^0-9]", "", str(x)))\
if 'AND' not in str(x) else list(str(x)[2]))
print(messydf)
type_prod
index
169508 [3]
169509 [3, 4]
169510 [3]
169511 [3]
169512 [3]
169513 [6]
169514 [3, 4]
169515 [3, 4]
169516 [3, 4]
169517 [6]
169518 [6]
169519 [6]
169520 [6]
169521 [6]
169522 [6]
169523 [6]
169524 [6]
169525 [6]
169526 [6]
169527 [3]
169528 [2]
169529 [3, 4]
169530 [3]
169531 [6]
169532 [3, 4]
Upvotes: 1