Reputation: 726
The following regular expressions, "O." and "O[ACTG]", yield different results when used with the DataFrame's str.match method, as demonstrated below. The former ("O.") works as expected, except that it cannot exclude the data in row 8 (the "OO" row) from being overwritten. The latter ("O[ACTG]") preserves the row 8 value as expected, but completely ignores the nested list structure of the replacement and fills the data column with individual values from the lists.
df = pd.DataFrame({"find":['TO', 'GO', 'CO', 'AO', 'OT', 'OG', 'OC', 'OA', 'OO'],
"data":[[6,7],[5,6],[3,4],[1,2],[2,4,6,8],[2,4,6,8],[1,3,5,7],[1,3,5,7],"C"]})
# These lists are obtained from applying a filter to
# the dataframe subsets given by (A) and (B) below to
# remove [1,4,6,7] from the nested lists.
new_values_len5 = [[2, 8], [2, 8], [3, 5], [3, 5], []]
new_values = [[2, 8], [2, 8], [3, 5], [3, 5]]
# I would like to apply the filter to only those rows
# in which the find column string is "O"-something
# (rather than something-"O") but not "OO".
# (A) this works...
df['data'][df.find.str.match("O.")] = new_values_len5
# find data
# 0 TO [6, 7]
# 1 GO [5, 6]
# 2 CO [3, 4]
# 3 AO [1, 2]
# 4 OT [2, 8]
# 5 OG [2, 8]
# 6 OC [3, 5]
# 7 OA [3, 5]
# 8 OO [] # I want to exclude "C" from being overwritten
# (B) but this doesn't
df['data'][df.find.str.match("O[ACTG]")] = new_values
# find data
# 0 TO [6, 7]
# 1 GO [5, 6]
# 2 CO [3, 4]
# 3 AO [1, 2]
# 4 OT 2 # !!nested structure in `new_values` is ignored.
# 5 OG 8
# 6 OC 2
# 7 OA 8
# 8 OO C # "C" is kept intact here but the nested list structure of `new_values` is destroyed.
I would like to know what causes this behavior and how to avoid it when applying regex to filter data in a dataframe.
Upvotes: 0
Views: 47
Reputation: 1513
df['data'][df.find.str.match("O(?!O)")] = df['data'][df.find.str.match("O(?!O)")].apply(lambda x: [i for i in x if i not in [1,4,6,7]])
Result:
find data
0 TO [6, 7]
1 GO [5, 6]
2 CO [3, 4]
3 AO [1, 2]
4 OT [2, 8]
5 OG [2, 8]
6 OC [3, 5]
7 OA [3, 5]
8 OO C
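A note on the cause, and one way to avoid the broadcasting problem in the question: the difference between (A) and (B) is most likely NumPy coercion rather than the regex itself. new_values_len5 contains lists of unequal length (the trailing []), so it is kept as a one-dimensional array of list objects and each matched row receives its list intact; new_values contains only equal-length lists, so it gets coerced into a 2-D numeric array and pandas assigns its scalar elements instead of the lists. The sketch below is illustrative only (the mask name and the Series wrapping are not part of the answer above, and exact assignment behavior can differ between pandas versions); it keeps the replacement lists intact by aligning them to the matched index as an object-dtype Series:
import pandas as pd

df = pd.DataFrame({"find": ['TO', 'GO', 'CO', 'AO', 'OT', 'OG', 'OC', 'OA', 'OO'],
                   "data": [[6, 7], [5, 6], [3, 4], [1, 2],
                            [2, 4, 6, 8], [2, 4, 6, 8], [1, 3, 5, 7], [1, 3, 5, 7], "C"]})
new_values = [[2, 8], [2, 8], [3, 5], [3, 5]]

mask = df["find"].str.match("O(?!O)")  # "O" not followed by another "O"

# Wrapping the replacement lists in an object-dtype Series aligned to the
# matched rows keeps NumPy from collapsing the equal-length lists into a
# 2-D array, so each matched row should receive its whole list while
# row 8 ("OO") is left untouched.
df.loc[mask, "data"] = pd.Series(new_values, index=df.index[mask], dtype="object")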
Upvotes: 1