Sati

Reputation: 726

df.str.match with regex produces unexpected results

The regular expressions "O." and "O[ACTG]" give different results when used with the pandas `str.match` method on a dataframe column, as demonstrated below.

The former ("O.") keeps the nested lists intact, but it also matches "OO", so the value in row 8 is overwritten when it should be left alone.

The latter ("O[ACTG]") leaves the row 8 value untouched, as expected, but it ignores the nested list structure and fills the data column with individual values taken from the lists.

import pandas as pd

df = pd.DataFrame({"find":['TO', 'GO', 'CO', 'AO', 'OT', 'OG', 'OC', 'OA', 'OO'],
                   "data":[[6,7],[5,6],[3,4],[1,2],[2,4,6,8],[2,4,6,8],[1,3,5,7],[1,3,5,7],"C"]})

# These lists are obtained from applying a filter to 
# the dataframe subsets given by (A) and (B) below to 
# remove [1,4,6,7] from the nested lists.
new_values_len5 = [[2, 8], [2, 8], [3, 5], [3, 5],[]]
new_values = [[2, 8], [2, 8], [3, 5], [3, 5]]


# I would like to apply the filter to only those rows 
# in which the find column string is "O"-something 
# (rather than something-"O") but not "OO".

# (A) this works...
df['data'][df.find.str.match("O.")] = new_values_len5

#   find    data
# 0   TO  [6, 7]
# 1   GO  [5, 6]
# 2   CO  [3, 4]
# 3   AO  [1, 2]
# 4   OT  [2, 8]
# 5   OG  [2, 8]
# 6   OC  [3, 5]
# 7   OA  [3, 5]
# 8   OO      []  # I want to exclude "C" from being overwritten


# (B) but this doesn't
df['data'][df.find.str.match("O[ACTG]")] = new_values

#   find    data
# 0   TO  [6, 7]
# 1   GO  [5, 6]
# 2   CO  [3, 4]
# 3   AO  [1, 2]
# 4   OT       2  # !!nested structure in `new_values` is ignored.
# 5   OG       8
# 6   OC       2
# 7   OA       8
# 8   OO       C  # "C" is kept intact here but the nested list structure of `new_values` is destroyed. 

I would like to know what causes this behavior and how to avoid it when applying regex to filter data in a dataframe.
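One way to narrow this down is to compare the two boolean masks directly and to look at the shape of the values being assigned. The sketch below is only a diagnostic; `mask_dot`, `mask_class`, `differing_rows` and `new_values_shape` are names introduced here for illustration. It suggests that both patterns select rows as intended, and that the difference in (B) likely comes from `new_values` being a rectangular nested list, which NumPy can read as a 4x2 array, whereas the ragged `new_values_len5` cannot be read that way.

```python
import numpy as np
import pandas as pd

find = pd.Series(['TO', 'GO', 'CO', 'AO', 'OT', 'OG', 'OC', 'OA', 'OO'])

# Hypothetical names for the boolean masks used in (A) and (B).
mask_dot = find.str.match("O.")
mask_class = find.str.match("O[ACTG]")

# Rows where the two masks disagree.
differing_rows = find[mask_dot != mask_class].tolist()

# new_values has 4 lists of equal length, so NumPy sees a 2-D array.
new_values = [[2, 8], [2, 8], [3, 5], [3, 5]]
new_values_shape = np.array(new_values).shape

print(differing_rows)    # ['OO']
print(new_values_shape)  # (4, 2)
```

The first four elements of that 4x2 array, read row by row, are 2, 8, 2, 8, which matches the values that ended up in rows 4 to 7 in (B).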

Upvotes: 0

Views: 47

Answers (1)

Drakax

Reputation: 1513

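Use a negative lookahead so that only an "O" that is not followed by another "O" matches, and compute the filtered lists from the selected rows instead of assigning a hard-coded nested list:

df['data'][df.find.str.match("O(?!O)")] = df['data'][df.find.str.match("O(?!O)")].apply(lambda x: [i for i in x if i not in [1,4,6,7]])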

Result:

  find  data
0   TO  [6, 7]
1   GO  [5, 6]
2   CO  [3, 4]
3   AO  [1, 2]
4   OT  [2, 8]
5   OG  [2, 8]
6   OC  [3, 5]
7   OA  [3, 5]
8   OO  C
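The same filter can also be written with a single `.loc` assignment. This is a minimal sketch rather than the only way to do it; `unwanted`, `mask` and `filtered` are names introduced here. It avoids the chained indexing `df['data'][...] = ...`, which can trigger a warning or fail to update `df` in recent pandas versions. Because the right-hand side is a Series, pandas aligns it by index, so each list is kept as a single cell value.

```python
import pandas as pd

df = pd.DataFrame({
    "find": ['TO', 'GO', 'CO', 'AO', 'OT', 'OG', 'OC', 'OA', 'OO'],
    "data": [[6, 7], [5, 6], [3, 4], [1, 2], [2, 4, 6, 8], [2, 4, 6, 8],
             [1, 3, 5, 7], [1, 3, 5, 7], "C"],
})

# Values to drop from each nested list (taken from the question).
unwanted = {1, 4, 6, 7}

# "O" at the start of the string, not followed by another "O".
mask = df["find"].str.match("O(?!O)")

# Build the filtered lists only for the selected rows; the result is a
# Series that keeps the original row labels.
filtered = df.loc[mask, "data"].apply(
    lambda values: [v for v in values if v not in unwanted]
)

# Single-step assignment, aligned on the index.
df.loc[mask, "data"] = filtered

print(df)
```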

Upvotes: 1
