dukes00
dukes00

Reputation: 57

How to replace substring in DataFrame column by variable pattern in Julia?

Suppose I have a DataFrame with two columns - gibberish and letter. I want to replace substrings in gibberish so that only the ones matching letter remain, e.g. If gibberish is "kjkkj" and the letter is "j" I want gibberish to equal "jj".

The DataFrame is defined as:

df = DataFrame(gibberish = ["dqzzzjbzz", "jjjvjmjjkjjjjjjj", "mmbmmlvmbmmgmmf"], letter = ["z", "j", "m"])

If I had no letter variable and wanted only, let's say "x" to remain I would do:

df.gibberish.= replace.(gibberish, r"[^x;]" => "")

and that works fine, but when I try doing the same, but putting in letter column as a variable in the regex expression, it just breaks. I tried doing that the "normal" DataFrames.jl way and with DataFramesMeta.jl shortcut @transform:

df.gibberish.= replace.(gibberish, Regex(join(["[^", letter, ";]"])) => "")

which results in an error of

ERROR: UndefVarError: letter not defined

while the @transform way just doesn't do anything:

julia> @transform(df, filtered = replace(:gibberish, Regex.(join(["[^", :letter, ";]"])) => ""))
3×3 DataFrame
│ Row  │ letter │ gibberish         │ filtered          │
│      │ String │ String            │ String            │
├──────┼────────┼───────────────────┼───────────────────┤
│ 1    │ z      │ dqzzzjbzz         │ dqzzzjbzz         │
│ 2    │ j      │ jjjvjmjjkjjjjjjj  │ jjjvjmjjkjjjjjjj  │
│ 3    │ m      │ mmbmmlvmbmmgmmf   │ mmbmmlvmbmmgmmf   │

I'm a very fresh beginner in Julia and I'm probably missing something very basic, but the proper solution just escapes me. How do I solve this problem, other than writing a rowwise loop which would be horribly inefficient?

Upvotes: 1

Views: 387

Answers (1)

lungben
lungben

Reputation: 1158

replace.(gibberish, Regex(join(["[^", letter, ";]"]))

letter refers here to a Julia variable (which is not defined), not to a column of the DataFrame.

You could try something like

Regex.(string.("[^" .* df.letter .* ";]"))

to construct an array of Regexes using a DataFrame row as input.

Upvotes: 1

Related Questions