Reputation: 85
I have two tab-delimited files: one with the data, and a second one with the names of the columns that I am interested in. I want to subset the data frame so that it only has my columns of interest. Here is my code:
dat1 = DataFrame(CSV.File("data.txt"))
hdr = Symbol(readdlm("header.txt",'\t'))
which gives
julia> dat1
4×5 DataFrame
│ Row │ chr │ pos │ alt │ ref │ cadd │
│ │ String │ Int64 │ String │ String │ Float64 │
├─────┼────────┼───────┼────────┼────────┼─────────┤
│ 1 │ chr1 │ 1234 │ A │ T │ 23.4 │
│ 2 │ chr2 │ 1234 │ C │ G │ 5.4 │
│ 3 │ chr2 │ 1234 │ G │ C │ 11.0 │
│ 4 │ chr5 │ 3216 │ A │ T │ 3.0 │
julia> hdr
Symbol("Any[\"pos\" \"alt\"]")
However, I get an error if I try to subset with:
julia> dat2 = dat1[ :, :hdr]
What would be the correct way to subset? Thanks!
Upvotes: 3
Views: 147
Reputation: 69819
Just do:
hdr = vec(readdlm("header.txt",'\t'))
dat2 = dat1[:, hdr]
or, for the second step:
dat2 = select(dat1, hdr)
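What is important here is that hdr should be a vector of strings. In your code, Symbol(...) turns the whole matrix into a single symbol, and :hdr is the literal symbol hdr, not the variable.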
You could also have written:
dat2 = select(dat1, readdlm("header.txt",'\t')...)
splatting the contents of the matrix (strings holding column names) as positional arguments.
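For reference, here is a minimal self-contained sketch of the vector-of-strings approach. It builds the data frame inline and reads the header from an IOBuffer instead of the real files, so those stand-ins are assumptions for illustration. The String.(...) conversion is also an optional extra step, not something the code above requires. It assumes a DataFrames.jl version that accepts strings as column names.

```julia
using DataFrames, DelimitedFiles

# Stand-in for data.txt (values copied from the question).
dat1 = DataFrame(chr = ["chr1", "chr2", "chr2", "chr5"],
                 pos = [1234, 1234, 1234, 3216],
                 alt = ["A", "C", "G", "A"],
                 ref = ["T", "G", "C", "T"],
                 cadd = [23.4, 5.4, 11.0, 3.0])

# Stand-in for header.txt: one tab-delimited line of column names.
raw = readdlm(IOBuffer("pos\talt\n"), '\t')   # 1×2 matrix of names

# Flatten the matrix and convert each element to a plain String.
hdr = String.(vec(raw))

dat2 = dat1[:, hdr]        # index with a vector of column names
dat3 = select(dat1, hdr)   # same result via select
```

Both dat2 and dat3 should contain only the pos and alt columns, in the order given in the header.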
Upvotes: 4