brian

Reputation: 85

Julia: Subset data frame

I have two tab-delimited files: one with the data, and a second one with the names of the columns I am interested in. I want to subset the data frame so that it contains only those columns. Here is my code:

using CSV, DataFrames, DelimitedFiles

dat1 = DataFrame(CSV.File("data.txt"))
hdr = Symbol(readdlm("header.txt",'\t'))

which gives

julia> dat1
4×5 DataFrame
│ Row │ chr    │ pos   │ alt    │ ref    │ cadd    │
│     │ String │ Int64 │ String │ String │ Float64 │
├─────┼────────┼───────┼────────┼────────┼─────────┤
│ 1   │ chr1   │ 1234  │ A      │ T      │ 23.4    │
│ 2   │ chr2   │ 1234  │ C      │ G      │ 5.4     │
│ 3   │ chr2   │ 1234  │ G      │ C      │ 11.0    │
│ 4   │ chr5   │ 3216  │ A      │ T      │ 3.0     │

julia> hdr
Symbol("Any[\"pos\" \"alt\"]")

However, I get an error if I try to subset with:

julia> dat2 = dat1[ :, :hdr]

What would be the correct way to subset? Thanks!

Upvotes: 3

Views: 147

Answers (1)

Bogumił Kamiński

Reputation: 69819

Just do:

hdr = vec(readdlm("header.txt",'\t'))
dat2 = dat1[:, hdr]

or for the second step

dat2 = select(dat1, hdr)

The important point is that hdr must be a vector of strings (one per column name), not a single Symbol.
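To see why the Symbol(...) call in the question does not work, here is a small sketch that uses only DelimitedFiles. The temporary file is a hypothetical stand-in for header.txt, and the String.(...) conversion is an optional extra step for getting a concretely typed vector, not something the code above requires:

```julia
using DelimitedFiles

# Hypothetical stand-in for header.txt: a single tab-delimited line of names.
path = tempname()
write(path, "pos\talt\n")

raw = readdlm(path, '\t')   # a 1×2 matrix, one element per column name
bad = Symbol(raw)           # ONE Symbol built from the printed form of the whole matrix
hdr = String.(vec(raw))     # flatten to a vector and convert each element to String
```

So Symbol applied to the matrix yields a single symbol rather than one per column, while vec gives a one-dimensional collection of names that can be used as a column selector.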

You could also have written:

dat2 = select(dat1, readdlm("header.txt",'\t')...)

splatting the contents of the matrix (strings holding column names) as positional arguments.
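As a self-contained check, the sketch below rebuilds the data frame from the question in memory and compares the three variants. It assumes a reasonably recent DataFrames.jl in which strings are accepted as column selectors. The hdr vector is written out by hand here instead of being read from header.txt, and the names by_index, by_select and by_splat are only for illustration:

```julia
using DataFrames

# The example table from the question, typed in directly.
dat1 = DataFrame(
    chr  = ["chr1", "chr2", "chr2", "chr5"],
    pos  = [1234, 1234, 1234, 3216],
    alt  = ["A", "C", "G", "A"],
    ref  = ["T", "G", "C", "T"],
    cadd = [23.4, 5.4, 11.0, 3.0],
)

# Column names of interest (assumed contents of header.txt).
hdr = ["pos", "alt"]

by_index  = dat1[:, hdr]          # indexing with a vector of names
by_select = select(dat1, hdr)     # same selection via select
by_splat  = select(dat1, hdr...)  # names passed as separate positional arguments
```

All three should give a 4-row data frame with just the pos and alt columns, in that order.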

Upvotes: 4
