Rechlay

Reputation: 1517

Making a list/vector from one column across different CSV files

I have already loaded 20 CSV files with this code:

tbl = list.files(pattern="*.csv")
for (i in 1:length(tbl)) assign(tbl[i], read.csv(tbl[i]))

This is what it looks like:

> head(tbl)
[1] "F1.csv"          "F10_noS3.csv"    "F11.csv"         "F12.csv"         "F12_noS7_S8.csv"
[6] "F13.csv"

Every one of those CSV files has a column called "Accession". I would like to collect all the names in those columns, from every file, into one big list.

Two problems:

Let me show you how it looks:

AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--

<-- marks the same sample under different names. These should be treated as one, so the dot and the number after it should be ignored.

Is this possible?

I couldn't include the output of dput(head) because even that is too large a data set.

Upvotes: 1

Views: 93

Answers (2)

Richie Cotton

Reputation: 121077

The first trick: you can read all the tables into a list of data frames using lapply. This is easier to work with than 20 individual data frames.

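# list.files() expects a regular expression, not a shell glob
tbl = list.files(pattern="\\.csv$")
# read every file into one list of data frames
list_of_data = lapply(tbl, read.csv)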

The second trick: you can recombine this list into a single data frame using do.call in conjunction with rbind.

all_data = do.call(rbind, list_of_data)
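One caveat: rbind on data frames fails when the files do not all share the same columns, and file names such as F10_noS3.csv hint that this might be the case here. If only the Accession column is needed, a minimal sketch is to pull just that column out of each file and flatten the result. The demo directory, the file contents, and the names accession_list and all_accessions are made up for illustration.

```r
# Hypothetical demo files, written to a temporary directory so the sketch runs anywhere.
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
write.csv(data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"), S1 = c(1, 2)),
          file.path(demo_dir, "F1.csv"), row.names = FALSE)
# The second file has an extra column, so rbind on the full data frames would fail.
write.csv(data.frame(Accession = c("AT3G26450.2", "AT4G24770.1"),
                     S1 = c(3, 4), S2 = c(5, 6)),
          file.path(demo_dir, "F2.csv"), row.names = FALSE)

# "\\.csv$" is a regular expression matching file names that end in ".csv".
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file, but keep only its Accession column as a character vector.
accession_list <- lapply(files, function(f) {
  as.character(read.csv(f, stringsAsFactors = FALSE)$Accession)
})

# Flatten the list of vectors into one character vector.
all_accessions <- unlist(accession_list)
print(all_accessions)
```

With real data, replace demo_dir with the folder that holds the 20 files.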

You can select the contents of the Accession field before the dot using regular expressions. The stringr package is useful here. ^ represents the start of the string, [[:alnum:]] represents a letter or number (an alphanumeric character), and + means one or more.

library(stringr)
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")

Finally, you can remove duplicates by subsetting on non-duplicated values.

all_data = subset(all_data, !duplicated(CleanedAccession))
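To see the cleaning and deduplication steps on something small, here is a sketch that runs them on the eleven accessions from the question. It uses base R's sub instead of stringr, which is a swapped-in alternative rather than the code above, and all_data_demo is a stand-in name for the combined data frame.

```r
# The sample accessions posted in the question.
accession <- c("AT3G26450.1", "AT5G44520.2", "AT4G24770.1", "AT2G37220.2",
               "AT3G02520.1", "AT5G05270.1", "AT1G32060.1", "AT3G52380.1",
               "AT2G43910.2", "AT2G19760.1", "AT3G26450.2")

# Stand-in for the combined data frame (hypothetical name).
all_data_demo <- data.frame(Accession = accession, stringsAsFactors = FALSE)

# Remove the first dot and everything after it.
all_data_demo$CleanedAccession <- sub("\\..*$", "", all_data_demo$Accession)

# Keep only the first row for each cleaned accession.
deduped <- all_data_demo[!duplicated(all_data_demo$CleanedAccession), ]

print(nrow(deduped))
print(deduped$CleanedAccession)
```

Because duplicated flags later occurrences, the row for AT3G26450.1 is kept and the row for AT3G26450.2 is dropped.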

Upvotes: 3

jlhoward

Reputation: 59365

If you just need the list of names, and if they're all formatted as in your example, then using @Richie's all_data:

names <- unique(substr(all_data$Accession, 1, 9))

does it without regular expressions.
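The substr approach relies on every identifier being exactly nine characters long before the dot. A quick way to check that assumption before trusting the result might look like this; the three-element vector is just a subset of the question's sample, and dot_pos, fixed_width_ok and gene_ids are illustrative names.

```r
# A few of the accessions from the question.
accession <- c("AT3G26450.1", "AT5G44520.2", "AT3G26450.2")

# Position of the first literal dot in each string (fixed = TRUE disables regex).
dot_pos <- as.integer(regexpr(".", accession, fixed = TRUE))

# TRUE only if every dot sits at position 10, i.e. nine characters precede it.
fixed_width_ok <- all(dot_pos == 10)
print(fixed_width_ok)

# If the check passes, taking the first nine characters is safe.
gene_ids <- unique(substr(accession, 1, 9))
print(gene_ids)
```

If the check returns FALSE for the real data, the regular-expression approach is the safer choice.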

Upvotes: 0
