Reputation: 1517
I have already loaded 20 CSV files with this code:
tbl = list.files(pattern="*.csv")
for (i in 1:length(tbl)) assign(tbl[i], read.csv(tbl[i]))
This is how it looks:
> head(tbl)
[1] "F1.csv" "F10_noS3.csv" "F11.csv" "F12.csv" "F12_noS7_S8.csv"
[6] "F13.csv"
Each of those CSV files has a column called "Accession". I would like to make one big list of all the names in that column, across every file.
Two problems. Let me show you how the data looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
= Same sample, different names; these should be treated as one. So just ignore the dot and the number after it.
Is it possible to do this?
I couldn't include a dput(head(...)) because the data set is too big.
Upvotes: 1
Views: 93
Reputation: 121077
The first trick: you can read all the tables into a list of data frames using lapply. This is easier to work with than 20 individual data frames.
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
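To make the idea concrete, here is a minimal self-contained sketch of that step. It writes two tiny CSV files (with a made-up `Accession` column) into a fresh temporary directory, then reads them back into a named list; the file names and values are invented for illustration only.

```r
# Sketch: build a named list of data frames from a directory of CSVs.
dir <- file.path(tempdir(), "csv_demo")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(Accession = c("AT3G26450.1", "AT5G44520.2")),
          file.path(dir, "F1.csv"), row.names = FALSE)
write.csv(data.frame(Accession = c("AT3G26450.2", "AT4G24770.1")),
          file.path(dir, "F2.csv"), row.names = FALSE)

tbl <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
list_of_data <- lapply(tbl, read.csv)
names(list_of_data) <- basename(tbl)  # remember which file each frame came from
```

Naming the list elements after the files is optional, but it keeps the file-of-origin information that the original `assign` loop was providing.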
The second trick: you can recombine this list into a single data frame using do.call in conjunction with rbind.
all_data = do.call(rbind, list_of_data)
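A quick sketch of what that combination does, using two small hand-made data frames (the values are illustrative). Note that rbind requires every frame in the list to have the same column names.

```r
# Sketch: do.call(rbind, ...) stacks a list of data frames row-wise.
list_of_data <- list(
  data.frame(Accession = c("AT3G26450.1", "AT5G44520.2")),
  data.frame(Accession = c("AT3G26450.2", "AT4G24770.1"))
)
all_data <- do.call(rbind, list_of_data)
nrow(all_data)  # 4 rows: two from each frame
```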
You can select the contents of the Accession field before the dot using regular expressions. The stringr package is useful here. In the pattern below, ^ represents the start of the string, [[:alnum:]] represents a letter or number (an alphanumeric character), and + means one or more.
library(stringr)
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
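If you prefer to avoid the stringr dependency, a base-R alternative is to strip the suffix instead of extracting the prefix. This sketch assumes the version suffix is always a dot followed by digits at the end of the string:

```r
# Base-R alternative: remove a trailing ".<digits>" version suffix.
accessions <- c("AT3G26450.1", "AT3G26450.2", "AT5G44520.2")
cleaned <- sub("\\.\\d+$", "", accessions)
cleaned  # "AT3G26450" "AT3G26450" "AT5G44520"
```

Both approaches give the same result on identifiers shaped like the ones in the question.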
Finally, you can remove duplicates by subsetting on non-duplicated values.
all_data = subset(all_data, !duplicated(CleanedAccession))
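Since the question only asks for "one big list" of names, you can also skip the subsetting and take the unique cleaned values directly. A small sketch with made-up data:

```r
# Sketch: unique() on the cleaned column gives the final list of names.
all_data <- data.frame(
  CleanedAccession = c("AT3G26450", "AT5G44520", "AT3G26450")
)
unique_names <- unique(all_data$CleanedAccession)
unique_names  # "AT3G26450" "AT5G44520"
```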
Upvotes: 3
Reputation: 59365
If you just need the list of names, and if they're all formatted as in your example, then using @Richie's all_data:
names <- unique(substr(all_data$Accession, 1, 9))
does it without regular expressions.
Upvotes: 0