Reputation: 113
I have the data in this format - Source Data
and I want to rearrange my data in the following format -
UNIQUE-ID TYPES COMMON-NAME CHEMICAL-FORMULA DBLINKS MOLECULAR-WEIGHT MONOISOTOPIC-MW
PYRIDINE-RING Rings pyridine-ring (C 5)(H 5)(N 1) NA 79.101 NA
STEARIC_ACID Even-St.. stearate (C 18)(H 35)(O 2) (BIGG "37799")..;(CHEMSPIDER "..); 283.473 284.2715304
BETA-HYDROX.. Compounds 3-β-hydro.. (C 19)(H 29)(O... (CHEBI "1724"..);(PUBCHEM "2520..); 369.495 370.1813984
OXALACETIC_A.. Compounds ... .... .... .... ...
.... .... ...
.... .... ...
So I want to arrange the data based on "UNIQUE-ID", "TYPES", "COMMON-NAME", "CHEMICAL-FORMULA", "MOLECULAR-WEIGHT" and "MONOISOTOPIC-MW". My code should consider the rows that are associated with the same id value as one single row and rearrange that information in a new dataframe - for example, every row that has id "1" in the original csv file should be rewritten in one row in the format I mentioned above.
Now the problem with the data is that for some of the "UNIQUE-ID" there are several "TYPES" (2, 3 or in some cases 5) and I want them to be separated by a semicolon(;) in a single cell in my output data like I have mentioned in the table above (see "DBLINKS" column for example)
ADDITIONAL INFORMATION :-
From the "DBLINKS" columns the main information I want to fetch is only "BIGG ID no.", "CHEMSPIDER ID no.", "CAS ID no.", "PUBCHEM ID no." and "CHEBI ID no." but I will figure that out later if it's so complex to implement.
I am not sure but I was thinking of using the reshape function in R for this but not sure about the use of parameters in my specific case.
Any help to achieve the desired result is genuinely appreciated. Thanks so much in advance.
Link to the sample data file - Sample Data
EDIT: Further Clarification :-
I want my final output to be like this (example is a made up dummy data not the actual data for the sake of simplicity to understand) -
UNIQUE-ID TYPES COMMON-NAME CHEMICAL-FORMULA DBLINKS MOLECULAR-WEIGHT MONOISOTOPIC-MW
ID-1 A C-Name-1 (C 5)(H 5)(N 1) Detail-1 79.101 NA
ID-2 B; C; D C-Name-2 (C 18)(H 35)(O 2) Detail-2; Detail-3 283.473 284.27
ID-3 E; F C-Name-3 (C 19)(H 29)(O 5)(S 1) Detail-2; Detail-1; Detail-5 369.495 370.18
from the following data format (csv file) that I have now. The source data looks something like this -
Variable Content Id
UNIQUE-ID ID-1 1
TYPES A 1
COMMON-NAME C-Name-1 1
CHEMICAL-FORMULA (C 5) 1
CHEMICAL-FORMULA (H 5) 1
CHEMICAL-FORMULA (N 1) 1
DBLINKS Detail-1 1
MOLECULAR-WEIGHT 79.101 1
UNIQUE-ID ID-2 2
TYPES B 2
TYPES C 2
TYPES D 2
COMMON-NAME C-Name-2 2
CHEMICAL-FORMULA (C 18) 2
CHEMICAL-FORMULA (H 35) 2
CHEMICAL-FORMULA (O 2) 2
DBLINKS Detail-2 2
DBLINKS Detail-3 2
MOLECULAR-WEIGHT 283.473 2
MONOISOTOPIC-MW 284.27 2
UNIQUE-ID ID-3 3
TYPES E 3
TYPES F 3
COMMON-NAME C-Name-3 3
CHEMICAL-FORMULA (C 19) 3
CHEMICAL-FORMULA (H 29) 3
CHEMICAL-FORMULA (O 5) 3
CHEMICAL-FORMULA (S 1) 3
DBLINKS Detail-2 3
DBLINKS Detail-1 3
DBLINKS Detail-5 3
MOLECULAR-WEIGHT 369.495 3
MONOISOTOPIC-MW 370.18 3
Upvotes: 1
Views: 47
Reputation: 1210
Here is how I would do it with tidyr
and dplyr
packages:
library(tidyr)
library(dplyr)
df <- read.table("test.txt", header = T)
df2 <- df %>% group_by(Variable, Id) %>%
summarise(Content2 = paste(Content, collapse = ";")) %>%
spread(key = Variable, value = Content2) %>%
select("UNIQUE-ID", TYPES, "COMMON-NAME", "CHEMICAL-FORMULA", DBLINKS,
"MOLECULAR-WEIGHT", "MONOISOTOPIC-MW")
df2
# A tibble: 3 x 7
`UNIQUE-ID` TYPES `COMMON-NAME` `CHEMICAL-FORMULA` DBLINKS `MOLECULAR-WEIGH~ `MONOISOTOPIC-M~
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ID-1 A C-Name-1 (C5);(H5);(N1) Detail-1 79.101 NA
2 ID-2 B;C;D C-Name-2 (C18);(H35);(O2) Detail-2;Detail-3 283.473 284.27
3 ID-3 E;F C-Name-3 (C19);(H29);(O5);(S1) Detail-2;Detail-1~ 369.495 370.18
If you want to learn more about each of the functions used here check out R for Data Science
Upvotes: 1