spideypack
spideypack

Reputation: 113

Making redundant rows names (variables) as columns and rearranging the required data in a csv file in R

I have the data in this format - Source Data

and I want to rearrange my data in the following format -

UNIQUE-ID       TYPES      COMMON-NAME       CHEMICAL-FORMULA           DBLINKS                           MOLECULAR-WEIGHT     MONOISOTOPIC-MW
PYRIDINE-RING   Rings     pyridine-ring       (C 5)(H 5)(N 1)            NA                               79.101               NA
STEARIC_ACID    Even-St.. stearate            (C 18)(H 35)(O 2)   (BIGG "37799")..;(CHEMSPIDER "..);      283.473              284.2715304
BETA-HYDROX..  Compounds  3-β-hydro..    (C 19)(H 29)(O...   (CHEBI "1724"..);(PUBCHEM "2520..);     369.495              370.1813984
OXALACETIC_A.. Compounds    ...                ....                ....                                   ....                 ...
....             ....       ...
....             ....       ...

So I want to arrange the data based on "UNIQUE-ID", "TYPES", "COMMON-NAME", "CHEMICAL-FORMULA", "MOLECULAR-WEIGHT" and "MONOISOTOPIC-MW". My code should consider the rows that are associated with the same id value as one single row and rearrange that information in a new dataframe - for example, every row that has id "1" in the original csv file should be rewritten in one row in the format I mentioned above.

Now the problem with the data is that for some of the "UNIQUE-ID" there are several "TYPES" (2, 3 or in some cases 5) and I want them to be separated by a semicolon(;) in a single cell in my output data like I have mentioned in the table above (see "DBLINKS" column for example)

ADDITIONAL INFORMATION :-

From the "DBLINKS" columns the main information I want to fetch is only "BIGG ID no.", "CHEMSPIDER ID no.", "CAS ID no.", "PUBCHEM ID no." and "CHEBI ID no." but I will figure that out later if it's so complex to implement.

I am not sure but I was thinking of using the reshape function in R for this but not sure about the use of parameters in my specific case.

Any help to achieve the desired result is genuinely appreciated. Thanks so much in advance.

Link to the sample data file - Sample Data

EDIT: Further Clarification :-

I want my final output to be like this (example is a made up dummy data not the actual data for the sake of simplicity to understand) -

UNIQUE-ID    TYPES       COMMON-NAME    CHEMICAL-FORMULA       DBLINKS                       MOLECULAR-WEIGHT    MONOISOTOPIC-MW

ID-1          A           C-Name-1      (C 5)(H 5)(N 1)        Detail-1                          79.101            NA   
ID-2          B; C; D     C-Name-2      (C 18)(H 35)(O 2)      Detail-2; Detail-3                283.473          284.27    
ID-3          E; F        C-Name-3      (C 19)(H 29)(O 5)(S 1) Detail-2; Detail-1; Detail-5      369.495          370.18

from the following data format (csv file) that I have now. The source data looks something like this -

Variable         Content       Id
UNIQUE-ID         ID-1          1
TYPES             A             1
COMMON-NAME       C-Name-1      1
CHEMICAL-FORMULA  (C 5)         1
CHEMICAL-FORMULA  (H 5)         1
CHEMICAL-FORMULA  (N 1)         1
DBLINKS           Detail-1      1
MOLECULAR-WEIGHT  79.101        1
UNIQUE-ID         ID-2          2
TYPES             B             2
TYPES             C             2
TYPES             D             2
COMMON-NAME       C-Name-2      2
CHEMICAL-FORMULA  (C 18)        2
CHEMICAL-FORMULA  (H 35)        2
CHEMICAL-FORMULA  (O 2)         2
DBLINKS           Detail-2      2
DBLINKS           Detail-3      2
MOLECULAR-WEIGHT  283.473       2
MONOISOTOPIC-MW   284.27        2
UNIQUE-ID         ID-3          3
TYPES             E             3
TYPES             F             3
COMMON-NAME       C-Name-3      3
CHEMICAL-FORMULA  (C 19)        3           
CHEMICAL-FORMULA  (H 29)        3
CHEMICAL-FORMULA  (O 5)         3
CHEMICAL-FORMULA  (S 1)         3
DBLINKS           Detail-2      3
DBLINKS           Detail-1      3
DBLINKS           Detail-5      3
MOLECULAR-WEIGHT  369.495       3
MONOISOTOPIC-MW   370.18        3

Upvotes: 1

Views: 47

Answers (1)

see24
see24

Reputation: 1210

Here is how I would do it with tidyr and dplyr packages:

library(tidyr)
library(dplyr)
df <- read.table("test.txt", header = T) 

df2 <- df %>% group_by(Variable, Id) %>% 
  summarise(Content2 = paste(Content, collapse = ";")) %>% 
  spread(key = Variable, value = Content2) %>% 
  select("UNIQUE-ID", TYPES, "COMMON-NAME", "CHEMICAL-FORMULA", DBLINKS, 
         "MOLECULAR-WEIGHT", "MONOISOTOPIC-MW")

df2
# A tibble: 3 x 7
  `UNIQUE-ID` TYPES `COMMON-NAME` `CHEMICAL-FORMULA`    DBLINKS            `MOLECULAR-WEIGH~ `MONOISOTOPIC-M~
  <chr>       <chr> <chr>         <chr>                 <chr>              <chr>             <chr>           
1 ID-1        A     C-Name-1      (C5);(H5);(N1)        Detail-1           79.101            NA              
2 ID-2        B;C;D C-Name-2      (C18);(H35);(O2)      Detail-2;Detail-3  283.473           284.27          
3 ID-3        E;F   C-Name-3      (C19);(H29);(O5);(S1) Detail-2;Detail-1~ 369.495           370.18

If you want to learn more about each of the functions used here check out R for Data Science

Upvotes: 1

Related Questions