Lewis Wong
Lewis Wong

Reputation: 93

XML convert to JSON R

The packages in R doesn't seem to work properly in conversion of XML to JSON. I've tried RJSONIO, rjson and jsonlite with 'XML' package. I first parsed the XML and convert it into list using XML::xmlToList() and convert them to JSON using toJSON() from those 3 packages.

My XML file:

<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>

My source code:

library(XML)
library(RJSONIO)
library(rjson)
library(jsonlite)

xml_parse <- xmlTreeParse("~/Downloads/test.xml", useInternalNodes=TRUE)
xml_root <- xmlRoot(xml_parse)
xml_list <- xmlToList(xml_root, simplify = TRUE)

#jsonlite package
xml_jsonlite <- jsonlite::toJSON(xml_list)
write(xml_jsonlite, "test_jsonlite.json")

#RJSONIO package
xml_rjsonio <- RJSONIO::toJSON(xml_list)
write(xml_rjsonio, "test_rjsonio.json")

#rjson package
xml_rjson <- RJSONIO::toJSON(xml_list)
write(xml_rjson, "test_rjson.json")

Converted JSON file from RJSONIO:

{
"row": {
    "Id": "98",
    "PostId": "10",
    "VoteTypeId": "2",
    "CreationDate": "2014-05-14T00:00:00.000" 
 },
"row": {
    "Id": "99",
    "PostId": "7",
    "VoteTypeId": "5",
    "UserId": "111",
    "CreationDate": "2014-05-14T00:00:00.000" 
 }
}

Which is clearly wrong because of duplicate field name.

Converted JSON file from jsonlite:

{"row":["1","1","2","2014-05-13T00:00:00.000"],
 "row.1":["2","1","2","2014-05-13T00:00:00.000"],
 "row.2":["3","3","2","2014-05-13T00:00:00.000"]}

Which is weird because there should be only one field name "row" with array of subdocuments instead of incrementing array of "rows". It doesn't even have field names in it.

Converted JSON file from rjson:

{
 "row": {
 "Id": "1",
"PostId": "1",
"VoteTypeId": "2",
"CreationDate": "2014-05-13T00:00:00.000" 
},
"row": {
 "Id": "2",
"PostId": "1",
"VoteTypeId": "2",
"CreationDate": "2014-05-13T00:00:00.000" 
}
}

The ideal JSON file would be as such:

{"votes" : {
    "row" : [
        {
            "Id" : "1",
            "PostId" : "1",
            "VoteTypeId" : "2",
            "CreationDate" : "2014-05-13T00:00:00.000"
        },
        {
            "Id" : "2",
            "PostId" : "1",
            "VoteTypeId" : "2",
            "CreationDate" : "2014-05-13T00:00:00.000"
        }
       ]
      }
}

Looking for solution. Any help is appreciated.

Upvotes: 2

Views: 1794

Answers (1)

hrbrmstr
hrbrmstr

Reputation: 78792

xml2 and jsonlite get you most of the way there but you haven't even shown us you know R code let alone have really attempted a solution for this, so here's a partial solution posted so it can help others:

library(xml2)
library(jsonlite)

read_xml('<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>') -> doc

x <- xml2::as_list(doc) 

xl <- lapply(x, attributes)

toJSON(xl, pretty = TRUE, auto_unbox = TRUE)
## {
##   "row": {
##     "Id": "1",
##     "PostId": "1",
##     "VoteTypeId": "2",
##     "CreationDate": "2014-05-13T00:00:00.000"
##   },
##   "row.1": {
##     "Id": "2",
##     "PostId": "1",
##     "VoteTypeId": "2",
##     "CreationDate": "2014-05-13T00:00:00.000"
##   },
##   "row.2": {
##     "Id": "3",
##     "PostId": "3",
##     "VoteTypeId": "2",
##     "CreationDate": "2014-05-13T00:00:00.000"
##   }
## } 

Per your comment

What you want isn't how the data is structured. Which means if you want something, you can't use canned, vanilla utilities.

xml_find_all(doc, "//votes/row") %>% 
  map_chr(~{
    toJSON(as.list(xml_attrs(.x)), auto_unbox = TRUE, pretty = TRUE)
  }) %>% 
  paste0(collapse=",\n") %>% 
  gsub("[\n]", "\n    ", .) %>% 
  sprintf('{ "votes" : {\n  row" : [\n    %s]\n  }\n}', .) %>% 
  cat()

## { "votes" : {
##   row" : [
##     {
##       "Id": "1",
##       "PostId": "1",
##       "VoteTypeId": "2",
##       "CreationDate": "2014-05-13T00:00:00.000"
##     },
##     {
##       "Id": "2",
##       "PostId": "1",
##       "VoteTypeId": "2",
##       "CreationDate": "2014-05-13T00:00:00.000"
##     },
##     {
##       "Id": "3",
##       "PostId": "3",
##       "VoteTypeId": "2",
##       "CreationDate": "2014-05-13T00:00:00.000"
##     }]
##   }
## }

Upvotes: 8

Related Questions