Reputation: 77
I downloaded a dataset of over 2 million rows which looks like this (I have 10 columns/variables, starting at "messageid". I don't know what the "id" columns or variables?? are supposed to be but I don't need those for the analysis):
{
"_id": {
"$id": "fh37fc3huc3"
},
"messageid": "4757724838492485088139042828",
"attachments": [],
"usernameid": "47284592942",
"username": "Alex",
"server": "475774810304151552",
"text": "Must watch",
"type": "462050823720009729",
"datetime": "2018-08-05T21:20:20.486000+00:00",
"type": {
"$numberLong": "0"
}
}
{
"_id": {
"$id": "23453532dwq"
},
"messageid": "232534",
"attachments": [],
"usernameid": "273342",
"usernameid": "Alice",
"server": "475774810304151552",
"text": "https://www.youtube.com/",
"type": "4620508237200097wd29",
"datetime": "2018-08-05T21:20:11.803000+00:00",
"type": {
"$numberLong": "0"
}
I tried the obvious before I realized that this was a "nested dataset":
test <- read.csv(test.csv, row.names=NULL)
This is the error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
If I don't use "row.names=NULL"
test <- read.csv(test.csv)
This is the error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
Does anyone know how I can read this nested dataset into R where every variable is a column and every observation is a row?
Upvotes: 0
Views: 73
Reputation: 10627
This file is of type json and not csv. You need to add []
to indicate it is a json list. Then you can enframe the tree into a table:
Content of file data.json
:
[{
"_id": {
"$id": "fh37fc3huc3"
},
"messageid": "4757724838492485088139042828",
"attachments": [],
"usernameid": "47284592942",
"username": "Alex",
"server": "475774810304151552",
"text": "Must watch",
"type": "462050823720009729",
"datetime": "2018-08-05T21:20:20.486000+00:00",
"type": {
"$numberLong": "0"
}
},
{
"_id": {
"$id": "23453532dwq"
},
"messageid": "232534",
"attachments": [],
"usernameid": "273342",
"usernameid": "Alice",
"server": "475774810304151552",
"text": "https://www.youtube.com/",
"type": "4620508237200097wd29",
"datetime": "2018-08-05T21:20:11.803000+00:00",
"type": {
"$numberLong": "0"
}
}
]
Code:
library(jsonlite)
library(tidyverse)
fromJSON("data.json") %>% as_tibble()
#> # A tibble: 2 x 9
#> `_id`$`$id` messageid attachments usernameid username server text type
#> <chr> <chr> <list> <chr> <chr> <chr> <chr> <chr>
#> 1 fh37fc3huc3 4757724838… <list [0]> 47284592942 Alex 475774… Must … 46205…
#> 2 23453532dwq 232534 <list [0]> 273342 <NA> 475774… https… 46205…
#> # … with 1 more variable: datetime <chr>
Created on 2021-11-09 by the reprex package (v2.0.1)
Upvotes: 0