Essan Rago
Essan Rago

Reputation: 77

How to transform a nested txt/json into a "proper" data frame in R

I downloaded a dataset of over 2 million rows which looks like this (I have 10 columns/variables, starting at "messageid". I don't know what the "id" columns or variables?? are supposed to be but I don't need those for the analysis):

{
    "_id": {
        "$id": "fh37fc3huc3"
    },
    "messageid": "4757724838492485088139042828",
    "attachments": [],
    "usernameid": "47284592942",
    "username": "Alex",
    "server": "475774810304151552",
    "text": "Must watch",
    "type": "462050823720009729",
    "datetime": "2018-08-05T21:20:20.486000+00:00",
    "type": {
        "$numberLong": "0"
    }
}

{
    "_id": {
        "$id": "23453532dwq"
    },
    "messageid": "232534",
    "attachments": [],
    "usernameid": "273342",
    "usernameid": "Alice",
    "server": "475774810304151552",
    "text": "https://www.youtube.com/",
    "type": "4620508237200097wd29",
    "datetime": "2018-08-05T21:20:11.803000+00:00",
    "type": {
        "$numberLong": "0"
    }

I tried the obvious before I realized that this was a "nested dataset":

test <- read.csv(test.csv, row.names=NULL)

This is the error message:

Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names

If I don't use "row.names=NULL"

test <- read.csv(test.csv)

This is the error message:

Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed

Does anyone know how I can read this nested dataset into R where every variable is a column and every observation is a row?

Upvotes: 0

Views: 73

Answers (1)

danlooo
danlooo

Reputation: 10627

This file is of type json and not csv. You need to add [] to indicate it is a json list. Then you can enframe the tree into a table:

Content of file data.json:

[{
  "_id": {
    "$id": "fh37fc3huc3"
  },
  "messageid": "4757724838492485088139042828",
  "attachments": [],
  "usernameid": "47284592942",
  "username": "Alex",
  "server": "475774810304151552",
  "text": "Must watch",
  "type": "462050823720009729",
  "datetime": "2018-08-05T21:20:20.486000+00:00",
  "type": {
    "$numberLong": "0"
  }
},

{
  "_id": {
    "$id": "23453532dwq"
  },
  "messageid": "232534",
  "attachments": [],
  "usernameid": "273342",
  "usernameid": "Alice",
  "server": "475774810304151552",
  "text": "https://www.youtube.com/",
  "type": "4620508237200097wd29",
  "datetime": "2018-08-05T21:20:11.803000+00:00",
  "type": {
    "$numberLong": "0"
  }
}
]

Code:

library(jsonlite)
library(tidyverse)

fromJSON("data.json") %>% as_tibble()
#> # A tibble: 2 x 9
#>   `_id`$`$id` messageid   attachments usernameid  username server  text   type  
#>   <chr>       <chr>       <list>      <chr>       <chr>    <chr>   <chr>  <chr> 
#> 1 fh37fc3huc3 4757724838… <list [0]>  47284592942 Alex     475774… Must … 46205…
#> 2 23453532dwq 232534      <list [0]>  273342      <NA>     475774… https… 46205…
#> # … with 1 more variable: datetime <chr>

Created on 2021-11-09 by the reprex package (v2.0.1)

Upvotes: 0

Related Questions