Reputation: 915
I am trying to display this dataset -> https://mtgjson.com/json/AllSets.json.zip
However I would like to flatten the data so that it isn't nested as a bunch of JSON data within a list, within a list, within a list.
More specifically, I am trying to display the data as a dataframe, ranked in order of $releaseDate
(one of the variables).
Here is my attempt so far:
library(jsonlite)
library(tidyjson)
mtgdata <- fromJSON("~/path/to/file.json")
The result of mtgdata presents this list of lists:
summary(mtgdata)
Length Class Mode
UST 9 -none- list
UNH 10 -none- list
UGL 11 -none- list
pWOS 8 -none- list
pWOR 8 -none- list
pWCQ 8 -none- list
pSUS 8 -none- list
pSUM 10 -none- list
pREL 8 -none- list
pPRO 8 -none- list
pPRE 8 -none- list
pPOD 7 -none- list
pMPR 8 -none- list
pMGD 8 -none- list
pMEI 8 -none- list
pLPA 8 -none- list
pLGM 8 -none- list
pJGP 10 -none- list
pHHO 11 -none- list
pWPN 8 -none- list
pGTW 8 -none- list
pGRU 10 -none- list
pGPX 8 -none- list
pFNM 10 -none- list
pELP 8 -none- list
pDRC 7 -none- list
pCMP 8 -none- list
pCEL 8 -none- list
pARL 8 -none- list
pALP 10 -none- list
p2HG 8 -none- list
p15A 8 -none- list
PD3 9 -none- list
PD2 9 -none- list
H09 9 -none- list
PTK 12 -none- list
POR 12 -none- list
PO2 13 -none- list
PCA 7 -none- list
PC2 10 -none- list
HOP 10 -none- list
VMA 9 -none- list
MMA 10 -none- list
MM3 8 -none- list
MM2 11 -none- list
MED 9 -none- list
ME4 9 -none- list
ME3 9 -none- list
ME2 9 -none- list
IMA 8 -none- list
EMA 9 -none- list
A25 8 -none- list
MPS_AKH 8 -none- list
MPS 9 -none- list
EXP 9 -none- list
E02 7 -none- list
V17 8 -none- list
V16 7 -none- list
V15 9 -none- list
V14 9 -none- list
V13 9 -none- list
V12 10 -none- list
V11 10 -none- list
V10 9 -none- list
V09 10 -none- list
DRB 9 -none- list
EVG 9 -none- list
DDT 7 -none- list
DDS 7 -none- list
DDR 7 -none- list
DDQ 8 -none- list
DDP 10 -none- list
DDO 10 -none- list
DDN 10 -none- list
DDM 10 -none- list
DDL 10 -none- list
DDK 10 -none- list
DDJ 10 -none- list
DDI 10 -none- list
DDH 10 -none- list
DDG 10 -none- list
DDF 10 -none- list
DDE 10 -none- list
DDD 9 -none- list
DDC 9 -none- list
DD3_JVC 9 -none- list
DD3_GVL 9 -none- list
DD3_EVG 9 -none- list
DD3_DVD 9 -none- list
DD2 11 -none- list
CNS 11 -none- list
CN2 9 -none- list
CMD 11 -none- list
CMA 7 -none- list
CM1 10 -none- list
C17 6 -none- list
C16 8 -none- list
C15 10 -none- list
C14 10 -none- list
C13 10 -none- list
CEI 9 -none- list
CED 9 -none- list
E01 7 -none- list
ARC 9 -none- list
ZEN 12 -none- list
XLN 12 -none- list
WWK 12 -none- list
WTH 13 -none- list
W17 8 -none- list
W16 8 -none- list
VIS 13 -none- list
VAN 8 -none- list
USG 13 -none- list
ULG 13 -none- list
UDS 13 -none- list
TSP 12 -none- list
TSB 12 -none- list
TPR 11 -none- list
TOR 12 -none- list
TMP 13 -none- list
THS 12 -none- list
STH 13 -none- list
SOM 12 -none- list
SOK 12 -none- list
SOI 10 -none- list
SHM 12 -none- list
SCG 12 -none- list
S99 11 -none- list
S00 11 -none- list
RTR 12 -none- list
RQS 6 -none- list
ROE 12 -none- list
RIX 12 -none- list
RAV 12 -none- list
PLS 13 -none- list
PLC 12 -none- list
PCY 13 -none- list
ORI 11 -none- list
ONS 12 -none- list
OGW 10 -none- list
ODY 13 -none- list
NPH 12 -none- list
NMS 14 -none- list
MRD 12 -none- list
MOR 12 -none- list
MMQ 13 -none- list
MIR 13 -none- list
MGB 10 -none- list
MD1 9 -none- list
MBS 12 -none- list
M15 11 -none- list
M14 11 -none- list
M13 11 -none- list
M12 11 -none- list
M11 11 -none- list
M10 11 -none- list
LRW 12 -none- list
LGN 12 -none- list
LEG 12 -none- list
LEB 11 -none- list
LEA 11 -none- list
KTK 12 -none- list
KLD 9 -none- list
JUD 12 -none- list
JOU 12 -none- list
ITP 11 -none- list
ISD 12 -none- list
INV 13 -none- list
ICE 13 -none- list
HOU 9 -none- list
HML 12 -none- list
GTC 12 -none- list
GPT 12 -none- list
FUT 12 -none- list
FRF_UGIN 10 -none- list
FRF 12 -none- list
FEM 11 -none- list
EXO 13 -none- list
EVE 12 -none- list
EMN 9 -none- list
DTK 12 -none- list
DST 12 -none- list
DRK 12 -none- list
DPA 9 -none- list
DKM 9 -none- list
DKA 12 -none- list
DIS 12 -none- list
DGM 12 -none- list
CST 11 -none- list
CSP 12 -none- list
CP3 7 -none- list
CP2 7 -none- list
CP1 7 -none- list
CON 13 -none- list
CHR 11 -none- list
CHK 12 -none- list
BTD 10 -none- list
BRB 10 -none- list
BOK 12 -none- list
BNG 12 -none- list
BFZ 12 -none- list
AVR 12 -none- list
ATQ 11 -none- list
ATH 9 -none- list
ARN 11 -none- list
ARB 12 -none- list
APC 13 -none- list
ALL 13 -none- list
ALA 12 -none- list
AKH 9 -none- list
AER 9 -none- list
9ED 12 -none- list
8ED 12 -none- list
7ED 12 -none- list
6ED 12 -none- list
5ED 12 -none- list
5DN 12 -none- list
4ED 12 -none- list
3ED 12 -none- list
2ED 11 -none- list
10E 11 -none- list
Within each of these lists are variables that I am interested in analysing to filter and sort this data as though it were a flattened dataframe.
When we inspect the list of variables within one of our lists (taking "mtgdata$UST" for example) we get this group of variables:
names(mtgdata$UST)
[1] "name" "code" "releaseDate" "border" "type"
"booster" "mkm_name"
[8] "mkm_id" "cards"
Running the same query on another list within mtgdata ("mtgdata$SOI") we get another group of variables, though they are mostly the same.
As I have mentioned above, I am primarily interested in flattening this dataset and ranking by mtgdata$releaseDate - but as it stands, the "$releaseDate" is currently nested within the first group of lists ("$UST" etc.)
Your help with this or how I might better rephrase this question would be greatly appreciated.
Upvotes: 1
Views: 3245
Reputation: 78832
You could try something like this on the command-line to convert the array of JSON objects into an file ndjson records then use something like ndjson::stream_in("filename_of the_thing_you_just_converted")
but you'll end up with a 14,000+ column, quite useless, "flat" data frame.
Instead, do some spelunking:
library(tidyverse)
as1 <- jsonlite::read_json("~/Downloads/AllSets.json")
str(as1, 1)
## List of 221
## $ UST :List of 9
## $ UNH :List of 10
## $ UGL :List of 11
## $ pWOS :List of 8
## $ pWOR :List of 8
## $ pWCQ :List of 8
## $ pSUS :List of 8
## $ pSUM :List of 10
## $ pREL :List of 8
## $ pPRO :List of 8
## $ pPRE :List of 8
## $ pPOD :List of 7
## $ pMPR :List of 8
## $ pMGD :List of 8
## $ pMEI :List of 8
## $ pLPA :List of 8
## $ pLGM :List of 8
## $ pJGP :List of 10
## $ pHHO :List of 11
## ...
Ugh…one of "those" JSON files that doesn't see fit to have all the elements of each record populated even when the entire file is — in theory — supposed to be consistent.
Let's see what JSON array elements have the most number of fields filled in since that means those likely have them all populated:
map_dbl(as1, length) %>%
broom::tidy() %>%
arrange(desc(x))
## # A tibble: 221 x 2
## names x
## <chr> <dbl>
## 1 NMS 14.0
## 2 PO2 13.0
## 3 WTH 13.0
## 4 VIS 13.0
## 5 USG 13.0
## 6 ULG 13.0
## 7 UDS 13.0
## 8 TMP 13.0
## 9 STH 13.0
## 10 PLS 13.0
## # ... with 211 more rows
Let's take a look at NMS
:
str(as1[["NMS"]], 1)
## List of 14
## $ name : chr "Nemesis"
## $ code : chr "NMS"
## $ gathererCode : chr "NE"
## $ magicCardsInfoCode: chr "ne"
## $ oldCode : chr "NEM"
## $ releaseDate : chr "2000-02-14"
## $ border : chr "black"
## $ type : chr "expansion"
## $ block : chr "Masques"
## $ booster :List of 15
## $ translations :List of 5
## $ mkm_name : chr "Nemesis"
## $ mkm_id : int 32
## $ cards :List of 143
You really don't want to flatten booster
, translations
or cards
and should keep them as list
columns and unnest
as necessary.
But, since each records has varying fields, we can't simply "data.table::rbindlist()or
dplyr::bind_rows()` since it's going to complain about a few of those columns.
We'll have to go record-by-record and convert each to a data frame, handling missing fields and wrapping the list
ones in a list()
. We'll simplify the functional idiom with a helper function to test for missing values:
`%l0%` <- function(x, y) if (length(x) > 0) x else y
^^ is a bit more robust than %||%
that comes along for the ride with purrr
.
Finally:
map_df(as1, ~{
data_frame(
name = .x$name %l0% NA_character_,
code = .x$code,
gathererCode = .x$gathererCode %l0% NA_character_,
magicCardsInfoCode = .x$magicCardsInfoCode %l0% NA_character_,
oldCode = .x$oldCode %l0% NA_character_,
releaseDate = .x$releaseDate %l0% NA_character_,
border = .x$border,
type = .x$type,
block = .x$block %l0% NA_character_,
booster = list(.x$booster),
translations = list(.x$translations),
mkm_name = .x$mkm_name %l0% NA_character_,
mkm_id = .x$mkm_id %l0% NA_character_,
cards = list(.x$cards)
)
}) -> all_sets
And, you can see the result:
all_sets
## # A tibble: 221 x 14
## name code gathererCode magicCardsInfoC… oldCode releaseDate border type block booster
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <list>
## 1 Unstable UST NA NA NA 2017-12-08 silver un NA <list […
## 2 Unhinged UNH NA uh NA 2004-11-20 silver un NA <list […
## 3 Unglued UGL UG ug NA 1998-08-11 silver un NA <list […
## 4 Wizards of th… pWOS NA wotc NA 1999-09-04 black promo NA <NULL>
## 5 Worlds pWOR NA wrl NA 1999-08-04 black promo NA <NULL>
## 6 World Magic C… pWCQ NA wmcq NA 2013-04-06 black promo NA <NULL>
## 7 Super Series pSUS NA sus NA 1999-12-01 black promo NA <NULL>
## 8 Summer of Mag… pSUM NA sum NA 2007-07-21 black promo NA <NULL>
## 9 Release Events pREL NA rep NA 2003-07-26 black promo NA <NULL>
## 10 Pro Tour pPRO NA pro NA 2007-02-09 black promo NA <NULL>
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## # cards <list>
glimpse(all_sets)
## Observations: 221
## Variables: 14
## $ name <chr> "Unstable", "Unhinged", "Unglued", "Wizards of the Coast Online Store"...
## $ code <chr> "UST", "UNH", "UGL", "pWOS", "pWOR", "pWCQ", "pSUS", "pSUM", "pREL", "...
## $ gathererCode <chr> NA, NA, "UG", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ magicCardsInfoCode <chr> NA, "uh", "ug", "wotc", "wrl", "wmcq", "sus", "sum", "rep", "pro", "pt...
## $ oldCode <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ releaseDate <chr> "2017-12-08", "2004-11-20", "1998-08-11", "1999-09-04", "1999-08-04", ...
## $ border <chr> "silver", "silver", "silver", "black", "black", "black", "black", "bla...
## $ type <chr> "un", "un", "un", "promo", "promo", "promo", "promo", "promo", "promo"...
## $ block <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ booster <list> [["rare", "uncommon", "uncommon", "uncommon", "common", "common", "co...
## $ translations <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NU...
## $ mkm_name <chr> "Unstable", "Unhinged", "Unglued", NA, NA, NA, NA, "Summer Magic", NA,...
## $ mkm_id <int> 1821, 59, 22, NA, NA, NA, NA, 76, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cards <list> [[["Andrea Radeck", 1, ["W"], ["White"], "95ebdf85f4ea74d584dfdfb72e3...
And, we can arrange them by releaseDate
after converting the column to a proper Date
object:
mutate(all_sets, releaseDate = lubridate::ymd(releaseDate)) %>%
arrange(desc(releaseDate))
## # A tibble: 221 x 14
## name code gathererCode magicCardsInfoCo… oldCode releaseDate border type block booster
## <chr> <chr> <chr> <chr> <chr> <date> <chr> <chr> <chr> <list>
## 1 Masters 25 A25 NA a25 NA 2018-03-16 black reprint NA <NULL>
## 2 Rivals of … RIX NA rix NA 2018-01-19 black expansi… Ixal… <list …
## 3 Unstable UST NA NA NA 2017-12-08 silver un NA <list …
## 4 Explorers … E02 NA e02 NA 2017-11-24 black board g… NA <NULL>
## 5 From the V… V17 NA v17 NA 2017-11-24 black from th… NA <NULL>
## 6 Iconic Mas… IMA NA ima NA 2017-11-17 black reprint NA <list …
## 7 Duel Decks… DDT NA ddt NA 2017-11-10 black duel de… NA <NULL>
## 8 Ixalan XLN NA xln NA 2017-09-29 black expansi… Ixal… <list …
## 9 Commander … C17 NA NA NA 2017-08-25 black command… NA <NULL>
## 10 Hour of De… HOU NA hou NA 2017-07-14 black expansi… Amon… <list …
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## # cards <list>
Upvotes: 5