Reputation: 133
I have recently been downloading large quantities of Tweets from Twitter. My starting point is around 400 .txt files containing Tweet IDs. A tool then scrapes the Tweets from Twitter using those IDs, and for every .txt file with a large list of Tweet IDs I get a very large .txt file containing JSON strings, where each JSON string holds all of the information about one Tweet. Below is a hyperlink to my OneDrive containing the file I am working on (once I get this to work, I will apply the code to the other files):
https://1drv.ms/t/s!At39YLF-U90fhKAp9tIGJlMlU0qcNQ
I have been trying to parse each JSON string in each file, with no success. My aim is to convert each file into a large data frame in R: each row will be a Tweet and each column a feature of the Tweet. Given their nature, the 'text' column will be very long (it will contain the body of the Tweet), whereas 'location' will be short. Each JSON string is formatted in the same way, and there can be up to a million strings per file.
I have tried several methods (shown below) to obtain what I need:
library('RJSONIO')
library('RCurl')
json_file <- fromJSON("Pashawar_test.txt")
json_file2 = RJSONIO::fromJSON(json_file)
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘fromJSON’ for signature ‘"list", "missing"’
My other attempt:
library('RJSONIO')
json_file <- fromJSON("Pashawar_test.txt")
text <- json_file[['text']]
idstr <- json_file[['id_str']]
This code seems to parse only the first JSON string in the file. I say this because when I attempt to select 'text' or 'id_str', I only get one instance. It's also worth pointing out that the 'json_file' object is a large list that is 52.7 MB in size, whereas the source file is 335 MB.
Upvotes: 1
Views: 3651
Reputation: 78832
That's a [n]ewline [d]elimited [json] (ndjson) file, which the ndjson package was tailor-made for. Said package is very measurably faster than jsonlite::stream_in() and produces a "completely flat" data frame. That latter part ("completely flat") isn't always what folks really need, as it can make for a very wide structure (in your case 1,012 columns, since it expanded all the nested components), but you get what you need fast without having to unnest anything on your own.
The output of str() or even glimpse() is too large to show here, but this is how you use it. NOTE that I renamed your file, since .json.gz is generally how ndjson is stored (and my package can handle gzip'd JSON files):
library(ndjson)
library(tidyverse)
twdf <- tbl_df(ndjson::stream_in("~/Desktop/pashwar-test.json.gz"))
## dim(twdf)
## [1] 75008 1012
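If you then only need a handful of those 1,012 columns, you can carve them out straight away. A minimal sketch (the column names are assumed here from the standard Tweet payload; adjust them to your data):
# pick out just the fields of interest from the flat data frame
# (column names assumed from the Tweet payload; dplyr comes with tidyverse)
tweets <- twdf %>%
  select(id_str, created_at, text, lang, retweet_count)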
Having said that…
I was alternatively going to suggest using Apache Drill, since you have many of these files and they're relatively big. Drill would let you (ultimately) convert these to parquet and significantly speed things up, and there's a package to interface with Drill (sergeant):
library(sergeant)
library(tidyverse)
db <- src_drill("dbserver")
twdf <- tbl(db, "dfs.json.`pashwar-test.json.gz`")
glimpse(twdf)
## Observations: 25
## Variables: 28
## $ extended_entities <chr> "{\"media\":[]}", "{\"media\":[]}", "{\"m...
## $ quoted_status <chr> "{\"entities\":{\"hashtags\":[],\"symbols...
## $ in_reply_to_status_id_str <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ created_at <chr> "Tue Dec 16 10:13:47 +0000 2014", "Tue De...
## $ in_reply_to_user_id_str <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ source <chr> "<a href=\"http://twitter.com/download/an...
## $ retweeted_status <chr> "{\"created_at\":\"Tue Dec 16 09:28:17 +0...
## $ quoted_status_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ retweet_count <int> 220, 109, 9, 103, 0, 398, 0, 11, 472, 88,...
## $ retweeted <chr> "false", "false", "false", "false", "fals...
## $ geo <chr> "{\"coordinates\":[]}", "{\"coordinates\"...
## $ is_quote_status <chr> "false", "false", "false", "false", "fals...
## $ in_reply_to_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ id_str <dbl> 5.447975e+17, 5.447975e+17, 5.447975e+17,...
## $ in_reply_to_user_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ favorite_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ id <dbl> 5.447975e+17, 5.447975e+17, 5.447975e+17,...
## $ text <chr> "RT @afneil: Heart-breaking beyond words:...
## $ place <chr> "{\"bounding_box\":{\"coordinates\":[]},\...
## $ lang <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ favorited <chr> "false", "false", "false", "false", "fals...
## $ possibly_sensitive <chr> NA, "false", NA, "false", NA, "false", NA...
## $ coordinates <chr> "{\"coordinates\":[]}", "{\"coordinates\"...
## $ truncated <chr> "false", "false", "false", "false", "fals...
## $ entities <chr> "{\"user_mentions\":[{\"screen_name\":\"a...
## $ quoted_status_id_str <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ user <chr> "{\"id\":25968369,\"id_str\":\"25968369\"...
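If you do go the Drill route, the one-time conversion to parquet mentioned above can be done with a CTAS (CREATE TABLE AS SELECT) query. A minimal sketch, assuming a Drill instance on localhost, the same dfs.json workspace as above, and the default writable dfs.tmp workspace:
library(sergeant)
# convert the ndjson file to parquet once; later queries hit the
# (much faster) parquet copy instead of re-parsing the JSON
dc <- drill_connection("localhost")
drill_query(dc, "
  CREATE TABLE dfs.tmp.`pashwar-test-parquet` AS
  SELECT * FROM dfs.json.`pashwar-test.json.gz`
")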
BUT
you've managed to create really inconsistent JSON. Not all fields with nested content are consistently represented that way, and newcomers to Drill will find it somewhat challenging to craft bulletproof SQL that will help them unnest that data across all scenarios.
If you only need the data from the "already flat" bits, give Drill a try.
If you need the nested data and don't want to fight with unnesting from jsonlite::stream_in() or struggle with Drill unnesting, then I'd suggest using ndjson as noted in the first example and then carving out the bits you really need into more manageable, tidy data frames.
Upvotes: 1
Reputation: 24490
Try the stream_in function of the jsonlite package. Your file contains one JSON object per line. Either you read the file line by line and convert each line through fromJSON, or you use stream_in directly, which is made for handling exactly this kind of file/connection.
require(jsonlite)

filepath <- "path/to/your/file"

# method A: read each line and convert individually
content <- readLines(filepath)
# this will take a while
res <- lapply(content, fromJSON)

# method B: use stream_in on an open connection
con <- file(filepath, open = "rt")
# this will take a while
res <- stream_in(con)
close(con)
Notice that stream_in will also simplify the result, coercing it to a data.frame, which might be handier.
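Whichever method you use, the fields the question asks for then become ordinary columns. A minimal sketch (field names assumed from the Tweet payload): since stream_in() may leave nested parts as embedded data.frames, jsonlite::flatten() spreads them into top-level columns:
# spread nested data.frames (e.g. 'user') into ordinary columns
df <- flatten(res)
head(df$text)     # the Tweet body
head(df$id_str)   # the Tweet ID as a string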
Upvotes: 2