Importing and converting specific attributes of JSON files in R

Question

I have been given a rather large corpus of conversational data, from which I need to import the relevant information into R and run some statistical analysis.

The problem is I do not need half the information provided in each entry. Each line in a specific JSON file from the dataset relates to a particular conversation of the nature A->B->A. The attributes provided are contained within a nested array for each of the respective statements in the conversation. This is best illustrated diagrammatically:

What I need is simply to extract the 'actual_sentence' attribute from each turn (turn_1, turn_2, turn_3, i.e. A->B->A) and discard the rest.

So far my efforts have been in vain. I have been using the jsonlite package, which seems to import the JSON fine but lacks the 'tree depth' to discern between the specific attributes of each turn.

An example:

The following is an example of one row/record of a provided JSON formatted .txt file:

{"semantic_distance_1": 0.375, "semantic_distance_2": 0.6486486486486487, "turn_2": "{\"sentence\": [\"A\", \"transmission\", \"?\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"AT\", null, \".\"], \"semantic_set\": [\"infection.n.04\", \"vitamin_a.n.01\", \"angstrom.n.01\", \"transmittance.n.01\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"adenine.n.01\", \"a.n.07\", \"a.n.06\", \"deoxyadenosine_monophosphate.n.01\"], \"additional_info\": [], \"original_sentence\": \"A transmission?\", \"actual_sentence\": \"A transmission?\", \"dependency_grammar\": null, \"actor\": \"standard\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 58}", "turn_3": "{\"sentence\": [\"A\", \"voice\", \"transmission\", \".\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"AT\", \"NN\", null, \".\"], \"semantic_set\": [\"vitamin_a.n.01\", \"voice.n.10\", \"voice.n.09\", \"angstrom.n.01\", \"articulation.n.03\", \"deoxyadenosine_monophosphate.n.01\", \"a.n.07\", \"a.n.06\", \"infection.n.04\", \"spokesperson.n.01\", \"transmittance.n.01\", \"voice.n.02\", \"voice.n.03\", \"voice.n.01\", \"voice.n.06\", \"voice.n.07\", \"voice.n.05\", \"voice.v.02\", \"voice.v.01\", \"part.n.11\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"adenine.n.01\"], \"additional_info\": [], \"original_sentence\": \"A voice transmission.\", \"actual_sentence\": \"A voice transmission.\", \"dependency_grammar\": null, \"actor\": \"computer\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 59}", "turn_1": "{\"sentence\": [\"I\", \"have\", \"intercepted\", \"a\", \"transmission\", \"of\", \"unknown\", \"origin\", \".\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"PPSS\", \"HV\", \"VBD\", \"AT\", null, \"IN\", \"JJ\", \"NN\", \".\"], \"semantic_set\": [\"i.n.03\", \"own.v.01\", \"receive.v.01\", \"consume.v.02\", \"accept.v.02\", \"rich_person.n.01\", \"vitamin_a.n.01\", \"have.v.09\", 
\"have.v.07\", \"nameless.s.01\", \"have.v.01\", \"obscure.s.04\", \"have.v.02\", \"stranger.n.01\", \"angstrom.n.01\", \"induce.v.02\", \"hold.v.03\", \"wiretap.v.01\", \"give_birth.v.01\", \"a.n.07\", \"a.n.06\", \"deoxyadenosine_monophosphate.n.01\", \"infection.n.04\", \"unknown.n.03\", \"unknown.s.03\", \"get.v.03\", \"origin.n.03\", \"origin.n.02\", \"transmittance.n.01\", \"origin.n.05\", \"origin.n.04\", \"one.s.01\", \"have.v.17\", \"have.v.12\", \"have.v.10\", \"have.v.11\", \"take.v.35\", \"experience.v.03\", \"intercept.v.01\", \"unknown.n.01\", \"iodine.n.01\", \"strange.s.02\", \"suffer.v.02\", \"beginning.n.04\", \"one.n.01\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"lineage.n.01\", \"unknown.a.01\", \"adenine.n.01\"], \"additional_info\": [], \"original_sentence\": \"I have intercepted a transmission of unknown origin.\", \"actual_sentence\": \"I have intercepted a transmission of unknown origin.\", \"dependency_grammar\": null, \"actor\": \"computer\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 57}", "syntax_distance_1": null, "syntax_distance_2": null}

As you can see, there is a great deal of information that I do not need. Given my poor knowledge of R, importing it (and the rest of the file it is contained within) in this form leads me to the following in R:

The command used for this was:

json <- fromJSON(paste("[", paste(readLines("JSONfile.txt"), collapse = ","), "]"))

Essentially, it picks up syntax_distance_1, syntax_distance_2, semantic_distance_1 and semantic_distance_2, and then lumps all of the turn data into three enormous, unstructured arrays.

What I would like to know is if I can somehow either:

Specify a tree depth that enables R to discern between each of the 'turn' variables

OR

Simply cherry pick the turn$actual_sentence information from the outset and remove all the rest in the import process.

I hope that is enough information. Please let me know if there is anything else I can add to clear it up.


Answers (1)
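
One possible approach, offered as a sketch rather than a definitive answer: in the sample record, the value of each turn_N key is not a nested object but a string that itself contains JSON (note the escaped quotes, \"). That would explain why a single fromJSON call returns each turn as one long, unstructured string. Parsing each turn string a second time gives access to 'actual_sentence'. The helper names make_turn, make_record and extract_sentences below are hypothetical, and the two short records are stand-ins for the real file; with real data you would presumably use lines <- readLines("JSONfile.txt") instead.

```r
library(jsonlite)

# Stand-in data (assumption): build records shaped like the sample above,
# where each turn_N value is a JSON-encoded *string*, not a nested object.
make_turn <- function(text, actor) {
  as.character(toJSON(list(actual_sentence = text, actor = actor),
                      auto_unbox = TRUE))
}
make_record <- function(s1, s2, s3) {
  as.character(toJSON(list(
    semantic_distance_1 = 0.375,
    turn_1 = make_turn(s1, "computer"),
    turn_2 = make_turn(s2, "standard"),
    turn_3 = make_turn(s3, "computer")
  ), auto_unbox = TRUE))
}
lines <- c(
  make_record("I have intercepted a transmission of unknown origin.",
              "A transmission?",
              "A voice transmission."),
  make_record("Hello.", "Hi.", "Goodbye.")
)

# Parse one line: first the outer record, then each embedded turn string.
extract_sentences <- function(line) {
  record <- fromJSON(line)
  turns <- c("turn_1", "turn_2", "turn_3")
  vapply(turns, function(turn) {
    fromJSON(record[[turn]])[["actual_sentence"]]
  }, character(1))
}

# One row per conversation, one column per turn.
sentences <- as.data.frame(do.call(rbind, lapply(lines, extract_sentences)),
                           stringsAsFactors = FALSE)
print(sentences)
```

This keeps only the three sentences per conversation and discards everything else at import time. For a very large file, the same function could be applied to the lines in chunks rather than all at once.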
