Reputation: 1
I have been given a rather large corpus of conversational data, from which I need to import the relevant information into R and run some statistical analysis.
The problem is I do not need half the information provided in each entry. Each line in a specific JSON file from the dataset relates to a particular conversation of the nature A->B->A. The attributes provided are contained within a nested array for each of the respective statements in the conversation. This is best illustrated diagrammatically:
What I need is simply to extract the 'actual_sentence' attribute from each turn (turn_1, turn_2, turn_3, i.e. A->B->A) and discard the rest.
So far my efforts have been in vain. I have been using the jsonlite package, which seems to import the JSON fine but lacks the 'tree depth' to distinguish between the specific attributes of each turn.
An example:
The following is an example of one row/record of a provided JSON formatted .txt file:
{"semantic_distance_1": 0.375, "semantic_distance_2": 0.6486486486486487, "turn_2": "{\"sentence\": [\"A\", \"transmission\", \"?\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"AT\", null, \".\"], \"semantic_set\": [\"infection.n.04\", \"vitamin_a.n.01\", \"angstrom.n.01\", \"transmittance.n.01\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"adenine.n.01\", \"a.n.07\", \"a.n.06\", \"deoxyadenosine_monophosphate.n.01\"], \"additional_info\": [], \"original_sentence\": \"A transmission?\", \"actual_sentence\": \"A transmission?\", \"dependency_grammar\": null, \"actor\": \"standard\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 58}", "turn_3": "{\"sentence\": [\"A\", \"voice\", \"transmission\", \".\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"AT\", \"NN\", null, \".\"], \"semantic_set\": [\"vitamin_a.n.01\", \"voice.n.10\", \"voice.n.09\", \"angstrom.n.01\", \"articulation.n.03\", \"deoxyadenosine_monophosphate.n.01\", \"a.n.07\", \"a.n.06\", \"infection.n.04\", \"spokesperson.n.01\", \"transmittance.n.01\", \"voice.n.02\", \"voice.n.03\", \"voice.n.01\", \"voice.n.06\", \"voice.n.07\", \"voice.n.05\", \"voice.v.02\", \"voice.v.01\", \"part.n.11\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"adenine.n.01\"], \"additional_info\": [], \"original_sentence\": \"A voice transmission.\", \"actual_sentence\": \"A voice transmission.\", \"dependency_grammar\": null, \"actor\": \"computer\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 59}", "turn_1": "{\"sentence\": [\"I\", \"have\", \"intercepted\", \"a\", \"transmission\", \"of\", \"unknown\", \"origin\", \".\"], \"script_filename\": \"Alien.txt\", \"postag\": [\"PPSS\", \"HV\", \"VBD\", \"AT\", null, \"IN\", \"JJ\", \"NN\", \".\"], \"semantic_set\": [\"i.n.03\", \"own.v.01\", \"receive.v.01\", \"consume.v.02\", \"accept.v.02\", \"rich_person.n.01\", \"vitamin_a.n.01\", \"have.v.09\", 
\"have.v.07\", \"nameless.s.01\", \"have.v.01\", \"obscure.s.04\", \"have.v.02\", \"stranger.n.01\", \"angstrom.n.01\", \"induce.v.02\", \"hold.v.03\", \"wiretap.v.01\", \"give_birth.v.01\", \"a.n.07\", \"a.n.06\", \"deoxyadenosine_monophosphate.n.01\", \"infection.n.04\", \"unknown.n.03\", \"unknown.s.03\", \"get.v.03\", \"origin.n.03\", \"origin.n.02\", \"transmittance.n.01\", \"origin.n.05\", \"origin.n.04\", \"one.s.01\", \"have.v.17\", \"have.v.12\", \"have.v.10\", \"have.v.11\", \"take.v.35\", \"experience.v.03\", \"intercept.v.01\", \"unknown.n.01\", \"iodine.n.01\", \"strange.s.02\", \"suffer.v.02\", \"beginning.n.04\", \"one.n.01\", \"transmission.n.05\", \"transmission.n.02\", \"transmission.n.01\", \"ampere.n.02\", \"lineage.n.01\", \"unknown.a.01\", \"adenine.n.01\"], \"additional_info\": [], \"original_sentence\": \"I have intercepted a transmission of unknown origin.\", \"actual_sentence\": \"I have intercepted a transmission of unknown origin.\", \"dependency_grammar\": null, \"actor\": \"computer\", \"sentence_type\": null, \"ner\": {}, \"turn_in_file\": 57}", "syntax_distance_1": null, "syntax_distance_2": null}
As you can see, there is a great deal of information that I do not need. Given my poor knowledge of R, importing it (and the rest of the file it is contained within) in this form leads me to the following in R:
The command used for this was:
json <- fromJSON(paste("[", paste(readLines("JSONfile.txt"), collapse = ","), "]"))
Essentially it is picking up on syntax_distance_1, syntax_distance_2, semantic_distance_1 and semantic_distance_2, and then lumping all of the turn data into three enormous, unstructured strings (turn_1, turn_2 and turn_3).
What I would like to know is if I can somehow either:
OR
I hope that is enough information; please let me know if there is anything else I can add to clear it up.
Upvotes: 0
Views: 149
Reputation: 380
Since in this case you know that you need to go one level deeper (each turn_x value is itself a JSON-encoded string), you can use one of the apply functions to parse the turn_x strings a second time. The following snippet of code illustrates the basic idea:
library(jsonlite)

# Read the JSON file (here assumed to hold a single record)
json_file <- fromJSON("JSONfile.json")

# Use lapply to parse the turn_x strings, which are themselves JSON.
# Checking that the element is a character string avoids issues with
# numeric values and nulls; those elements come back as NULL.
pjson_file <- lapply(json_file, function(x) {
  if (is.character(x)) fromJSON(x)
})
If we look at the results, we see that the whole data structure has been parsed this time. To access the actual_sentence field, you can do:
> pjson_file$turn_1$actual_sentence
[1] "I have intercepted a transmission of unknown origin."
> pjson_file$turn_2$actual_sentence
[1] "A transmission?"
> pjson_file$turn_3$actual_sentence
[1] "A voice transmission."
If you want to scale this logic so that it works with a large dataset, you can encapsulate it in a function that would return the three sentences as a character vector or a dataframe if you wish.
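As a rough sketch of that idea, something along these lines might work. Note that `extract_sentences` is a hypothetical helper name, and `sample_line` is a trimmed-down record built from the sentences in the question purely for illustration; the sketch also assumes every turn has a non-null actual_sentence.

```r
library(jsonlite)

# Hypothetical helper: takes one raw line of the file (one JSON record)
# and returns the three actual_sentence values as a one-row data frame.
extract_sentences <- function(line) {
  record <- fromJSON(line)
  turns <- c("turn_1", "turn_2", "turn_3")
  # Each turn is a JSON-encoded string, so it needs a second fromJSON call.
  sentences <- vapply(turns, function(turn) {
    fromJSON(record[[turn]])$actual_sentence
  }, character(1))
  as.data.frame(as.list(sentences), stringsAsFactors = FALSE)
}

# Trimmed-down illustrative record (assumption: only the fields used here).
sample_line <- paste0(
  '{"semantic_distance_1": 0.375, ',
  '"turn_1": "{\\"actual_sentence\\": \\"I have intercepted a transmission of unknown origin.\\"}", ',
  '"turn_2": "{\\"actual_sentence\\": \\"A transmission?\\"}", ',
  '"turn_3": "{\\"actual_sentence\\": \\"A voice transmission.\\"}", ',
  '"syntax_distance_1": null}'
)

# For the real file you would instead use something like:
#   lines <- readLines("JSONfile.txt")
#   lines <- lines[nzchar(lines)]   # skip empty lines
lines <- c(sample_line, sample_line)

# One row per conversation, one column per turn.
conversations <- do.call(rbind, lapply(lines, extract_sentences))
```

Parsing line by line like this avoids pasting the whole file into one big array first, and the resulting data frame has the columns turn_1, turn_2 and turn_3 with nothing else carried along.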
Upvotes: 1