Steven González
Steven González

Reputation: 257

How to create Pandas DF columns based on "keys" within json file?

I have a "json file" with the following objects: combinedRecognizedPhrases recognizedPhrases
json example:

{
  "source": "https://example.com",
  "timestamp": "2021-04-12T19:34:24Z",
  "durationInTicks": 1082400000,
  "duration": "PT1M48.24S",
  "combinedRecognizedPhrases": [
    {
      "channel": 0,
      "lexical": "aaa",
      "itn": "aaa",
      "maskedITN": "aaa",
      "display": "aaa"
    }
  ],
  "recognizedPhrases": [
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 1,
      "offset": "PT2.18S",
      "duration": "PT3.88S",
      "offsetInTicks": 21800000,
      "durationInTicks": 38800000,
      "nBest": [
        {
          "confidence": 0.9306252,
          "lexical": "gracias por llamar",
          "itn": "gracias por llamar",
          "maskedITN": "gracias por llamar",
          "display": "¿Gracias por llamar",
          "words": [
            {
              "word": "gracias",
              "offset": "PT2.18S",
              "duration": "PT0.37S",
              "offsetInTicks": 21800000,
              "durationInTicks": 3700000,
              "confidence": 0.930625
            },
            {
              "word": "por",
              "offset": "PT2.55S",
              "duration": "PT0.18S",
              "offsetInTicks": 25500000,
              "durationInTicks": 1800000,
              "confidence": 0.930625
            },
            {
              "word": "llamar",
              "offset": "PT2.73S",
              "duration": "PT0.22S",
              "offsetInTicks": 27300000,
              "durationInTicks": 2200000,
              "confidence": 0.930625
            }
          ]
        }
      ]
    },
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 2,
      "offset": "PT6.85S",
      "duration": "PT5.63S",
      "offsetInTicks": 68500000,
      "durationInTicks": 56300000,
      "nBest": [
        {
          "confidence": 0.9306253,
          "lexical": "quiero hacer un pago",
          "itn": "quiero hacer un pago",
          "maskedITN": "quiero hacer un pago",
          "display": "quiero hacer un pago"
        }
      ]
    },
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 2,
      "offset": "PT13.29S",
      "duration": "PT3.81S",
      "offsetInTicks": 132900000,
      "durationInTicks": 38100000,
      "nBest": [
        {
          "confidence": 0.93062526,
          "lexical": "no sé bien la cantidad",
          "itn": "no sé bien la cantidad",
          "maskedITN": "no sé bien la cantidad",
          "display": "no sé bien la cantidad"
        }
      ]
    }
  ]
}

In the example the recognizedPhrases object has values from 0 to 2. Each of these values has information describing it:

"recognitionStatus": "Success",
      "channel": 0,
      "speaker": 1,
      "offset": "PT2.18S",
      "duration": "PT3.88S",
      "offsetInTicks": 21800000,
      "durationInTicks": 38800000

There is also an object inside recognizedPhrases called nBest which contains the following information:

"confidence": 0.9306252,
          "lexical": "thank you for calling",
          "itn": "thank you for calling",
          "maskedITN": "thank you for calling",
          "display": "thank you for calling".

I need to organize the information available in each recognizedPhrases/[0] or 1 or 2 etc. /nBest/[1]/display In a DF with a column named "speaker 1" when speaker=1 and another column when "speaker 2".

For example: If recognizedPhrases/[0] object contains "speaker": 1 and recognizedPhrases/[1] also has "speaker": 1 those phrases should be concatenated in the speaker 1 column of my df.

Edit 1: I have already tried the following:

with open('file.json','r') as f:
    j = json.load(f)    
test = pd.json_normalize(j, record_path=['recognizedPhrases'], meta=['source', 'durationInTicks', 'duration'], record_prefix='_')

which give me the following DF:

![5e9cfce03f44bcfe8f6082cbaae6664b.png](:/e17e94b5aff54ba386a03f485bb3218a)

The problem with this DF is that it has one row for each time each speaker speaks. In the example I am using, speaker1 speaks once, and speaker 2 speaks twice, this code would generate 3 rows, and that is not what I am looking for. In addition, what each speaker says is inside the _nBest dictionary and requires additional code to get what was said.

What I am trying to get is a DF where all the information is in a single row. Here is an example of what I am looking for: ![7c416edadf263776eb2f0dfe99e67c07.png](:/ff1745362d5c470c942c927653def6b9)

Comment on @DSteman answer: This approach does one good thing and that is it allows me to separate speakers. However, there are two things I need to improve. First, this approach creates two rows. I need all the information to be in one row. Second, that in the speaker 1 column there is something that speaker 2 said.

![274da23c5bb1e583c9752513fc5ba713.png](:/9f344824ffb3462e82e7d29aae016921)

Third, with this approach there is a lot of necessary information left out (see above the picture of the output I am looking for).

Upvotes: 3

Views: 76

Answers (1)

DSteman
DSteman

Reputation: 1668

If the JSON object is assigned to variable 'x' as a dict, the following generates a df where the columns are the speakers and rows are the phrases.

speakers = [i.get('speaker') for i in x.get('recognizedPhrases')]
phrases = [i.get('nBest')[0].get('display') for i in x.get('recognizedPhrases')]
columns = list(set(speakers))
data = [[x[1] for x in zip(speakers, phrases) if x[0] == i] for i in columns]
df = pd.DataFrame(data, columns=columns)

Upvotes: 1

Related Questions