user3476463
user3476463

Reputation: 4575

parse json data to pandas column with missing quotes

I have a pandas dataframe "df". the dataframe has a field in it "json_field" that contains json data. I think it's actually yaml format. I'm trying to parse the json values in to their own columns. I have example data below. when I run the code below I'm getting an error when it hits the 'name' field. the code and the error are below. it seems like maybe it's having trouble because the value associated with the name field isn't quoted, or has spaces. does anyone see what the issue might be or suggest how to fix?

example:

print(df['json_field'][0])

output:

- start:"test"
  name: Price Unit
  field: priceunit
  type: number
  operator: between
  value:
  - '40'
  - '60'

code:

import yaml

pd.io.json.json_normalize(yaml.load(df['json_field'][0]), 'name','field','type','operator','value').head()

error:

An error was encountered:
mapping values are not allowed here
  in "<unicode string>", line 3, column 7:
      name: Price Unit
          ^
Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/yaml/__init__.py", line 94, in safe_load
    return load(stream, SafeLoader)
  File "/usr/local/lib64/python3.6/site-packages/yaml/__init__.py", line 72, in load
    return loader.get_single_data()
  File "/usr/local/lib64/python3.6/site-packages/yaml/constructor.py", line 35, in get_single_data
    node = self.get_single_node()
  File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 82, in compose_node
    node = self.compose_sequence_node(anchor)
  File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 110, in compose_sequence_node
    while not self.check_event(SequenceEndEvent):
  File "/usr/local/lib64/python3.6/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/local/lib64/python3.6/site-packages/yaml/parser.py", line 382, in parse_block_sequence_entry
    if self.check_token(BlockEntryToken):
  File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 220, in fetch_more_tokens
    return self.fetch_value()
  File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 580, in fetch_value
    self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
  in "<unicode string>", line 3, column 7:
      name: Price Unit

desired output:

name        field     type   operator value
Price Unit  priceunit number between  - '40' - '60'

Update:

I tried the suggestion below

import yaml

df['json_field']=df['json_field'].str.replace('"test"', "")

pd.io.json.json_normalize(yaml.safe_load(df['json_field'][0].lstrip()), 'name','field','type','operator','value').head()

and got the output below:

  operator0  typefield
0         P  priceunit
1         r  priceunit
2         i  priceunit
3         c  priceunit
4         e  priceunit

Upvotes: 0

Views: 268

Answers (2)

Cameron Riddell
Cameron Riddell

Reputation: 13407

Seems to me that it's having an issue because your yaml is improperly formatted. That "test" should not be there at all if you want a dictionary named "start" with 5 mappings inside of it.

import yaml

a = """
- start:
    name: Price Unit
    field: priceunit
    type: number
    operator: between
    value:
    - '40'
    - '60'
""".lstrip()

# Single dictionary named start, with 5 entries
yaml.safe_load(a)[0]

{'start': 
 {'name': 'Price Unit',
  'field': 'priceunit',
  'type': 'number',
  'operator': 'between',
  'value': ['40', '60']}
}

To do this with your data, try:

data = yaml.load(df['json_field'][0].replace('"test"', ""))
df = pd.json_normalize(yaml.safe_load(a)[0]["start"])

print(df)
         name      field    type operator     value
0  Price Unit  priceunit  number  between  [40, 60]

Upvotes: 1

Valdi_Bo
Valdi_Bo

Reputation: 30971

It seems that your real problem in one line earlier. Note that your YAML fragment starts with start:"test".

One of key principles of YAML is that a key is separated from its value by a colon and a space, whereas your sample does not contain this (mandatory) space.

Maybe you should manually edit your input, i.e. add this missing space.

Other solution is to write a "specialized pre-parser", which adds such missing spaces and its result is fed into the YAML parser. This can be an option if the number of such cases is big.

Upvotes: 0

Related Questions