Hefe

Reputation: 439

Use a Generator To Convert JSON and TSV Data into a Dictionary

We need to get the data from the file file.data into a DataFrame. The problem is that the data on each line of the file is in either a JSON or Tab-separated values (TSV) format.

The JSON lines are already in the correct format; they just need to be converted to native Python dicts.

The TSV lines need to be converted into dicts that match the JSON format.

Here is a sample of the file:

{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons    Persistent contextually-based standardization   018.666.0600    America/Los_Angeles 492
Ferguson-Garner Multi-layered tertiary neural-net   (086)401-8955x53502 America/Los_Angeles 528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": "+1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}

Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each one in the correct format: a dict with the keys company, catch_phrase, phone, timezone, and client_count.

My code so far:

import json
import pandas as pd

df = pd.read_csv("file.data", sep="\t")
for col in df[["company"]]:
    obj = df[col]
    for item in obj.values:
        json_obj = json.loads(item)

Upvotes: 0

Views: 736

Answers (1)

pho

Reputation: 25489

Don't use pandas to read the entire file. Instead, read the file line by line, and create a list of dicts. Then use pandas to get your dataframe.

import json

dict_data = []
tsv_data = []
with open('file.data', 'r') as f:
    for line in f:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_data.append(d)
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data.append(line.split("\t")) # Split the line by tabs, append to the tsv list

After this, we have

dict_data = [{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': '+1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]

tsv_data = [['Bennett and Sons',
  'Persistent contextually-based standardization',
  '018.666.0600',
  'America/Los_Angeles',
  '492'],
 ['Ferguson-Garner',
  'Multi-layered tertiary neural-net',
  '(086)401-8955x53502',
  'America/Los_Angeles',
  '528']]

Notice that everything in tsv_data is a string, so we're going to have to fix that at some point.

Now, create a dataframe using the two lists dict_data and tsv_data, change the data type for the tsv dataframe, and join the two.

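import pandas as pd

data_cols = list(dict_data[0].keys())
df_dict = pd.DataFrame(dict_data)
df_tsv = pd.DataFrame(tsv_data, columns=data_cols)

# Cast each TSV column to the dtype pandas inferred for the JSON rows
for column in df_tsv:
    df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_all = pd.concat([df_dict, df_tsv], ignore_index=True)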

df_all looks like this:

   company           catch_phrase                                   phone                timezone                      client_count
0  Watkins Inc       Integrated radical installation                7712422719           America/New_York              442
1  Pennington PLC    Future-proofed tertiary frame                  +1-312-296-2956x137  America/Indiana/Indianapolis  638
2  Bennett and Sons  Persistent contextually-based standardization  018.666.0600         America/Los_Angeles           492
3  Ferguson-Garner   Multi-layered tertiary neural-net              (086)401-8955x53502  America/Los_Angeles           528

Applying this to work with a generator function like you originally wanted:

def parse_file(file_iterator):
    dict_keys_types = None

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types 
            # So you can parse the tsv lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
        

Now you can pass a file object, or any other iterator of lines, to this function, and it will yield the dictionaries you want:

with open('file.data', 'r') as f:
    records = list(parse_file(f))

records then contains:

[{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Bennett and Sons',
  'catch_phrase': 'Persistent contextually-based standardization',
  'phone': '018.666.0600',
  'timezone': 'America/Los_Angeles',
  'client_count': 492},
 {'company': 'Ferguson-Garner',
  'catch_phrase': 'Multi-layered tertiary neural-net',
  'phone': '(086)401-8955x53502',
  'timezone': 'America/Los_Angeles',
  'client_count': 528},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': '+1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]
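
Because parse_file only needs an iterator of lines, it can be tried out without touching the disk. The sketch below is an illustration only, not part of the original answer: it repeats the generator so that it runs on its own, and feeds it a made-up two-field sample through io.StringIO (the shortened records are an assumption for brevity). It also shows that nothing is parsed until the generator is consumed.

```python
import io
import json


def parse_file(file_iterator):
    """Inferred-types generator from above, repeated so this sketch runs alone."""
    dict_keys_types = None
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # Remember the keys and value types of the latest JSON dict
            dict_keys_types = [(key, type(value)) for key, value in d.items()]
            yield d
        except json.JSONDecodeError:
            tsv_data = line.split("\t")
            yield {key: dtype(value)
                   for value, (key, dtype) in zip(tsv_data, dict_keys_types)}


# Hypothetical shortened sample standing in for file.data
sample = io.StringIO(
    '{"company": "Watkins Inc", "client_count": 442}\n'
    "Bennett and Sons\t492\n"
)

records = parse_file(sample)  # nothing has been parsed yet
first = next(records)         # parses only the first line
rest = list(records)          # parses the remaining lines
```

Here first is the JSON record and rest holds the converted TSV record, with client_count turned into an int.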


If a TSV line appears before the first JSON dict (for example, as the first line of the file), this raises a TypeError, because dict_keys_types is still None when it is passed to zip(). Instead of inferring the keys and types from the first JSON dict you see, you could either hardcode the keys and data types, or put the TSV lines that come before a dict into a separate list to be parsed later.

Hardcode approach:

def parse_file(file_iterator):
    dict_keys_types = [('company', str),
         ('catch_phrase', str),
         ('phone', str),
         ('timezone', str),
         ('client_count', int)]

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            yield d
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
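
One thing to be aware of with the hardcoded schema: zip() stops at the shorter of its inputs, so a TSV line with too few or too many fields would quietly produce an incomplete dict. A possible guard, sketched here with a hypothetical helper named parse_line_strict and a module-level FIELDS list (both names are assumptions, not part of the answer above), is to check the field count before converting:

```python
import json

# Assumed schema, copied from the hardcoded approach above
FIELDS = [
    ("company", str),
    ("catch_phrase", str),
    ("phone", str),
    ("timezone", str),
    ("client_count", int),
]


def parse_line_strict(line):
    """Hypothetical helper: parse one line, rejecting malformed TSV rows."""
    line = line.strip()
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        values = line.split("\t")
        if len(values) != len(FIELDS):
            # Without this check, zip() would silently drop fields
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}"
            )
        return {key: dtype(value) for value, (key, dtype) in zip(values, FIELDS)}


row = parse_line_strict(
    "Bennett and Sons\tPersistent contextually-based standardization"
    "\t018.666.0600\tAmerica/Los_Angeles\t492\n"
)
```

Whether to raise, skip, or log a bad line depends on how clean the input is expected to be; raising is simply the choice that is hardest to overlook.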

Save-for-later approach:

def parse_file(file_iterator):
    dict_keys_types = None
    unused_tsv_lines = []
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types 
            # So you can parse the tsv lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            if dict_keys_types: # Check if this is set already
                # If it is, 
                # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
                tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                yield tsv_dict
            else: # Else add to unused_tsv_lines
                unused_tsv_lines.append(tsv_data)

    # After you've finished reading the file, try to reparse the lines
    # you skipped before
    if dict_keys_types: # Before parsing, make sure dict_keys_types was set
        for tsv_data in unused_tsv_lines:
            # With each line, do the same thing as before
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
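
Note that the save-for-later version yields the skipped TSV lines at the very end, so the output order no longer matches the file. If order matters, one possible variation (a sketch under that assumption, with the hypothetical name parse_file_in_order) is to flush the buffered lines as soon as the first JSON dict is seen, just before yielding that dict:

```python
import json


def parse_file_in_order(file_iterator):
    """Hypothetical variant: buffer leading TSV rows, flush them at the first JSON dict."""
    dict_keys_types = None
    pending = []  # TSV rows seen before any JSON line

    def to_dict(values):
        return {key: dtype(value)
                for value, (key, dtype) in zip(values, dict_keys_types)}

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
        except json.JSONDecodeError:
            values = line.split("\t")
            if dict_keys_types is None:
                pending.append(values)  # no schema yet, keep for later
            else:
                yield to_dict(values)
            continue
        if dict_keys_types is None:
            # First JSON dict: learn the schema, then emit the buffered rows
            dict_keys_types = [(key, type(value)) for key, value in d.items()]
            for values in pending:
                yield to_dict(values)
            pending.clear()
        yield d


# Made-up, shortened lines where a TSV row comes first
lines = [
    "Bennett and Sons\t492\n",
    '{"company": "Watkins Inc", "client_count": 442}\n',
    "Ferguson-Garner\t528\n",
]
ordered = list(parse_file_in_order(lines))
```

Unlike the version above, this sketch learns the schema only from the first JSON dict rather than refreshing it on every JSON line. As before, if the file contains no JSON line at all, the buffered TSV rows are never yielded.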
        

Upvotes: 1
