We need to get the data from the file file.data into a DataFrame. The problem is that each line of the file is in either JSON or tab-separated values (TSV) format.
The JSON lines are in the correct format; they just need to be converted to native Python dicts.
The TSV lines need to be converted into dicts that match the JSON format.
Here is a sample of the file:
{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons	Persistent contextually-based standardization	018.666.0600	America/Los_Angeles	492
Ferguson-Garner	Multi-layered tertiary neural-net	(086)401-8955x53502	America/Los_Angeles	528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": "+1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}
Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each value in the correct format: a dict with the keys company, catch_phrase, phone, timezone, and client_count.
My code so far:
df = pd.read_csv("file.data", sep="\t")
for col in df[["company"]]:
    obj = df[col]
    for item in obj.values:
        json_obj = json.loads(item)
Upvotes: 0
Views: 736
Reputation: 25489
Don't use pandas to read the entire file. Instead, read the file line by line, and create a list of dicts. Then use pandas to get your dataframe.
import json

dict_data = []
tsv_data = []
with open('file.data', 'r') as f:
    for line in f:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_data.append(d)
        except json.JSONDecodeError:  # Raised when you try to loads() a TSV line
            tsv_data.append(line.split("\t"))  # Split the line by tabs, append to the tsv list
After this, we have
dict_data = [{'company': 'Watkins Inc',
              'catch_phrase': 'Integrated radical installation',
              'phone': '7712422719',
              'timezone': 'America/New_York',
              'client_count': 442},
             {'company': 'Pennington PLC',
              'catch_phrase': 'Future-proofed tertiary frame',
              'phone': '+1-312-296-2956x137',
              'timezone': 'America/Indiana/Indianapolis',
              'client_count': 638}]
tsv_data = [['Bennett and Sons',
             'Persistent contextually-based standardization',
             '018.666.0600',
             'America/Los_Angeles',
             '492'],
            ['Ferguson-Garner',
             'Multi-layered tertiary neural-net',
             '(086)401-8955x53502',
             'America/Los_Angeles',
             '528']]
Notice that everything in tsv_data is a string, so we're going to have to fix that at some point.
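To make the classification step concrete, here is a minimal sketch that runs the same try/except idea on two in-memory lines. The helper name classify_line and the shortened two-field sample are assumptions made for illustration; they are not part of the original file or code.

```python
import io
import json

# Two sample lines: one JSON, one TSV (fields separated by real tab characters).
# The records are shortened to two fields to keep the example small.
sample = io.StringIO(
    '{"company": "Watkins Inc", "client_count": 442}\n'
    'Bennett and Sons\t492\n'
)

def classify_line(line):
    """Return ('json', dict) or ('tsv', list of str) for one line (hypothetical helper)."""
    line = line.strip()
    try:
        return "json", json.loads(line)
    except json.JSONDecodeError:
        # Not valid JSON, so treat the line as tab-separated values
        return "tsv", line.split("\t")

results = [classify_line(line) for line in sample]
for kind, value in results:
    print(kind, value)
```

The JSON line keeps client_count as an int, while the TSV line gives back the string '492', which is why a type conversion is needed later.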
Now, create a dataframe from each of the two lists dict_data and tsv_data, change the data types of the tsv dataframe, and join the two.
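import pandas as pd

data_cols = list(dict_data[0].keys())
df_dict = pd.DataFrame(dict_data)
df_tsv = pd.DataFrame(tsv_data, columns=data_cols)
for column in df_tsv:
    # Cast each tsv column to the dtype pandas inferred for the JSON data
    df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_all = pd.concat([df_dict, df_tsv], ignore_index=True)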
df_all looks like this:
| | company | catch_phrase | phone | timezone | client_count |
|---|---|---|---|---|---|
| 0 | Watkins Inc | Integrated radical installation | 7712422719 | America/New_York | 442 |
| 1 | Pennington PLC | Future-proofed tertiary frame | +1-312-296-2956x137 | America/Indiana/Indianapolis | 638 |
| 2 | Bennett and Sons | Persistent contextually-based standardization | 018.666.0600 | America/Los_Angeles | 492 |
| 3 | Ferguson-Garner | Multi-layered tertiary neural-net | (086)401-8955x53502 | America/Los_Angeles | 528 |
Applying this to work with a generator function like you originally wanted:
def parse_file(file_iterator):
    dict_keys_types = None
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types
            # so you can parse the tsv lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError:  # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
Now, you can pass a file iterator to this function and it'll yield dictionaries like you want:
list(parse_file(f))
[{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Bennett and Sons',
  'catch_phrase': 'Persistent contextually-based standardization',
  'phone': '018.666.0600',
  'timezone': 'America/Los_Angeles',
  'client_count': 492},
 {'company': 'Ferguson-Garner',
  'catch_phrase': 'Multi-layered tertiary neural-net',
  'phone': '(086)401-8955x53502',
  'timezone': 'America/Los_Angeles',
  'client_count': 528},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': '+1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]
When the first line of the file is not a JSON dict, this will raise an error, because the keys and data types haven't been set yet when the first TSV line is parsed. Instead of inferring the keys and types from the first JSON dict you see, you could either hardcode the keys and data types, or put the TSV lines that come before a dict into a separate list to be parsed later.
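To see that failure concretely, here is a minimal sketch that feeds a TSV-first input to a compact copy of the inference-based generator. The function name parse_lines_inferring_schema and the two-field sample lines are assumptions for illustration only.

```python
import json

def parse_lines_inferring_schema(lines):
    """Compact copy of the inference-based generator, for demonstration only."""
    dict_keys_types = None
    for line in lines:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_keys_types = [(key, type(value)) for key, value in d.items()]
            yield d
        except json.JSONDecodeError:
            fields = line.split("\t")
            # zip() raises TypeError here while dict_keys_types is still None
            yield {key: dtype(value) for value, (key, dtype) in zip(fields, dict_keys_types)}

# The TSV line comes before any JSON line, so no schema is known yet
tsv_first = [
    "Bennett and Sons\t492\n",
    '{"company": "Watkins Inc", "client_count": 442}\n',
]

try:
    list(parse_lines_inferring_schema(tsv_first))
    failed = False
except TypeError as exc:
    failed = True
    print("TypeError:", exc)
```

The error comes from passing None to zip(), not from the JSON parsing itself.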
Hardcode approach:
def parse_file(file_iterator):
    dict_keys_types = [('company', str),
                       ('catch_phrase', str),
                       ('phone', str),
                       ('timezone', str),
                       ('client_count', int)]
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            yield d
        except json.JSONDecodeError:  # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
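One thing to keep in mind with the hardcoded list is that zip() silently stops at the shorter input, so a TSV line with a missing field would produce a dict with missing keys. Below is a minimal sketch of the conversion step in isolation with a field-count check added. The names SCHEMA and tsv_to_dict, and the choice to raise ValueError, are assumptions for illustration rather than part of the code above.

```python
# Hardcoded (key, type) pairs, in the same order as the TSV columns
SCHEMA = [
    ("company", str),
    ("catch_phrase", str),
    ("phone", str),
    ("timezone", str),
    ("client_count", int),
]

def tsv_to_dict(line, schema=SCHEMA):
    """Convert one TSV line to a typed dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    # zip() would silently truncate on a mismatch, so check the count first
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}: {line!r}")
    return {key: dtype(value) for value, (key, dtype) in zip(fields, schema)}

row = tsv_to_dict(
    "Ferguson-Garner\tMulti-layered tertiary neural-net\t"
    "(086)401-8955x53502\tAmerica/Los_Angeles\t528\n"
)
print(row)
```

Whether a malformed line should raise, be skipped, or be logged depends on your data; raising is just the simplest choice for a sketch.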
Save-for-later approach:
def parse_file(file_iterator):
    dict_keys_types = None
    unused_tsv_lines = []
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types
            # so you can parse the tsv lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError:  # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            if dict_keys_types:  # Check if this is set already
                # If it is,
                # iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
                tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                yield tsv_dict
            else:  # Else add to unused_tsv_lines
                unused_tsv_lines.append(tsv_data)
    # After you've finished reading the file, try to reparse the lines
    # you skipped before
    if dict_keys_types:  # Before parsing, make sure dict_keys_types was set
        for tsv_data in unused_tsv_lines:
            # With each line, do the same thing as before
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
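A side effect of the save-for-later approach is that the deferred TSV lines come out at the end rather than in file order. Here is a minimal sketch that shows the resulting order, using a compact copy of the generator. The name parse_deferring_early_tsv and the two-field sample lines are assumptions for illustration.

```python
import json

def parse_deferring_early_tsv(lines):
    """Compact copy of the save-for-later generator, for demonstration only."""
    schema = None
    pending = []  # TSV lines seen before any JSON line
    for line in lines:
        line = line.strip()
        try:
            d = json.loads(line)
            schema = [(key, type(value)) for key, value in d.items()]
            yield d
        except json.JSONDecodeError:
            fields = line.split("\t")
            if schema:
                yield {key: dtype(value) for value, (key, dtype) in zip(fields, schema)}
            else:
                pending.append(fields)
    # Deferred lines are only emitted once the whole input has been read
    if schema:
        for fields in pending:
            yield {key: dtype(value) for value, (key, dtype) in zip(fields, schema)}

lines = [
    "Bennett and Sons\t492\n",
    '{"company": "Watkins Inc", "client_count": 442}\n',
    "Ferguson-Garner\t528\n",
]
companies = [row["company"] for row in parse_deferring_early_tsv(lines)]
print(companies)  # ['Watkins Inc', 'Ferguson-Garner', 'Bennett and Sons']
```

If row order matters, sort afterwards or use the hardcoded approach. Also note that if the input contains no JSON line at all, the deferred lines are never yielded.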
Upvotes: 1