Reputation: 1293
I'm trying to build a self-contained Jupyter notebook that parses a long address string into a pandas DataFrame for demonstration purposes. Currently I'm having to highlight the entire string and use pd.read_clipboard:
data = pd.read_clipboard(f,
                         comment='#',
                         header=None,
                         names=['address']).values.reshape(-1, 2)
matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])
I'm wondering if there is an easier way to read the string in directly instead of relying on having something copied to the clipboard. Here are the first few lines of the string for reference:
f = """###################################################################################################
#
# There are 112 matches between the tuples. The Zagat tuple is listed first,
# and then its Fodors pair.
#
###################################################################################################
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310-246-1501 Steakhouses
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310/246-1501 American
########################
Art's Deli 12224 Ventura Blvd. Studio City 91604 818-762-1221 Delis
Art's Delicatessen 12224 Ventura Blvd. Studio City 91604 818/762-1221 American
########################
Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 90077 310-472-1211 Californian
Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 90077 310/472-1211 Californian
########################
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818-788-3536 French Bistro
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French
########################"""
Does anybody have any tips on how to parse this string directly into a pandas DataFrame?
I realise there is another question that addresses this, "Create Pandas DataFrame from a string", but the string there is delimited by semicolons and is formatted quite differently from my example.
Upvotes: 1
Views: 636
Reputation: 670
You should add an example of what your output should look like, but generally I would suggest something like this:
import pandas as pd
import numpy as np

# read file, split into lines (the with-block closes the file afterwards)
with open("./your_file.txt", "r") as fh:
    lines = fh.read().split('\n')

accumulator = []
# loop through lines
for line in lines:
    # define criteria for selecting lines: skips blank lines and '#' comment lines
    if len(line) > 1 and line[0].isupper():
        # define criteria for splitting the line
        # get name: everything before the first digit (the street number)
        first_digit = next((i for i, c in enumerate(line) if c.isdigit()), None)
        if first_digit is None:
            continue  # no street number, so the line cannot be split this way
        name = line[:first_digit].strip()
        # split the last two words off the rest of the line
        parts = line[first_digit:].rsplit(None, 2)
        if len(parts) != 3:
            continue  # not enough fields left over
        # remainder is the address, then the phone number, then the restaurant type
        # (a multi-word type such as "French Bistro" would need a different rule)
        address, number, rest_type = parts
        accumulator.append([name, rest_type, number, address])

# turn accumulator into numpy array, pass with column index to DataFrame constructor
df = pd.DataFrame(np.asarray(accumulator), columns=['name', 'restaurant_type', 'phone_number', 'address'])
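If the text is already held in a string, as in the question, the file read (and the clipboard) can probably be skipped altogether. Here is a minimal sketch, assuming the string always alternates one Zagat line and one Fodors line, with every other line either blank or starting with '#'. The `f` below is a shortened stand-in for the question's string, and `records` and `pairs` are just illustrative names:

```python
import pandas as pd

# shortened stand-in for the question's string `f`
f = """########################
#
# The Zagat tuple is listed first, and then its Fodors pair.
#
########################
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310-246-1501 Steakhouses
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310/246-1501 American
########################
Art's Deli 12224 Ventura Blvd. Studio City 91604 818-762-1221 Delis
Art's Delicatessen 12224 Ventura Blvd. Studio City 91604 818/762-1221 American
########################"""

# keep only non-empty lines that are not '#' comments or separators
records = [ln.strip() for ln in f.splitlines()
           if ln.strip() and not ln.lstrip().startswith('#')]

# assumption: records alternate Zagat, Fodors, Zagat, Fodors, ...
pairs = list(zip(records[0::2], records[1::2]))

matched_address = pd.DataFrame(pairs, columns=['addr_zagat', 'addr_fodor'])
print(matched_address)
```

This gives the same two-column layout as the reshape(-1, 2) in the question. Note that zip silently drops a trailing unpaired line, so it may be worth checking that len(records) is even first.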
Upvotes: 1