Sonicarrow
Sonicarrow

Reputation: 73

Creating a new row whenever a comma appears in the column

I'm trying to create a mini program that will calculate the closest, open restaurant closest to my location. I have a dataset that includes restaurant names, locations, stars, and hours. However, there is a problem: Sometimes a restaurant will have multiple open/close times in a day.

For example:

Name, location, type, and hours

Blue Duck Tavern, 1201 24th St NW, American Restaurant, 6:30-10:30AM, 11:30AM-2PM,5:30-10:30PM

I'm trying to get the data into a CSV, but for restaurants with multiple hours (like in the example), it can't properly parse it.

The easiest solution for this would (I think) create another line with the same information, but the next set of hours. So, the example would then read:

Blue Duck Tavern, 1201 24th St NW, American Restaurant, 6:30-10:30AM

Blue Duck Tavern, 1201 24th St NW, American Restaurant, 11:30AM-2PM

Blue Duck Tavern, 1201 24th St NW, American Restaurant, 5:30-10:30PM

So the program wouldn't show the restaurant if it wasn't open.

So I have three general questions. 1) Is there a better way to go about this than the solution I mentioned above (creating a new row for every iteration of multiple open/close hours) 2) Below, I'm having trouble with the following implementation:

import pandas as pd
import numpy as np

data = pd.import_csv(data.csv)
for row in data: 
    if data['hours'].str.contains(',') == 'True':
        count = data['hours'].str.count(',')
        data.append.. 
        <create new row with Name[row], location[row], type[row], and hours[row] for the # of count>

I've tried google-ing around, and I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So I tried to switch it up to:

if data['Monday'].any('Monday').str.contains(',') == 'True': 

which results in: ValueError: No axis named Monday for object type

And I'm a bit unclear on the next steps here, or what I'm doing wrong, because if I just do:

print data[data['Monday'].astype(str).str.contains(',')]

It works and returns the result. But I can't do any kind of conditional without it throwing an error.

3) I'm also a bit confused on what to do if there are more than one comma in the row.. I have a vague idea, but if you have any hints, I'd love to hear them :)

Thanks for reading!

Upvotes: 0

Views: 135

Answers (2)

sacuL
sacuL

Reputation: 51395

If I understand correctly, you can load the data with a regular expression as the separator, making sure that what precedes the comma is not AM or PM (using a negative lookbehind). You can then use str.split and stack, after setting all the columns that you don't want to modify to the index. For example:

data = pd.read_csv('data.csv', sep='(?<!AM|PM),')
# Get rid of spaces in your column names
data.columns = data.columns.str.strip(' ')

>>> data
               Name          location                  type   hours
0  Blue Duck Tavern   1201 24th St NW   American Restaurant   6:30-10:30AM, 11:30AM-2PM,5:30-10:30PM 


new_data = (data.set_index(['Name', 'location', 'type'])
          .hours.str.split(',', expand=True)
          .stack()
          .reset_index(level=['Name', 'location', 'type']))

>>> new_data
               Name          location                  type              0
0  Blue Duck Tavern   1201 24th St NW   American Restaurant   6:30-10:30AM
1  Blue Duck Tavern   1201 24th St NW   American Restaurant    11:30AM-2PM
2  Blue Duck Tavern   1201 24th St NW   American Restaurant   5:30-10:30PM

Upvotes: 1

Sudheer Kumar R
Sudheer Kumar R

Reputation: 1

try to combine multiple hours with '_' or any other delimiter as mentioned below and take it as a whole.

6:30-10:30AM_11:30AM-2PM_5:30-10:30PM

Blue Duck Tavern, 1201 24th St NW, American Restaurant, 6:30-10:30AM_11:30AM-2PM_5:30-10:30PM

Upvotes: 0

Related Questions