Binx

Reputation: 414

Splitting a list with regex

I am having trouble splitting each element within a nested list. I used this method for my first split, and I want to apply another split to the resulting nested list. I thought I could reuse the same line of code with a few modifications, goal2 = [[j.split("") for j in goal]], but I keep getting the error: 'list' object has no attribute 'split'. I know that you cannot split a list, but I do not understand how my modification differs from the linked method. This is my first web scraping project, and I am looking for just the phone numbers on the website. I'd like help fixing my issue rather than entirely new code, so that I can continue to learn and improve my own methods.

import requests
import re
from bs4 import BeautifulSoup


source = requests.get('https://www.pickyourownchristmastree.org/ORxmasnw.php').text
soup = BeautifulSoup(source, 'lxml')

info = soup.findAll(text=re.compile(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})"))[:1]
goal = [i.split(".") for i in info]
goal2 = [[j.split("") for j in goal]]

for x in goal:
    del x[2:]

for y in goal:
    del y[:1]



print('info:', info)
print('goal:', goal)

Output without goal2 variable:

info: ['89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ']
goal: [[' Phone: 503-325-9720']]

Desired Output with "goal2" variable:

info: ['89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ']
goal: [[' Phone: 503-325-9720']]
goal2: ['503-325-9720']

I will obviously have more numbers, but I didn't want to clog up the space. So it would look something more like this:

goal2: ['503-325-9720', '###-###-####', '###-###-####', '###-###-####']

I also want to make sure that each number can be exported into a new row within a csv file. So when I create a csv file with a header "Phone", each number above should be in a separate row and not clustered together. I am thinking that I might need to change my code to a for loop?

Upvotes: 1

Views: 93

Answers (1)

r.ook

Reputation: 13888

The cleaner approach here would be to just do another regex search on your info, e.g.:

pat = re.compile(r'\d{3}\-\d{3}\-\d{4}')
goal = [pat.search(i).group() for i in info if pat.search(i)]

Outputs:

goal: ['503-325-9720']
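
As for why the original goal2 line fails: after the first split, goal is a list of lists, so each j in the comprehension is itself a list, and a list has no split method. Below is a minimal sketch of one way to keep the split-based approach. It assumes the sample string from the question and assumes that ": " is a suitable separator; note that split("") would raise a ValueError (empty separator) even when called on a string.

```python
# Sample data copied from the question's output, so no network access is needed.
info = ['89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ']

# First split, as in the question: each string becomes a list of pieces,
# so goal is a list of lists (each element of goal is a list, not a str).
goal = [i.split(".") for i in info]

# Keep only the piece at index 1, which is what the two del loops achieve.
goal = [pieces[1:2] for pieces in goal]

# Second split: iterate over the inner lists as well, so split() is called
# on strings. The ": " separator is an assumption based on the sample text.
goal2 = [piece.split(": ")[1] for pieces in goal for piece in pieces]

print('goal:', goal)
print('goal2:', goal2)
```

The key difference from the first split is the extra "for piece in pieces" clause, which reaches the strings inside each inner list. This relies on every entry having the same "Phone: " layout, which is why the regex approach is likely more robust.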

Or, if a line can contain more than one number, use findall:

# use a capturing group instead
pat = re.compile(r'(\d{3}\-\d{3}\-\d{4})')
goal = [pat.findall(i) for i in info]

Outputs:

goal: [['503-325-9720', '123-456-7890']]
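
To get each number into its own row under a "Phone" header, the nested findall result can be flattened and passed to the csv module. This is only a sketch: the second listing in info and the write_phones helper are hypothetical names made up for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io
import re
from itertools import chain

pat = re.compile(r'\d{3}-\d{3}-\d{4}')

# First entry is from the question; the second is a made-up placeholder.
info = [
    '89426 Green Mountain Road, Astoria, OR 97103. Phone: 503-325-9720. Open: ',
    'Hypothetical second listing. Phone: 123-456-7890.',
]

# findall returns one list per line, so flatten the lists into a single list.
nested = [pat.findall(i) for i in info]
phones = list(chain.from_iterable(nested))


def write_phones(phones, fileobj):
    """Write a 'Phone' header followed by one number per row (hypothetical helper)."""
    writer = csv.writer(fileobj)
    writer.writerow(['Phone'])
    # Each row must be a sequence, so wrap every number in a one-item list.
    writer.writerows([p] for p in phones)


# An in-memory buffer is used here instead of a file on disk.
buffer = io.StringIO()
write_phones(phones, buffer)
print(buffer.getvalue())
```

To write an actual file, the buffer could be swapped for something like open('phones.csv', 'w', newline=''), where the filename is just an example; the csv documentation recommends newline='' when opening files for the writer.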

Upvotes: 1
