Reputation: 21
I am trying to learn data science with Python on Simplilearn. In the matplotlib section, they scrape the web page used in the code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = "https://www.hubertiming.com/results/2018MLK"  # page to scrape
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
title = soup.title
print(title)
print(title.text)
# print every link on the page
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])
# collect the text of every table cell, one list per row
data = []
allrows = soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
data = data[4:]
print(data[-2:])
And this is the result:
[['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
How can I get rid of the \r\n\r\n? I already tried the "replace" function, and it says "'list' object has no attribute 'replace'". I cannot use strip either.
Upvotes: 1
Views: 437
Reputation: 62393
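The simplest approach is pandas.read_html, which will scrape all the tables into a list of dataframes.
There are pages where .read_html() will not work. If the strings in a column still had leading or trailing whitespace, df.Name = df.Name.str.strip() or df.Name = df.Name.str.replace('\r', '') would work.
import pandas as pd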
url = 'https://www.hubertiming.com/results/2018MLK'
# read the tables
df_list = pd.read_html(url)
# in this case the desired dataframe is at index 1
df = df_list[1]
# display(df.head())
   Place   Bib                     Name  Gender   Age        City  State  Chip Time  Chip Pace  Gender Place  Age Group  Age Group Place  Time to Start  Gun Time
0      1  1191             MAX RANDOLPH       M  29.0  WASHINGTON     DC      16:48       5:25       1 of 78    M 21-39          1 of 33           0:08     16:56
1      2  1080  NEED NAME KAISER RUNNER       M  25.0    PORTLAND     OR      17:31       5:39       2 of 78    M 21-39          2 of 33           0:09     17:40
2      3  1275               DAN FRANEK       M  52.0    PORTLAND     OR      18:15       5:53       3 of 78    M 40-54          1 of 27           0:07     18:22
3      4  1223              PAUL TAYLOR       M  54.0    PORTLAND     OR      18:31       5:58       4 of 78    M 40-54          2 of 27           0:07     18:38
4      5  1245              THEO KINMAN       M  22.0         NaN    NaN      19:31       6:17       5 of 78    M 21-39          3 of 33           0:09     19:40
# output the dataframe as an array, and see the values in the last two lists have no escape codes
data = df.to_numpy()
print(data[-2:])
[out]:
array([[190, 2087, 'LEESHA POSEY', 'F', 43.0, 'PORTLAND', 'OR',
        '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00',
        '1:33:53'],
       [191, 1216, 'ZULMA OCHOA', 'F', 40.0, 'GRESHAM', 'OR', '1:43:27',
        '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']],
      dtype=object)
Upvotes: 1
Reputation: 24049
You only need one change: replace cell.text with cell.text.strip() in your code, like below:
...
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
...
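As a variation on the same idea, BeautifulSoup's get_text() accepts strip=True, which trims each text fragment as it is extracted. Here is a minimal sketch, assuming a small made-up HTML snippet that imitates one row of the table (not the live page) and the built-in "html.parser" instead of lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating one row of the results table.
html = "<table><tr><td>190</td><td>\r\n\r\n LEESHA POSEY\r\n\r\n </td><td>F</td></tr></table>"

soup = BeautifulSoup(html, "html.parser")

data = []
for row in soup.find_all("tr"):
    # get_text(strip=True) removes leading/trailing whitespace, including \r and \n.
    data.append([cell.get_text(strip=True) for cell in row.find_all("td")])

print(data)
```

Either spelling should give the same cleaned cells here; cell.text.strip() is the smaller change to the original loop.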
Upvotes: 1
Reputation: 410
You have a 2D list, so the strip() method has to be applied to each string inside the inner lists, not to the list itself.
Use the code below:
text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
result = [[j.strip() for j in i] for i in text]
print(result)
Output:
[['190', '2087', 'LEESHA POSEY', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00', '1:33:53'], ['191', '1216', 'ZULMA OCHOA', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']]
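This also explains the original error: replace() and strip() are string methods, so calling them on data (a list of lists) raises AttributeError. If the cleanup is needed in more than one place, the comprehension could be wrapped in a small helper. The sketch below is only an illustration; the name clean_rows is hypothetical, and it goes one step further than strip() by also collapsing whitespace inside each cell:

```python
def clean_rows(rows):
    """Return a new 2D list with whitespace trimmed and collapsed in every cell."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, \r, \n, tabs), so joining with one space normalizes the cell.
    return [[" ".join(cell.split()) for cell in row] for row in rows]


# Shortened sample in the same shape as the scraped data.
rows = [['190', '\r\n\r\n LEESHA POSEY\r\n\r\n ', '\r\n\r\n 112 of 113\r\n\r\n ']]
print(clean_rows(rows))
```

The helper returns a new list and leaves the input untouched, which makes it easy to compare before and after.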
Upvotes: 2
Reputation: 1280
text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
print(text)
for i in range(len(text)):
    for j in range(len(text[i])):
        text[i][j] = text[i][j].replace('\r\n', '')
print(text)
Output:
[['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
[['190', '2087', ' LEESHA POSEY ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', ' 112 of 113 ', 'F 40-54', ' 36 of 37 ', '0:00', '1:33:53'], ['191', '1216', ' ZULMA OCHOA ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', ' 113 of 113 ', 'F 40-54', ' 37 of 37 ', '0:00', '1:43:27']]
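Note that the second output still has spaces around the names, because replace('\r\n', '') only removes the line breaks. A minimal sketch of chaining strip() onto the same loop, using a shortened copy of the data as an assumption:

```python
# Shortened sample in the same shape as the scraped data.
text = [['190', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F'],
        ['191', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F']]

for i in range(len(text)):
    for j in range(len(text[i])):
        # Remove the line breaks first, then trim the padding spaces left behind.
        text[i][j] = text[i][j].replace('\r\n', '').strip()

print(text)
```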
Upvotes: 0