Reputation: 35
I have this code to scrape tagged users' IDs from media on Twitter:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import csv
import re
# Create a new instance of the Firefox driver
driver = webdriver.Firefox()
# go to page
driver.get("http://twitter.com/RussiaUN/media")
#You can adjust it but this works fine
SCROLL_PAUSE_TIME = 2
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
# Now that the page is fully scrolled, grab the source code.
src = driver.page_source
# Parse it with BeautifulSoup
soup = BeautifulSoup(src, 'html.parser')
#divs = soup.find_all('div',class_='account')
divs = soup.find_all('div', {"data-user-id" : re.compile(r".*")})
#PRINT RESULT
#print('printing results')
#for div in divs:
# print(div['data-user-id'])
#SAVE IN FILE
print('Saving results')
#with open('file2.csv','w') as f:
# for div in divs:
# f.write(div['data-user-id']+'\n')
with open('file.csv','w', newline='') as f:
    writer = csv.writer(f)
    for div in divs:
        writer.writerow([div['data-user-id']])
-But I would also like to scrape the usernames and then organise all this data in a CSV with an IDS column and a USERNAMES column.
So my guess is that I have to modify this piece of code first:
divs = soup.find_all('div', {"data-user-id" : re.compile(r".*")})
But I can't find a way to achieve that...
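To make it concrete, the sort of thing I'm after would be something like this, but I don't know whether the username is actually exposed as an attribute on those divs (data-screen-name below is just a guess on my part):
with open('file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['IDS', 'USERNAMES'])
    for div in divs:
        # 'data-screen-name' is only my guess at the attribute holding the username
        writer.writerow([div['data-user-id'], div.get('data-screen-name', '')])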
-Then I also have a problem with duplicates. As you can see in the code there are two ways to scrape the data:
1 #divs = soup.find_all('div',class_='account')
2 divs = soup.find_all('div', {"data-user-id" : re.compile(r".*")})
The first one seemed to work but was not efficient enough. Number 2 works fine but seems to give me duplicates in the end, as it goes through all the divs and not only those with class_='account'.
I'm sorry if some feel that I'm a bit spammy here, as I posted 3 questions in 24h... And thanks to those who helped and will be helping.
Upvotes: 1
Views: 906
Reputation: 8255
Python has an inbuilt csv module for writing csv files.
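For reference, the core pattern is just this (file name and rows here are only illustrative):
import csv

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "USERNAME"])           # header row
    writer.writerows([["255493944", "MID_RF"]])   # any iterable of rows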
Also, the scroll script you used did not seem to work: it was not scrolling all the way down and stopped after a certain amount of time. I only got ~1400 records in the csv file with your script, so I have replaced it with the Page Down key. You may want to tweak no_of_pagedowns to control how far you scroll down. Even with 200 pagedowns I got ~2200 records. Note that this number is without removing the duplicates.
I have added some additional modifications to write only the unique data to file.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv
driver = webdriver.Firefox()
driver.get("http://twitter.com/RussiaUN/media")
time.sleep(1)
elem = driver.find_element_by_tag_name("html")
no_of_pagedowns = 200
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    no_of_pagedowns -= 1
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div',class_='account')
all_data = []
# keep only unique (id, username) pairs
for div in divs:
    single = [div['data-user-id'], div['data-screen-name']]
    if single not in all_data:
        all_data.append(single)
with open('file.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=",")
    # headers
    writer.writerow(["ID", "USERNAME"])
    writer.writerows(all_data)
Output
ID,USERNAME
255493944,MID_RF
2230446228,Rus_Emb_Sudan
1024596885661802496,ambrus_drc
2905424987,Russie_au_Congo
2174261359,RusEmbUganda
285532415,tass_agency
34200559,rianru
40807205,kpru
177502586,nezavisimaya_g
23936177,vzglyad
255471924,mfa_russia
453639812,pass_blue
...
If you want the duplicates, just remove the if condition:
for div in divs:
    single = [div['data-user-id'], div['data-screen-name']]
    all_data.append(single)
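As a side note, checking single not in all_data is a linear scan, so with many rows it can get slow. One possible variant (not what I used above, just a sketch) is to track the pairs already seen in a set:
seen = set()
all_data = []
for div in divs:
    single = (div['data-user-id'], div['data-screen-name'])
    if single not in seen:        # set lookup is O(1) on average
        seen.add(single)
        all_data.append(list(single))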
Upvotes: 1