Reputation: 23
I am working on a project for my master's degree in which I would like to scrape LinkedIn. I have run into a problem scraping the education pages of users (e.g. https://www.linkedin.com/in/williamhgates/details/education/).
I would like to scrape all of a user's education entries. In this example, I would like to scrape "Harvard University", which sits under the span with class mr1 hoverable-link-text t-bold, but I can't seem to get to it.
Here's the HTML code from LinkedIn:
<li class="pvs-list__paged-list-item artdeco-list__item pvs-list__item--line-separated " id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0">
<!----><div class="pvs-entity
pvs-entity--padded pvs-list__item--no-padding-when-nested
">
<div>
<a class="optional-action-target-wrapper
display-flex" target="_self" href="https://www.linkedin.com/company/1646/">
<div class="ivm-image-view-model pvs-entity__image ">
<div class="ivm-view-attr__img-wrapper ivm-view-attr__img-wrapper--use-img-tag display-flex
">
<!----> <img width="48" src="https://media-exp1.licdn.com/dms/image/C4E0BAQF5t62bcL0e9g/company-logo_100_100/0/1519855919126?e=1668643200&v=beta&t=BL0HxGNOasVbI3u39HBSL3n7H-yYADkJsqS3vafg-Ak" loading="lazy" height="48" alt="Harvard University logo" id="ember59" class="ivm-view-attr__img--centered EntityPhoto-square-3 lazy-image ember-view">
</div>
</div>
</a>
</div>
<div class="display-flex flex-column full-width align-self-center">
<div class="display-flex flex-row justify-space-between">
<a class="optional-action-target-wrapper
display-flex flex-column full-width" target="_self" href="https://www.linkedin.com/company/1646/">
<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
<span aria-hidden="true"><!---->Harvard University<!----></span><span class="visually-hidden"><!---->Harvard University<!----></span>
</span>
<!----><!----><!----> </div>
<!----> <span class="t-14 t-normal t-black--light">
<span aria-hidden="true"><!---->1973 - 1975<!----></span><span class="visually-hidden"><!---->1973 - 1975<!----></span>
</span>
<!----> </a>
<!---->
<div class="pvs-entity__action-container">
<!----> </div>
</div>
<div class="pvs-list__outer-container">
<!----> <ul class="pvs-list
">
<li class=" ">
<div class="pvs-list__outer-container">
<!----><!----><!----></div>
</li>
</ul>
<!----></div>
</div>
</div>
</li>
I have tried the following code:
education = driver.find_element("xpath", '//*[@id="profilePagedListComponent-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-EDUCATION-VIEW-DETAILS-profile-ACoAAA8BYqEBCGLg-vT-ca6mMEqkpp9nVffJ3hc-NONE-da-DK-0"]/div/div[2]/div[1]/a/div/span/span[1]/').text
print(education)
I keep getting the error:
Message: no such element: Unable to locate element:
Can anybody help? I would love to have a script that loops through the education entries and saves the place of education and the years for each one.
Upvotes: 1
Views: 1053
Reputation: 23
Thank you everyone!
I ended up with the code below, which worked.
get_education_school = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span[1]")))]
get_education_years = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 't-14 t-normal t-black--light')]//span[1]")))]
results_education_school = []
results_education_years = []
for i, j in zip(get_education_school, get_education_years):
    results_education_school.append(i)
    results_education_years.append(j)
print(results_education_school)
print(results_education_years)
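One caveat with collecting schools and years as two separate lists: if an entry has no dates, the lists can drift out of step and zip may pair a school with the wrong years. Below is a minimal sketch of a per-item alternative, assuming the markup matches the HTML in the question. The function name scrape_education and the dictionary keys are hypothetical. Locators are passed as plain strings ("xpath"), as in the question's find_element("xpath", ...) call, and the page is assumed to be loaded already (for example after a WebDriverWait).

```python
def scrape_education(driver):
    """Return one dict per education entry, keeping school and years together.

    Assumption: the education page is already loaded and its markup
    matches the HTML snippet shown in the question.
    """
    # Top-level entries only; the class name comes from the question's HTML.
    items = driver.find_elements(
        "xpath", "//li[contains(@class, 'pvs-list__paged-list-item')]"
    )
    results = []
    for item in items:
        # Relative XPaths (leading '.') search inside this <li> only.
        school = item.find_elements(
            "xpath",
            ".//span[contains(@class, 'hoverable-link-text')]"
            "/span[@aria-hidden='true']",
        )
        years = item.find_elements(
            "xpath",
            ".//span[contains(@class, 't-black--light')]"
            "/span[@aria-hidden='true']",
        )
        # find_elements returns an empty list when nothing matches,
        # so a missing field becomes None instead of raising an error.
        results.append(
            {
                "school": school[0].text.strip() if school else None,
                "years": years[0].text.strip() if years else None,
            }
        )
    return results
```

Calling scrape_education(driver) after the wait gives a list such as one dict per school, which can then be written to a file or a DataFrame as needed.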
Upvotes: 1
Reputation: 193108
To extract the text Harvard University, you should ideally induce WebDriverWait for visibility_of_element_located(). You can use either of the following locator strategies:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.pvs-list>li span.hoverable-link-text span"))).text)
Using XPATH:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//ul[@class='pvs-list ']/li//span[contains(@class, 'hoverable-link-text')]//span"))).text)
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
Upvotes: 0
Reputation: 511
@Nadia S., you can try the following code. I have added inline comments to explain each step.
@Test
public void linkedInTest() {
    driver.get("https://www.linkedin.com");
    // You need to enter the credentials for your LinkedIn account below to log in
    driver.findElement(By.id("session_key")).sendKeys("");
    driver.findElement(By.id("session_password")).sendKeys("");
    driver.findElement(By.className("sign-in-form__submit-button")).click();
    driver.get("https://www.linkedin.com/in/williamhgates/details/education/");
    // Wait for the education details to get populated.
    WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(7));
    wait.until(ExpectedConditions.visibilityOfElementLocated(
            By.xpath("//div[@class = 'pvs-list__container']//div[@class = 'scaffold-finite-scroll__content']/ul")));
    // Take all elements showing education details in a list
    List<WebElement> allEducation = driver.findElements(By
            .xpath("//div[@class = 'pvs-list__container']//div[@class = 'scaffold-finite-scroll__content']/ul/li"));
    // Extract details of each education item in the list.
    // Below, the details are printed to the console. You can use a collection to store them.
    for (WebElement oneEducation : allEducation) {
        WebElement education = oneEducation.findElement(
                By.xpath(".//*[contains(@class,\"mr1 hoverable-link-text\")]/span[@aria-hidden='true']"));
        System.out.print("Education - " + education.getText());
        try {
            WebElement educationType = oneEducation
                    .findElement(By.cssSelector(".t-14.t-normal span[aria-hidden='true']"));
            System.out.print(" Education Type - " + educationType.getText());
        } catch (NoSuchElementException e) {
            System.out.print(" Education Type - " + "is Not Specified");
        }
        try {
            WebElement educationYear = oneEducation
                    .findElement(By.cssSelector(".t-14.t-normal.t-black--light span[aria-hidden='true']"));
            System.out.println(" Education Year - " + educationYear.getText());
        } catch (NoSuchElementException e) {
            System.out.println(" Education Year - " + "is Not Specified");
        }
    }
}
Upvotes: 0
Reputation: 11
You can use the properties below to identify the school name list:
ancestorClass="optional-action-target-wrapper display-flex flex-column full-width" class="display-flex align-items-center" tag="DIV"
Use these properties to identify the year list:
ancestorClass="optional-action-target-wrapper display-flex flex-column full-width" class="t-14 t-normal t-black--light" tag="SPAN"
You may use the above info to compose an XPath to locate the list, or, if you don't mind using other Python libraries, there is sample code on GitHub to scrape the school and year.
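As a rough illustration of composing an XPath from those properties, the sketch below builds the expression as a string. The helper names has_class and xpath_from_properties are hypothetical and are not part of Selenium. Each class is matched as a whitespace-separated token instead of comparing the whole attribute, because the class attributes in the question's HTML appear to contain line breaks and trailing spaces, which an exact @class='...' comparison may not tolerate.

```python
def has_class(name):
    """XPath 1.0 predicate: true when `name` is one of the element's classes."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


def xpath_from_properties(ancestor_class, element_class, tag):
    """Build an XPath for <tag> elements carrying every class in
    `element_class`, located under an ancestor carrying every class in
    `ancestor_class`. Both arguments are space-separated class lists."""
    ancestor = " and ".join(has_class(c) for c in ancestor_class.split())
    element = " and ".join(has_class(c) for c in element_class.split())
    return f"//*[{ancestor}]//{tag.lower()}[{element}]"


# Property values copied from the answer above.
ANCESTOR = "optional-action-target-wrapper display-flex flex-column full-width"
school_xpath = xpath_from_properties(
    ANCESTOR, "display-flex align-items-center", "DIV"
)
years_xpath = xpath_from_properties(
    ANCESTOR, "t-14 t-normal t-black--light", "SPAN"
)
```

The resulting strings can then be passed to Selenium, for example driver.find_elements("xpath", school_xpath), and the .text of each returned element read in a loop.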
Upvotes: 0
Reputation: 116
I would first get the list for the education section.
education_list = driver.find_element(By.CSS_SELECTOR, 'ul.pvs-list')
# loop through education_list for place and years
# would recommend relative locators for this task.
# find the image and get the first and second span with text inside of them.
I am adding further details to the code now. Please hold.
Upvotes: 0