Reputation: 93
I am building a LinkedIn scraper to collect basic information about companies.
I have a text file containing a list of company names. I read it and run a Google search for each one (linkedin.com + company name), keeping the first link returned.
I store all the links in a list. The problem is that some of the company names are in other languages, so I end up with LinkedIn profile URLs and some non-LinkedIn links as well.
My list looks like this:
['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo',
'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3',
'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas',
'https://www.linkedin.com/in/javier-torres-camargo-b983443a',
'https://in.linkedin.com/company/sas',
'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s',
'https://www.linkedin.com/company/henkel_2',
'https://in.linkedin.com/company/sas',
'https://www.linkedin.com/company/quimica-vulcano-s-a',
'https://in.linkedin.com/company/sas',
'https://www.linkedin.com/company/ismocol-de-colombia-s-a-',
'https://in.linkedin.com/company/sas',
'https://www.facebook.com/IMCTCajica/',....
As you can see, the list mixes company links with other links. I want to keep only the links that contain
"linkedin.com/company"
Is there a way to do this, or a better approach that yields as many such links as possible?
Upvotes: 0
Views: 82
Reputation: 12015
Use a list comprehension to filter out the unwanted elements:
>>> lst = ['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo', 'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3', 'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas', 'https://www.linkedin.com/in/javier-torres-camargo-b983443a', 'https://in.linkedin.com/company/sas', 'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s', 'https://www.linkedin.com/company/henkel_2', 'https://in.linkedin.com/company/sas', 'https://www.linkedin.com/company/quimica-vulcano-s-a', 'https://in.linkedin.com/company/sas', 'https://www.linkedin.com/company/ismocol-de-colombia-s-a-', 'https://in.linkedin.com/company/sas', 'https://www.facebook.com/IMCTCajica/']
>>> from pprint import pprint
>>> new_lst = [url for url in lst if "linkedin.com/company" in url]
>>> pprint(new_lst)
['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo',
'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas',
'https://in.linkedin.com/company/sas',
'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s',
'https://www.linkedin.com/company/henkel_2',
'https://in.linkedin.com/company/sas',
'https://www.linkedin.com/company/quimica-vulcano-s-a',
'https://in.linkedin.com/company/sas',
'https://www.linkedin.com/company/ismocol-de-colombia-s-a-',
'https://in.linkedin.com/company/sas']
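A plain substring test also accepts any URL that merely contains that text somewhere, for example inside a query string on another site. If that matters, each URL can be parsed first and its host and path checked separately. A minimal sketch, assuming that any linkedin.com subdomain should count; the helper name is_company_url and the shortened sample list are illustrative, not part of the original code:

```python
from urllib.parse import urlparse


def is_company_url(url):
    """Return True if url looks like a LinkedIn company page (hypothetical helper)."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    # Accept linkedin.com itself and any subdomain such as www. or in.
    on_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
    # Company pages live under the /company/ path prefix.
    return on_linkedin and parts.path.startswith("/company/")


# Shortened sample taken from the list in the question.
sample = [
    'https://www.linkedin.com/company/henkel_2',
    'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3',
    'https://www.facebook.com/IMCTCajica/',
]
company_urls = [url for url in sample if is_company_url(url)]
```

For the list in the question this should select the same links as the substring test; the difference only shows up on URLs where the text appears outside the host and path.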
Upvotes: 2
Reputation: 7844
You can also do it using the filter function:
inList = ['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo',
'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3',
'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas',
'https://www.linkedin.com/in/javier-torres-camargo-b983443a',
'https://in.linkedin.com/company/sas',
'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s',
'https://www.linkedin.com/company/henkel_2',
'https://in.linkedin.com/company/sas',
'https://www.linkedin.com/company/quimica-vulcano-s-a',
'https://in.linkedin.com/company/sas',
'https://www.linkedin.com/company/ismocol-de-colombia-s-a-',
'https://in.linkedin.com/company/sas',
'https://www.facebook.com/IMCTCajica/']
link = "linkedin.com/company"
outList = list(filter(lambda elem: link in elem, inList))
for i in outList:
    print(i)
Output:
https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo
https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas
https://in.linkedin.com/company/sas
https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s
https://www.linkedin.com/company/henkel_2
https://in.linkedin.com/company/sas
https://www.linkedin.com/company/quimica-vulcano-s-a
https://in.linkedin.com/company/sas
https://www.linkedin.com/company/ismocol-de-colombia-s-a-
https://in.linkedin.com/company/sas
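Note that the filtered output still repeats https://in.linkedin.com/company/sas several times. If duplicates are not wanted, they can be dropped while keeping first-seen order. A small sketch, assuming Python 3.7+ (where dict preserves insertion order); the names out_list and unique_links and the shortened sample are illustrative:

```python
# Shortened sample of the filtered result, including a repeated link.
out_list = [
    'https://in.linkedin.com/company/sas',
    'https://www.linkedin.com/company/henkel_2',
    'https://in.linkedin.com/company/sas',
]

# dict.fromkeys keeps the first occurrence of each key, in order.
unique_links = list(dict.fromkeys(out_list))
```

Using set(out_list) would also remove the repeats, but it does not keep the original order.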
Upvotes: 2