Ayman Alawin

Reputation: 93

Python list: keep only the elements containing particular text and drop the rest

I am building a LinkedIn scraper to collect basic company information from LinkedIn.

I have a text file containing a list of companies. I read it and run a Google search for each one (linkedin.com + company name) to extract the first link.

I store all the links in a list. The problem is that some of the company names are in other languages, so I get LinkedIn profile URLs as well as some non-LinkedIn links.

My list looks like this:

['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo',
 'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3',
 'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas',
 'https://www.linkedin.com/in/javier-torres-camargo-b983443a',
 'https://in.linkedin.com/company/sas',
 'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s',
 'https://www.linkedin.com/company/henkel_2',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/quimica-vulcano-s-a',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/ismocol-de-colombia-s-a-',
 'https://in.linkedin.com/company/sas',
 'https://www.facebook.com/IMCTCajica/',....

As you can see, the list mixes company links with all sorts of other links. I want to keep only the links that contain:

"linkedin.com/company"

Is there a way to do this, or a better approach that yields as many matching links as possible?

Upvotes: 0

Views: 82

Answers (2)

Sunitha

Reputation: 12015

Use a list comprehension to filter out the unwanted elements:

>>> lst = ['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo', 'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3', 'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas', 'https://www.linkedin.com/in/javier-torres-camargo-b983443a', 'https://in.linkedin.com/company/sas', 'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s', 'https://www.linkedin.com/company/henkel_2', 'https://in.linkedin.com/company/sas', 'https://www.linkedin.com/company/quimica-vulcano-s-a', 'https://in.linkedin.com/company/sas', 'https://www.linkedin.com/company/ismocol-de-colombia-s-a-', 'https://in.linkedin.com/company/sas', 'https://www.facebook.com/IMCTCajica/']
>>> from pprint import pprint
>>> new_lst = [url for url in lst if "linkedin.com/company" in url]
>>> pprint(new_lst)
['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo',
 'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas',
 'https://in.linkedin.com/company/sas',
 'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s',
 'https://www.linkedin.com/company/henkel_2',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/quimica-vulcano-s-a',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/ismocol-de-colombia-s-a-',
 'https://in.linkedin.com/company/sas']
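
A plain substring test also matches any URL that merely mentions "linkedin.com/company" somewhere, for example inside a query string. If that matters, one possible stricter variant is to parse each URL and check the host and path separately. This is only a sketch: the helper name is_company_url and the example.com URL are made up for illustration, and it assumes company pages always live under a /company/ path on linkedin.com or one of its subdomains.

```python
from urllib.parse import urlparse


def is_company_url(url):
    """Return True if url is on linkedin.com (or a subdomain) under /company/."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    on_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
    return on_linkedin and parts.path.startswith("/company/")


sample = [
    "https://in.linkedin.com/company/sas",
    "https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3",
    "https://www.facebook.com/IMCTCajica/",
    # Hypothetical URL: contains the substring, but is not a LinkedIn page.
    "https://example.com/search?q=linkedin.com/company/sas",
]

company_urls = [url for url in sample if is_company_url(url)]
print(company_urls)  # ['https://in.linkedin.com/company/sas']
```

The substring check from the answer above would keep the last sample URL as well; the parsed check drops it.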

Upvotes: 2

Vasilis G.

Reputation: 7844

You can also do it using the filter function:

inList = ['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo',
 'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3',
 'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas',
 'https://www.linkedin.com/in/javier-torres-camargo-b983443a',
 'https://in.linkedin.com/company/sas',
 'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s',
 'https://www.linkedin.com/company/henkel_2',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/quimica-vulcano-s-a',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/ismocol-de-colombia-s-a-',
 'https://in.linkedin.com/company/sas',
 'https://www.facebook.com/IMCTCajica/']

link = "linkedin.com/company"
outList = list(filter(lambda elem: link in elem, inList))
for i in outList:
    print(i)

Output:

https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo
https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas
https://in.linkedin.com/company/sas
https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s
https://www.linkedin.com/company/henkel_2
https://in.linkedin.com/company/sas
https://www.linkedin.com/company/quimica-vulcano-s-a
https://in.linkedin.com/company/sas
https://www.linkedin.com/company/ismocol-de-colombia-s-a-
https://in.linkedin.com/company/sas
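
The filtered output still contains https://in.linkedin.com/company/sas four times. If repeated links are not wanted, one way to drop them while keeping the original order is dict.fromkeys. This is a sketch that assumes Python 3.7 or later, where dicts preserve insertion order; the variable names are chosen for illustration.

```python
filtered = [
    "https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo",
    "https://in.linkedin.com/company/sas",
    "https://www.linkedin.com/company/henkel_2",
    "https://in.linkedin.com/company/sas",
    "https://www.linkedin.com/company/quimica-vulcano-s-a",
    "https://in.linkedin.com/company/sas",
]

# dict keys are unique and keep insertion order, so the first
# occurrence of each URL survives and later repeats are dropped.
unique_links = list(dict.fromkeys(filtered))

for link in unique_links:
    print(link)
```

Using set(filtered) would also remove the repeats, but it does not keep the links in their original order.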

Upvotes: 2
