Naffi

Reputation: 730

Python: Remove broken URL from text

I have a file with a lot of broken URLs. By broken I mean, the URLs have SPACEs at random places. For example,

I like soccer. Watch this. https:// m.facebook.com/story.php?stor y_fbid=101595031&id=831030 I also like football.

See the spaces before m.facebook.com and before y_fbid.

There is no pattern in the placement of the spaces. They are random.

Is there any way to clean/remove these broken URLs from the whole text file; preferably using Python?

For the above example, the preferred output would be:

I like soccer. I also like football.

Upvotes: 0

Views: 200

Answers (2)

Jean Carlo Machado

Reputation: 1618

The simplest shell solution I can think of is to use grep to drop every line that contains a space.

cat /tmp/bokenURLsFile | grep -v " "  > /tmp/validURLsOnly

If you're not deploying your "URL cleansing" anywhere, this seems the best way to go.
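Since the question asks for Python, the same line filter can be sketched there. This is only an equivalent of the grep command under the same assumption it makes, namely that each URL sits on its own line; the function name is hypothetical.

```python
def drop_lines_with_spaces(lines):
    """Keep only the lines that contain no space, like `grep -v " "`."""
    return [line for line in lines if " " not in line]


if __name__ == "__main__":
    sample = ["http://example.com/a\n", "http:// example.com/b\n"]
    # Only the first line has no space, so only it is kept.
    print(drop_lines_with_spaces(sample))
```

Like the grep command, this drops whole lines, so it only helps when the file holds one URL per line rather than URLs embedded in sentences.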

Upvotes: 1

ShpielMeister

Reputation: 1455

Using *nix, you can easily remove blanks from the lines in a file named fred:

cat fred | tr -d ' ' > newfred

It would be difficult to remove the URL, since there is no rule that specifies its end. It would be easy to delete the URL up to the first blank, by using something like:

sed 's/http[^ ]* //'

Your best shot at removing exactly the URL with embedded blanks would be to know how the file you are processing is generated and, if possible, to intercept the problem earlier.
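If the source of the file cannot be fixed, a heuristic in Python may get close. The following is a minimal sketch, not a definitive solution. It assumes a broken URL starts with a token beginning with http:// or https://, and that every following token containing a digit or one of a few URL-like characters still belongs to it. The function names and the character set are hypothetical choices, and a URL fragment that happens to look like a plain word would end the match too early.

```python
# Assumption: these characters rarely appear inside ordinary words,
# so a token containing one is treated as part of the broken URL.
URL_CHARS = set("/?=&_%#:.")


def looks_like_url_fragment(token):
    """Guess whether a token is a piece of a broken URL."""
    # Ignore trailing sentence punctuation such as the "." in "football."
    core = token.rstrip(".,;!?")
    return any(ch in URL_CHARS or ch.isdigit() for ch in core)


def remove_broken_urls(text):
    """Drop a URL start token and the URL-like tokens that follow it."""
    kept = []
    in_url = False
    for token in text.split():
        if token.startswith(("http://", "https://")):
            in_url = True  # start of a (possibly broken) URL
            continue
        if in_url and looks_like_url_fragment(token):
            continue  # still inside the broken URL
        in_url = False  # first plain-looking word ends the URL
        kept.append(token)
    return " ".join(kept)


if __name__ == "__main__":
    line = ("I like soccer. Watch this. https:// m.facebook.com/story.php?stor "
            "y_fbid=101595031&id=831030 I also like football.")
    print(remove_broken_urls(line))
```

On the example from the question this prints "I like soccer. Watch this. I also like football." The sentence "Watch this." stays, because nothing marks it as part of the URL. Note that split() collapses whitespace, so the function is best applied one line at a time to preserve line breaks.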

Upvotes: 0
