Reputation: 730
I have a file with a lot of broken URLs. By broken I mean the URLs contain spaces at random places. For example,
I like soccer. Watch this. https:// m.facebook.com/story.php?stor y_fbid=101595031&id=831030 I also like football.
See the spaces before m.facebook.com and before y_fbid.
There is no pattern in the placement of the spaces. They are random.
Is there any way to clean/remove these broken URLs from the whole text file, preferably using Python?
For the above example, the preferred output would be:
I like soccer. I also like football.
Upvotes: 0
Views: 200
Reputation: 1618
The simplest shell solution I can think of is to use grep to remove every line that contains a space:
grep -v " " /tmp/brokenURLsFile > /tmp/validURLsOnly
If you're not deploying this "URL cleansing" as part of a larger program, that seems the best way to go.
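Since the question asks for Python, a minimal sketch of the same line-filtering idea (the input lines below are illustrative, not from the original file) would be:

```python
# Keep only lines with no spaces, mirroring `grep -v " "`.
# These sample lines are made up for illustration.
lines = [
    "https://example.com/ok\n",
    "https:// m.facebook.com/bro ken\n",
]
valid = [line for line in lines if " " not in line]

# In practice, read from and write to files instead
# (file names here are just the answer's example names):
# with open("/tmp/brokenURLsFile") as src, open("/tmp/validURLsOnly", "w") as dst:
#     dst.writelines(line for line in src if " " not in line)
```

Note this drops the whole line, so it only works if each line is supposed to be a single URL; for URLs embedded in prose (as in the question's example) it would discard the surrounding text too.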
Upvotes: 1
Reputation: 1455
On *nix you can easily remove all blanks from the lines of a file fred:
tr -d ' ' < fred > newfred
It would be difficult to remove the URL entirely, since there is no rule that marks its end. It would be easy to delete the URL up to the first blank, using something like:
sed 's/http[^ ]* //'
(note that 'http.* ' would not work here: .* is greedy and would delete everything up to the last blank on the line)
Your best shot at removing exactly the URL with its embedded blanks would be to find out how the file you are processing is generated and, if possible, intercept the problem earlier.
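Failing that, here is one heuristic sketch in Python: after an http(s):// prefix, keep consuming space-separated tokens as long as they contain URL-ish characters. The character set (., /, =, ?, &, #, %) is an assumption about this data, not a general rule, and the sketch removes only the URL itself, so surrounding prose like "Watch this." stays in place (the question's preferred output drops that too):

```python
import re

# Heuristic: match "http(s)://", then keep absorbing space-separated
# tokens while they contain URL-ish characters (assumed set: . / = ? & # %).
URL_RE = re.compile(r"https?://\S*(?:\s+\S*[./=?&#%]\S*)*")

def strip_broken_urls(text):
    cleaned = URL_RE.sub("", text)
    # Collapse the double spaces left behind by the removal.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

sample = ("I like soccer. Watch this. https:// m.facebook.com/"
          "story.php?stor y_fbid=101595031&id=831030 I also like football.")
print(strip_broken_urls(sample))
# → I like soccer. Watch this. I also like football.
```

A plain word like "I" or "also" breaks the chain because it contains none of the assumed URL characters, so the match stops there; tune the character class to your actual data.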
Upvotes: 0