user1592211

Reputation: 81

Take URLs from lines of a file in Python

This is a line from a file. I want to extract only the URL after the word uri and the URL after smallPictureUrl so I can use them later, but I cannot find a proper way to do it.

The asterisks stand for text, numbers, or both. They differ on every line that looks like this, so they are not helpful: they follow no pattern I could take advantage of.

{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg", 
"timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__

\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
\",\"width\":180,\"height\":135}}}",
    "subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",

In something simpler, like:

{"displayName":"Jim Test","firstName":"*","lastName":"*"} 

I managed to get the name (for example, Jim Test) after displayName using re.search('(?<="displayName":")(\w+) (\w+)',line), but the other fields are much more complicated. Any direction or advice would be appreciated.

A line looks exactly like this:

{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s200x200/*_*_*_*.jpg","timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.40652557319224},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/*_*_*_a.jpg\",\"width\":180,\"height\":120}}}","subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s100x100/*_*_*_a.jpg","contactId":"**==","contactType":"USER","friendshipStatus":"ARE_FRIENDS","graphApiWriteId":"contact_*:*:*","hugePictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s720x720/*_*_*_*.jpg","profileFbid":"*","isMobilePushable":"NO","lookupKey":null,"name":{"displayName":"* *","firstName":"*","lastName":"*"},"nameSearchTokens":["*","*"],"phones":[],"phoneticName":{"displayName":null,"firstName":null,"lastName":null},"isMemorialized":false,"communicationRank":0.4183731,"canViewerSendGift":false,"canMessage":true}

Upvotes: 0

Views: 137

Answers (3)

Pedro Lobito

Reputation: 98871

# See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
# Python 2 code: the ur'' literal and urllib.urlopen are not available in Python 3.
import re, urllib

GRUBER_URLINTEXT_PAT = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

for line in urllib.urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text"):
    # findall returns one tuple of groups per match; the first group is the whole URL
    print [ mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line) ]

Upvotes: 2

Shan Valleru

Reputation: 3121

If you are not okay with using json, how about this?

>>> print mytext

{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg", 
"timelineCoverPhoto":"{"focus":{"x":0.5,"y":0.49137931034483},"photo":{"__type__

":{"name":"Photo"},"image_lowres":{"uri":"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
","width":180,"height":135}}}",
    "subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",

>>> uri = re.findall(r'uri\"\:\"[\'"]?([^\'" >]+)', mytext) #gets the uri
>>> smallpicurl = re.findall(r'smallPictureUrl\"\:\"[\'"]?([^\'" >]+)', mytext) # gets the smallPictureUrl
>>> ''.join(uri).rstrip()
'https://fbcdn-*-*-*.*.*/*-*-*/*.jpg' # uri
>>> ''.join(smallpicurl).rstrip()
'https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg' # smallPictureUrl
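Note that the patterns above are run against text whose inner quotes are already unescaped. If the line is read straight from the file, the nested quotes appear as \" instead. A minimal sketch of a variant that tolerates both forms follows. The helper name extract_field and the example.com URLs are placeholders invented for illustration, and the pattern assumes the key is not the tail of some longer key.

```python
import re

def extract_field(line, key):
    """Return the first value stored under `key`, or None if the key is absent."""
    # \\? allows an optional backslash before each quote, so the pattern
    # matches both "key":"..." and the escaped form \"key\":\"...\"
    pattern = re.escape(key) + r'\\?"\s*:\s*\\?"([^"\\\s]+)'
    match = re.search(pattern, line)
    return match.group(1) if match else None

# Shortened stand-in for one line of the file (placeholder URLs).
sample = (r'{"timelineCoverPhoto":"{\"photo\":{\"image_lowres\":'
          r'{\"uri\":\"https://example.com/cover.jpg\",\"width\":180}}}",'
          r'"smallPictureUrl":"https://example.com/small.jpg"}')

uri = extract_field(sample, 'uri')
small = extract_field(sample, 'smallPictureUrl')
```

The capture group stops at the first quote, backslash, or whitespace character, so the closing \" of the nested value is not included in the result.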

Upvotes: 1

toppur

Reputation: 1794

The value associated with timelineCoverPhoto seems to be stringified JSON, so you could do something admittedly ugly like this:

import json 
s = {
        "subscribeStatus": "IS_SUBSCRIBED",
        "bigPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
        "timelineCoverPhoto": "{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg \",\"width\":180,\"height\":135}}}",
        "smallPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg" 
    } 

j = json.loads(s.get('timelineCoverPhoto')) 
print "uri:", j.get('photo').get('image_lowres').get('uri')

uri: https://fbcdn-*-*-*.*.*/*-*-*/*.jpg 
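The same idea can be extended to a whole line, assuming each line of the file is a complete JSON object like the full example in the question. This is only a sketch: extract_urls is a hypothetical helper name, and the sample line is built with placeholder example.com URLs rather than real data.

```python
import json

def extract_urls(line):
    """Return (uri, smallPictureUrl) for one JSON line; missing values are None."""
    record = json.loads(line)                # first pass: the outer object
    small = record.get('smallPictureUrl')
    uri = None
    cover_raw = record.get('timelineCoverPhoto')
    if cover_raw:
        cover = json.loads(cover_raw)        # second pass: the stringified JSON
        uri = cover.get('photo', {}).get('image_lowres', {}).get('uri')
    return uri, small

# Build a stand-in line the same way the data appears to be encoded:
# the cover photo object is serialized to a string inside the outer object.
cover = {"photo": {"image_lowres": {"uri": "https://example.com/cover.jpg",
                                    "width": 180, "height": 120}}}
sample_line = json.dumps({"timelineCoverPhoto": json.dumps(cover),
                          "smallPictureUrl": "https://example.com/small.jpg"})

uri, small = extract_urls(sample_line)
```

To process a file, call extract_urls on each line while iterating over the open file object. Lines that are not valid JSON will raise ValueError, which the caller may want to catch.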

Upvotes: 2
