Python: scraping multiple strings with different conditions

Question

My text looks like this:

Salman Khan (pronunciation born Abdul Rashid Salim Salman Khan on 27 December 1965)[3] is an Indian film actor, producer, television presenter, and philanthropist known for his Hindi films. He is the son of actor and screenwriter Salim Khan. Khan began his acting career with Biwi Ho To Aisi but it was his second film Maine Pyar Kiya(1989), in which he acted in a lead role, that garnered him the Filmfare Award for Best Male Debut. Khan has starred in several commercially successful films, such as Saajan (1991), Hum Aapke Hain Koun..! (1994), Karan Arjun (1995),Judwaa (1997), Pyar Kiya To Darna Kya (1998), Biwi No.1 (1999), and Hum Saath Saath Hain (1999), having appeared in the highest grossing film nine separate years during his career, a record that remains unbroken.[4]

What I want to do is:

1. Get each ID together with its string.
2. Get only those IDs that have a REF. The result should give the ID string and the REF string. With the ID and REF numbers, the referenced string can then be looked up from result 1 using a map (dictionary).

I tried it this way:

def doit(text):      
  import re
  matches=re.findall(r'\>(.+?)\<',text)
  # matches is now ['String 1', 'String 2', 'String3']
  return ",".join(matches)
print doit(string)

which returns all the strings individually.

To scrape each ID, I then did this:

def doit(text):      
    import re
    #matches = re.findall((?<="ID=")(.*)(?=""))
    matches = re.findall(r'ID=\"(\d+)', text)
    return ",".join(matches)

print doit(string)

This is meant to scrape the content between ID=" and ", i.e. the ID number, but it gives an error:

SyntaxError: invalid syntax

What am I doing wrong? Is there a better alternative?

UPDATE:

string = "Salman Khan (pronunciation born Abdul Rashid Salim Salman Khan on 27 December 1965)[3] is an Indian film actor, producer, television presenter, and philanthropist known for his Hindi films. He is the son of actor and screenwriter Salim Khan. Khan began his"

def doit(text):      
    import re
    #matches = re.findall((?<="ID=")(.*)(?=""))
    matches = re.findall(r'ID=\"(\d+)', text)
    return ",".join(matches)

print doit(string)
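
One possible approach, as a rough sketch only. The text pasted above no longer shows its markup, so the `<span>` tag name and the ID/REF values in `sample` are hypothetical, inferred solely from the `ID=\"(\d+)` pattern in the attempt. If the real string contains `ID="..."` attributes and is written inside a double-quoted literal, the inner quotes end the literal early, which is one possible cause of the SyntaxError; single quotes around the literal avoid that. The sketch uses Python 3 `print()`, and the helper name `extract` is made up for illustration.

```python
import re

# Hypothetical sample markup (assumption): tag name, IDs and REFs are invented.
# Single-quoted literals let the attribute values keep their double quotes.
sample = (
    '<span ID="1">Salman Khan</span> is an Indian film actor. '
    '<span ID="2" REF="1">He</span> is the son of '
    '<span ID="3">Salim Khan</span>. '
    '<span ID="4" REF="1">Khan</span> began his acting career.'
)

# Groups: tag name, ID number, optional REF number, enclosed text.
# Assumes ID comes before REF and that tagged spans are not nested.
TAG = re.compile(
    r'<(\w+)\s+ID="(\d+)"(?:\s+REF="(\d+)")?\s*>(.*?)</\1>',
    re.DOTALL,
)


def extract(text):
    """Return (id -> string map, list of entries that carry a REF)."""
    id_to_text = {}
    with_ref = []
    for _tag, elem_id, ref_id, body in TAG.findall(text):
        id_to_text[elem_id] = body          # result 1: every ID with its string
        if ref_id:                          # findall gives '' when REF is absent
            with_ref.append((elem_id, body, ref_id))
    # Result 2: resolve each REF number through the map built above.
    resolved = [
        (elem_id, body, ref_id, id_to_text.get(ref_id))
        for elem_id, body, ref_id in with_ref
    ]
    return id_to_text, resolved


id_to_text, resolved = extract(sample)
print(id_to_text)
print(resolved)
```

The REF lookup runs only after the whole text has been scanned, so a REF that points at an ID appearing later in the text still resolves. For real HTML with nesting or varying attribute order, a proper parser is likely a safer choice than a regular expression.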
