user3449212

Python: scraping multiple strings with different conditions

My text looks like this:

<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> (pronunciation born <COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF> on 27 December 1965)[3] is an <COREF ID="14">Indian</COREF> film <COREF ID="15">actor</COREF>, <COREF ID="17">producer</COREF>, television <COREF ID="19">presenter</COREF>, and <COREF ID="20">philanthropist</COREF> known for <COREF ID="4" REF="2">his</COREF> Hindi films. <COREF ID="5" REF="2">He</COREF> is the <COREF ID="21">son</COREF> of <COREF ID="16" REF="15">actor</COREF> and screenwriter Salim <COREF ID="6" REF="2">Khan</COREF>. <COREF ID="7" REF="2">Khan</COREF> began <COREF ID="8" REF="2">his</COREF> acting career with <COREF ID="22">Biwi Ho</COREF> To <COREF ID="24">Aisi</COREF> but <COREF ID="18" REF="17">it</COREF> was <COREF ID="9" REF="2">his</COREF> second film <COREF ID="25">Maine Pyar</COREF> <COREF ID="26">Kiya</COREF>(1989), in which <COREF ID="10" REF="2">he</COREF> acted in a lead role, that garnered <COREF ID="11" REF="2">him</COREF> the Filmfare Award for Best Male Debut. <COREF ID="12" REF="2">Khan</COREF> has starred in several commercially successful films, such as <COREF ID="28">Saajan</COREF> (1991), <COREF ID="29">Hum Aapke Hain Koun</COREF>..! (1994), <COREF ID="30">Karan Arjun</COREF> (1995),<COREF ID="31">Judwaa</COREF> (1997), <COREF ID="32">Pyar</COREF> <COREF ID="27" REF="26">Kiya</COREF> To Darna <COREF ID="33">Kya</COREF> (1998), <COREF ID="23" REF="22">Biwi</COREF> No.1 (1999), and Hum Saath <COREF ID="34">Saath Hain</COREF> (1999), having appeared in the highest grossing film nine separate years during <COREF ID="13" REF="2">his</COREF> career, a record that remains unbroken.[4]

What I want to do is:

  1. Get each ID together with its string.
  2. Get only those IDs that have a REF. The result should give the ID string and the REF string. Once we have the ID and REF numbers, we can look up the strings from result 1 using a map data structure.

I tried it this way:

def doit(text):      
  import re
  matches=re.findall(r'\>(.+?)\<',text)
  # matches is now ['String 1', 'String 2', 'String3']
  return ",".join(matches)
print doit(string)

which returns all the strings individually.

To scrape each ID, I then did this:

def doit(text):      
    import re
    #matches = re.findall((?<="ID=")(.*)(?=""))
    matches = re.findall(r'ID=\"(\d+)', text)
    return ",".join(matches)

print doit(string)

This should scrape the content between ID=" and ", i.e. the ID number, but it gives an error:

SyntaxError: invalid syntax

What am I doing wrong? Is there a better alternative?

UPDATE:

string = "<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> (pronunciation born <COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF> on 27 December 1965)[3] is an <COREF ID="14">Indian</COREF> film <COREF ID="15">actor</COREF>, <COREF ID="17">producer</COREF>, television <COREF ID="19">presenter</COREF>, and <COREF ID="20">philanthropist</COREF> known for <COREF ID="4" REF="2">his</COREF> Hindi films. <COREF ID="5" REF="2">He</COREF> is the <COREF ID="21">son</COREF> of <COREF ID="16" REF="15">actor</COREF> and screenwriter Salim <COREF ID="6" REF="2">Khan</COREF>. <COREF ID="7" REF="2">Khan</COREF> began <COREF ID="8" REF="2">his</COREF>"

def doit(text):      
    import re
    #matches = re.findall((?<="ID=")(.*)(?=""))
    matches = re.findall(r'ID=\"(\d+)', text)
    return ",".join(matches)

print doit(string)

Upvotes: 0

Views: 50

Answers (1)

Jose Varez

Reputation: 2077

If you just want the IDs and they are all numeric, try this:

re.findall(r'ID=\"(\d+)', text)

\d+ matches only digits, so only the numeric part of each ID is captured.
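The SyntaxError most likely does not come from the regex. In the UPDATE, the string is delimited with double quotes but also contains double quotes (ID="1"), so the literal ends right after ID= and the rest no longer parses. Delimiting it with single quotes (or triple quotes) avoids that. On Python 3, `print doit(string)` would also raise a SyntaxError, since print is a function there.

For the two results you asked for, here is a minimal sketch. It assumes every tag has ID first and an optional REF second, with no nested COREF tags; the names `COREF_RE`, `parse_corefs`, `id_to_text`, and `refs` are mine, and the sample text is a shortened version of yours.

```python
import re

# Single-quoted pieces: the double quotes inside the markup no longer end the string.
text = ('<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> (born '
        '<COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF>) is an '
        '<COREF ID="15">actor</COREF>. <COREF ID="5" REF="2">He</COREF> is the son of '
        '<COREF ID="16" REF="15">actor</COREF> Salim <COREF ID="6" REF="2">Khan</COREF>.')

# Assumed tag shape: ID, then an optional REF, then the mention text.
# Group 1 = ID, group 2 = REF ('' when absent), group 3 = text.
COREF_RE = re.compile(r'<COREF ID="(\d+)"(?: REF="(\d+)")?>(.*?)</COREF>')


def parse_corefs(text):
    """Return (id_to_text, refs): a dict ID -> string, and a list of (ID, REF) pairs."""
    id_to_text = {}
    refs = []
    for coref_id, ref_id, mention in COREF_RE.findall(text):
        id_to_text[coref_id] = mention        # result 1: each ID with its string
        if ref_id:                            # result 2: only tags that carry a REF
            refs.append((coref_id, ref_id))
    return id_to_text, refs


id_to_text, refs = parse_corefs(text)

# Resolve each (ID, REF) pair to (ID string, REF string) through the map.
resolved = [(id_to_text[i], id_to_text.get(r)) for i, r in refs]

for mention, target in resolved:
    print('%s -> %s' % (mention, target))
```

A single pass with one pattern keeps the ID, REF, and text together, so there is no need to line up the outputs of separate findall calls. If the markup can be nested or the attribute order varies, a real parser would be safer than a regex.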

Upvotes: 1
