RJ_Singh

Reputation: 115

Read and process data from URL in python

I am trying to read data from a URL. The format of the content at that URL is shown below.

What I am trying to do:
1) Read the content line by line and check whether the line contains the desired keyword.
2) If it does, store the previous line's content ("GETCONTENT") in a list.

<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT    
 a       <http://www.example.com/XYZ/mount/v1#NNNN> , 
<http://www.w3.org/2002/w#Individual> ;
        <http://www.w3.org/2000/01/rdf-schema#label>
                "some content , "some url content ;
        <http://www.example.com/XYZ/log/v1#hasRelation>
                <http://www.example.com/XYZ/data/v1#Change> ;
        <http://www.example.com/XYZ/log/v1#ServicePage>
                <https://dev.org.net/apis/someLabel> ;
        <http://www.example.com/XYZ/log/v1#Description>
                "Some API Content .

<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a       <http://www.w3.org/01/07/w#BBBBBB> ;
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#xyz> ;
        <http://www.w3.org/2000/01/schema#label1>
               "some content , "some url content ;
        <http://www.w3.org/2000/01/schema#range>
                <http://www.w3.org/2001/XMLSchema#boolean> ;
       <http://www.example.com/XYZ/log/v1#Description>
            "Some description .

<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
 a       <http://www.w3.org/01/07/w#AAAAAA> ;
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#Version> ;
        <http://www.w3.org/2000/01/schema#label>
                "some content ;
        <http://www.w3.org/2000/01/schema#range>
            <http://www.example.com/XYZ/data/v1#uuu> .

<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
 a       <http://www.w3.org/01/07/w#CCCCCC , 
<http://www.w3.org/2002/07/w#Name> 
        <http://www.w3.org/2000/01/schema#domain>
                <http://www.example.com/XYZ/data/v1#xyz> ;
        <http://www.w3.org/2000/01/schema#label1>
              "some content , "some url content ;
        <http://www.w3.org/2000/01/schema#range>
               <http://www.w3.org/2001/XMLSchema#boolean> ;
        <http://www.example.com/XYZ/log/v1#Description>
               "Some description .

Below is the code I have tried so far, but it prints the entire content of the file:

import re

def read_from_url():
    try:
        from urllib.request import urlopen
    except ImportError:
        from urllib2 import urlopen
    url_link = "examle.com"
    html = urlopen(url_link)
    previous = None
    for line in html:
        previous = line
        line = re.search(r"^(\s*a\s*)|\#GETBBBBBB|#GETAAAAAA|#GETCCCCCC\b", line.decode('UTF-8'))
        print(previous)

if __name__ == '__main__':
    read_from_url()

Expected output:

GETBBBBBB , GETAAAAAA , GETCCCCCC 

Thanks in advance!!

Upvotes: 1

Views: 4902

Answers (2)

Malekai

Reputation: 5011

When it comes to reading data from URLs, the requests library is much simpler:

import requests

url = "https://www.example.com/your/target.html"
text = requests.get(url).text

If you don't have it installed, you can install it with:

pip3 install requests

Next, why go through the hassle of shoving all of your words into a single regular expression when you could use a word array and then use a for loop instead?

For example:

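import re

search_words = "hello word world".split(" ")
matching_lines = []

# splitlines() yields whole lines; split() would yield single words
for (i, line) in enumerate(text.splitlines()):
  line = line.strip()
  if len(line) < 1:
    continue
  for word in search_words:
    # raw strings keep \b as a word boundary instead of a backspace character
    if re.search(r"\b" + re.escape(word) + r"\b", line):
      matching_lines.append(line)
      break  # stop at the first match so the line is only added once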

Then you'd output the result, like this:

print(matching_lines)

Running this where the text variable equals:

"""
this word will save the line
ignore me!
hello my friend!
what about me?
"""

Should output:

[
  "this word will save the line",
  "hello my friend!"
]

You could make the search case insensitive by using the lower method, like this:

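import re

search_words = "hello word world".lower().split(" ")
matching_lines = []

# splitlines() yields whole lines; split() would yield single words
for (i, line) in enumerate(text.splitlines()):
  line = line.strip()
  if len(line) < 1:
    continue
  line = line.lower()
  for word in search_words:
    # raw strings keep \b as a word boundary instead of a backspace character
    if re.search(r"\b" + re.escape(word) + r"\b", line):
      matching_lines.append(line)
      break  # stop at the first match so the line is only added once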

Notes and information:

  1. stopping at the first matching word (a break out of the inner loop; a continue there would only move on to the next word) keeps a line from being added more than once
  2. the enumerate function lets us iterate over both the index and the current line (the index i is unused here, so a plain loop over the lines works too)
  3. the search words are lowercased once, outside of the for loop, so lower isn't called again for every word and every line
  4. lower is called on the line only after the empty-line check, because there's no point in lowercasing an empty line
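
The notes above could also be folded into one small helper. This is only a sketch of an alternative, not the approach used in the loops above: it joins the words into a single compiled alternation and uses re.IGNORECASE instead of lowercasing. The function name find_matching_lines and its ignore_case parameter are made up for illustration, and it assumes the word list is not empty.

```python
import re

def find_matching_lines(text, words, ignore_case=False):
    """Return the stripped lines of text that contain any of words as a whole word."""
    flags = re.IGNORECASE if ignore_case else 0
    # One alternation, compiled once; re.escape guards against regex metacharacters.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", flags)
    # Empty lines never match, so no separate length check is needed.
    return [line.strip() for line in text.splitlines() if pattern.search(line)]

text = """
this word will save the line
ignore me!
hello my friend!
what about me?
"""
print(find_matching_lines(text, ["hello", "word", "world"]))
```

Unlike the lowercasing version, this keeps the original casing of the lines it returns.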

Good luck.

Upvotes: 3

Baruch Spinoza

Reputation: 21

I'm puzzled about a few things, and answering them may help the community better assist you. Specifically, I can't tell what form the data is in (i.e. is it a txt file, or a URL you're making a request to and then parsing the response of?). I also can't tell whether you're trying to get the entire line, just the URL, or just the bit that follows the hash symbol.

Nonetheless, you stated you were looking for the program to output GETBBBBBB , GETAAAAAA , GETCCCCCC, and here's a quick way to get those specific values (assuming the values are in the form of a string):

search = re.findall(r'#(GET[ABC]{6})>', string)
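
As a quick check of that pattern, here is a small sketch run against a few lines copied from the sample in the question. The variable string is assumed to hold the whole document as one str (for example the decoded response body).

```python
import re

# A few lines taken from the sample data in the question.
string = """<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT
<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a       <http://www.w3.org/01/07/w#BBBBBB> ;
<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
"""

# Capture "GET" followed by exactly six characters from A, B, C, between "#" and ">".
search = re.findall(r'#(GET[ABC]{6})>', string)
print(" , ".join(search))  # GETBBBBBB , GETAAAAAA , GETCCCCCC
```

Note that the pattern is tied to these three names; other identifiers after the hash would need a broader character class.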

Otherwise, if you're reading from a txt file, this may help:

import re

with open('example_file.txt', 'r') as file:
    lst = []
    for line in file:
        # findall returns [] when nothing matches, so extend adds nothing then
        lst.extend(re.findall(r'#(GET[ABC]{6})', line))
    print(lst)
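
If the goal really is "when a line matches, keep the previous line", as the question describes, something along these lines might work. It is a sketch under assumptions drawn from the sample data: the keyword line is taken to be one that starts with a followed by a <...> term, and the wanted value is taken to be the fragment after # at the end of the previous line. The names collect_names, TYPE_LINE and FRAGMENT are hypothetical.

```python
import re

# Both patterns are assumptions based on the sample data in the question.
TYPE_LINE = re.compile(r"^\s*a\s+<")    # e.g. " a       <http://...> ;"
FRAGMENT = re.compile(r"#(\w+)>\s*$")   # e.g. "...#GETBBBBBB>" at the end of a line

def collect_names(lines):
    """Return the fragment of each line that directly precedes an 'a <...>' line."""
    names = []
    previous = ""
    for raw in lines:
        # urlopen yields bytes while open() yields str; accept both.
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if TYPE_LINE.match(line):
            found = FRAGMENT.search(previous)
            if found:
                names.append(found.group(1))
        # Update previous only after the check, so it still holds the earlier line.
        previous = line
    return names

sample = [
    "<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT\n",
    " a       <http://www.example.com/XYZ/mount/v1#NNNN> ,\n",
    "<http://www.example.com/XYZ/model/v1#GETBBBBBB>\n",
    "a       <http://www.w3.org/01/07/w#BBBBBB> ;\n",
    b"<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>\n",
    b" a       <http://www.w3.org/01/07/w#AAAAAA> ;\n",
]
print(", ".join(collect_names(sample)))  # GETBBBBBB, GETAAAAAA
```

Because the function accepts any iterable of lines, the same call should work with an open file object or with the response object returned by urlopen.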

Of course, these are just some quick suggestions in case they may be of help. Otherwise, please answer the questions I mentioned at the beginning of my response and maybe it can help someone on SO better understand what you're looking to get.

Upvotes: 1
