Reputation: 115
I am trying to get the data from a URL. Below is the URL format.
What I am trying to do:
1) Read line by line and check whether the line contains the desired keyword.
2) If yes, store the previous line's content "GETCONTENT" in a list.
<http://www.example.com/XYZ/a-b-c/w#>DONTGETCONTENT
a <http://www.example.com/XYZ/mount/v1#NNNN> ,
<http://www.w3.org/2002/w#Individual> ;
<http://www.w3.org/2000/01/rdf-schema#label>
"some content , "some url content ;
<http://www.example.com/XYZ/log/v1#hasRelation>
<http://www.example.com/XYZ/data/v1#Change> ;
<http://www.example.com/XYZ/log/v1#ServicePage>
<https://dev.org.net/apis/someLabel> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some API Content .
<http://www.example.com/XYZ/model/v1#GETBBBBBB>
a <http://www.w3.org/01/07/w#BBBBBB> ;
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#xyz> ;
<http://www.w3.org/2000/01/schema#label1>
"some content , "some url content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.w3.org/2001/XMLSchema#boolean> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some description .
<http://www.example.com/XYZ/datamodel-ee/v1#GETAAAAAA>
a <http://www.w3.org/01/07/w#AAAAAA> ;
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#Version> ;
<http://www.w3.org/2000/01/schema#label>
"some content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.example.com/XYZ/data/v1#uuu> .
<http://www.example.com/XYZ/datamodel/v1#GETCCCCCC>
a <http://www.w3.org/01/07/w#CCCCCC ,
<http://www.w3.org/2002/07/w#Name>
<http://www.w3.org/2000/01/schema#domain>
<http://www.example.com/XYZ/data/v1#xyz> ;
<http://www.w3.org/2000/01/schema#label1>
"some content , "some url content ;
<http://www.w3.org/2000/01/schema#range>
<http://www.w3.org/2001/XMLSchema#boolean> ;
<http://www.example.com/XYZ/log/v1#Description>
"Some description .
Below is the code I tried so far, but it prints all the content of the file.
import re

def read_from_url():
    try:
        from urllib.request import urlopen
    except ImportError:
        from urllib2 import urlopen
    url_link = "examle.com"
    html = urlopen(url_link)
    previous = None
    for line in html:
        previous = line
        line = re.search(r"^(\s*a\s*)|\#GETBBBBBB|#GETAAAAAA|#GETCCCCCC\b",
                         line.decode('UTF-8'))
        print(previous)

if __name__ == '__main__':
    read_from_url()
Expected output:
GETBBBBBB , GETAAAAAA , GETCCCCCC
Thanks in advance!!
Upvotes: 1
Views: 4902
Reputation: 5011
When it comes to reading data from URLs, the requests
library is much simpler:
import requests
url = "https://www.example.com/your/target.html"
text = requests.get(url).text
If you haven't got it installed you could use the following to do so:
pip3 install requests
Next, why go through the hassle of shoving all of your words into a single regular expression when you could use a word array and then use a for loop instead?
For example:
import re

search_words = "hello word world".split(" ")

matching_lines = []
for (i, line) in enumerate(text.split("\n")):
    line = line.strip()
    if len(line) < 1:
        continue
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break
Then you'd output the result, like this:
print(matching_lines)
Running this where the text
variable equals:
"""
this word will save the line
ignore me!
hello my friend!
what about me?
"""
Should output:
[
"this word will save the line",
"hello my friend!"
]
You could make the search case insensitive by using the lower
method, like this:
search_words = "hello word world".lower().split(" ")

matching_lines = []
for (i, line) in enumerate(text.split("\n")):
    line = line.strip()
    if len(line) < 1:
        continue
    line = line.lower()
    for word in search_words:
        if re.search(r"\b" + word + r"\b", line):
            matching_lines.append(line)
            break
Notes and information:
- The break keyword stops the inner loop after the first match, so you don't keep searching the current line once a word has matched.
- The enumerate function allows us to iterate over both the index and the current line.
- lower is called on the search words up front, so you don't have to call it again for every word match on every line.
- lower is only called on the line after the empty check, because there's no point in lowercasing an empty line.
Good luck.
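Since you also want the line before each match (as described in the question), here's a rough sketch of the same loop adapted so that it remembers the previous line; the URL is a placeholder, the keywords are taken from your question, and a plain substring check stands in for the regular expression:
import requests

# keywords from the question; the URL below is only a placeholder
search_words = ["GETBBBBBB", "GETAAAAAA", "GETCCCCCC"]
text = requests.get("https://www.example.com/your/target.html").text

previous_lines = []
previous = None
for line in text.split("\n"):
    for word in search_words:
        if word in line:
            # store the line that came *before* the matching one
            previous_lines.append(previous)
            break
    previous = line

print(previous_lines)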
Upvotes: 3
Reputation: 21
I'm puzzled about a few things, and answering them may help the community better assist you. Specifically, I can't tell what form the file is in (i.e. is it a txt file or a URL you're making a request to and parsing the response of). I also can't tell if you're trying to get the entire line, just the URL, or just the bit that follows the hash symbol.
Nonetheless, you stated you were looking for the program to output GETBBBBBB , GETAAAAAA , GETCCCCCC
, and here's a quick way to get those specific values (assuming the values are in the form of a string):
search = re.findall(r'#(GET[ABC]{6})>', string)
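For instance, here's a rough sketch of that idea with the page fetched over HTTP; the URL is a placeholder, and I'm assuming the response body looks like the sample in your question:
import re
import requests

url = "https://www.example.com/your/target.html"  # placeholder
string = requests.get(url).text

# findall returns only the captured group, i.e. the name after the '#'
print(re.findall(r'#(GET[ABC]{6})>', string))
# e.g. ['GETBBBBBB', 'GETAAAAAA', 'GETCCCCCC'] for the sample data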
Otherwise, if you're reading from a txt file, this may help:
import re

with open('example_file.txt', 'r') as file:
    lst = []
    for line in file:
        search = re.findall(r'#(GET[ABC]{6})', line)
        if search != []:
            lst += search
    print(lst)
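Assuming example_file.txt holds the sample text from your question, that should print something like ['GETBBBBBB', 'GETAAAAAA', 'GETCCCCCC'].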
Of course, these are just some quick suggestions in case they may be of help. Otherwise, please answer the questions I mentioned at the beginning of my response and maybe it can help someone on SO better understand what you're looking to get.
Upvotes: 1