Reputation: 37
country_names.txt is a file with multiple lines, each containing a European country and an Asian country. Read in each line of text until there is a line with the country names.
Example line inside text file:
<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>
How do I use ONLY ONE regular expression to extract a European country and an Asian country from any line that contains two countries? After extracting the countries, store the European country in a list of European country names and the Asian country in a list of Asian country names.
When all the lines have been read in, print a count of how many European countries and Asian countries have been read in.
Currently, this is what I have:
import re
with open('country_names.txt') as infile:
    for line in infile:
        countries = re.findall("", "", infile) # regex code inside ""s in parenthesis
        european_countries = countries.group(1)
        asian_countries = countries.group(2)
Upvotes: 2
Views: 325
Reputation: 1295
You can use this regex to pull out the countries: <\s*(td)[^>]*>(\w*)<\s*/\s*(td)>
It selects the td tags whose content consists only of word characters. The numeric cells are skipped because they contain a decimal point, which \w does not match.
re.findall returns a list of tuples, one element per capture group:
[('td', 'England', 'td'), ('td', 'Japan', 'td')]
I then map over the list and select the second element of each tuple, which is the country.
regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
countries = list(map(lambda x: x[1], re.findall(regex, line)))
print(countries) # ['England', 'Japan']
One thing to note is you need to use line instead of infile in the loop.
So to put it together:
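import re
regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
european_countries = []
asian_countries = []
with open('country_names.txt') as infile:
    for line in infile:
        # Keep only the cell text (second capture group of each match)
        countries = list(map(lambda x: x[1], re.findall(regex, line)))
        if len(countries) >= 2:  # skip lines without two country cells
            european_countries.append(countries[0])
            asian_countries.append(countries[1])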
Please note this will not work if you have other <td> tags with text in them. The order of the countries on each line also matters for this code. Still, it is a quick solution to your problem.
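The caveat about extra <td> cells can be made explicit with a stricter check. The following is a minimal sketch, not part of the original answer: the helper name extract_pair and the extra "Region" cell are hypothetical, and it assumes a line should be accepted only when exactly two word-only cells are found.

```python
import re

# Same pattern as in the answer, written as a raw string and precompiled
CELL = re.compile(r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>')

def extract_pair(line):
    """Return (european, asian), or None unless exactly two word-only cells are found."""
    words = [m[1] for m in CELL.findall(line)]
    if len(words) != 2:
        return None
    return words[0], words[1]

good = '<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>'
# Hypothetical line with one extra text cell in front
extra = '<td>Region</td> ' + good

print(extract_pair(good))   # ('England', 'Japan')
print(extract_pair(extra))  # None
```

Rejecting such lines is only one possible policy; depending on the real file it may be better to log them instead of silently skipping them.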
Upvotes: 1
Reputation: 824
For one regex only you should use ^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>. You can play with it here: https://regex101.com/r/q9XHDD/1
When running it on your example you'll get:
>>> re.findall(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*", "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>")
[('England', 'Japan')]
My suggestion is not to use re.findall but re.match, in which case your code should be:
import re
regex = r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*"
eu_countries = []
as_countries = []
with open('country_names.txt') as infile:
    for line in infile:
        match = re.match(regex, line)
        if match:
            eu_countries.append(match.group(1))
            as_countries.append(match.group(2))
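The question also asks for a count once all lines have been read, which the re.match snippet does not print. Below is a minimal sketch of the full flow, assuming every data line starts with <td (re.match only matches at the start of the string). The function name collect_countries and the sample lines other than the one from the question are made up for illustration.

```python
import re

# The single regex from the answer, precompiled as a raw string
PAIR = re.compile(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*")

def collect_countries(lines):
    """Return (european, asian) lists built from every line that matches PAIR."""
    eu, asia = [], []
    for line in lines:
        match = PAIR.match(line)
        if match:  # lines without two country cells are skipped
            eu.append(match.group(1))
            asia.append(match.group(2))
    return eu, asia

# Illustrative input; only the England/Japan line comes from the question
sample_lines = [
    "<table>\n",
    "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>\n",
    "<td >France</td> <td>1.0</td> <td >India</td> <td>2.0</td></tr>\n",  # placeholder figures
]

eu_countries, as_countries = collect_countries(sample_lines)
print(f"European countries read: {len(eu_countries)}")
print(f"Asian countries read: {len(as_countries)}")
```

Because the function accepts any iterable of strings, an open file object can be passed in place of sample_lines, for example inside with open('country_names.txt') as infile.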
Upvotes: 3
Reputation: 293
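# Assumes each line looks like "England, Japan", not the HTML row shown in the question
e_countries = []
a_countries = []
with open('country_names.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(', ')
        if len(parts) >= 2:  # skip lines that do not hold two names
            e_countries.append(parts[0])
            a_countries.append(parts[1])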
Upvotes: 0