Karilyn Lee

Reputation: 37

Python Regex: How do I use a regular expression to read a file with multiple lines and extract words from each line into two different lists?

country_names.txt is a file with multiple lines, each line containing a European country and an Asian country. Read in each line of text until there is a line with the country names.

Example line inside text file: <td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>

How do I use ONLY ONE regular expression to extract a European country and an Asian country from any line that contains two countries? After extracting the countries, store the European country in a list of European country names and store the Asian country in a list of Asian country names.

When all the lines have been read in, print a count of how many European countries and Asian countries have been read in.

Currently, this is what I have:

import re

with open('country_names.txt') as infile:
    for line in infile:
        countries = re.findall("", "", infile) # regex code inside ""s in parenthesis
        european_countries = countries.group(1)
        asian_countries = countries.group(2)

Upvotes: 2

Views: 325

Answers (3)

brandonbanks

Reputation: 1295

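You can use this regex to pull out the countries: <\s*(td)[^>]*>(\w*)<\s*/\s*(td)> It selects td elements whose content consists only of word characters. Cells such as 55.98 are skipped because the decimal point is not a word character. Note that \w also matches digits and underscores, so a cell containing only an integer would match as well.

For the example line, re.findall returns a list of tuples, one per match: [('td', 'England', 'td'), ('td', 'Japan', 'td')]

I then map over the list and select the second element of each tuple, which is the country.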

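import re

line = '<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>'
regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'  # raw string keeps the backslashes intact
countries = list(map(lambda x: x[1], re.findall(regex, line)))
print(countries)  # ['England', 'Japan']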

Note that you need to pass line, not infile, to re.findall inside the loop.

So to put it together:

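import re

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
european_countries = []
asian_countries = []

with open('country_names.txt') as infile:
    for line in infile:
        # Keep only the text captured between the td tags
        countries = list(map(lambda x: x[1], re.findall(regex, line)))
        if len(countries) >= 2:  # skip lines that do not contain two countries
            european_countries.append(countries[0])
            asian_countries.append(countries[1])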

Please note that this will not work if a line has other <td> tags with text in them. The code also relies on the order of the countries: the European country must come first and the Asian country second. Still, it is a quick solution to your problem.
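The caveats above can be made concrete with a small sketch. It assumes a variant of the pattern with a single capturing group and letters only, so re.findall returns plain strings; the name extract_pair and the sample rows are hypothetical and not part of the original answer.

```python
import re

# Variant of the answer's pattern with one capturing group (an assumption,
# not the original regex), so re.findall returns strings instead of tuples.
CELL = re.compile(r'<\s*td[^>]*>\s*([A-Za-z]+)\s*<\s*/\s*td\s*>')

def extract_pair(line):
    """Return (european, asian) if the line has exactly two word cells, else None."""
    names = CELL.findall(line)
    if len(names) != 2:
        return None  # no countries, or extra text cells that make the order ambiguous
    return names[0], names[1]

row = '<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>'
print(extract_pair(row))                          # ('England', 'Japan')
print(extract_pair('<tr><th>Country</th></tr>'))  # None
print(extract_pair('<td>England</td> <td>UK</td> <td>Japan</td>'))  # None
```

Rejecting lines with anything other than two matches trades coverage for safety: a row with an extra text cell is skipped rather than filed under the wrong list. Multi-word names such as "South Korea" would not match this letters-only pattern.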

Upvotes: 1

Aviad Levy

Reputation: 824

To do this with a single regex, use ^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>. You can experiment with it here: https://regex101.com/r/q9XHDD/1

When running it on your example you'll get:

>>> re.findall(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*", "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>")
[('England', 'Japan')]

My suggestion is to use re.match instead of re.findall, because re.match returns a match object (or None) whose groups you can read with group(). Your code would then be:

import re

# Raw string so the regex backslashes are not treated as string escapes
regex = r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*"
eu_countries = []
as_countries = []
with open('country_names.txt') as infile:
    for line in infile:
        match = re.match(regex, line)
        if match:  # only lines that contain two countries
            eu_countries.append(match.group(1))
            as_countries.append(match.group(2))
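The question also asks for a count of the countries read in, which the loop above does not print. A minimal sketch of that last step follows, assuming the same pattern as this answer; the function name collect_countries and the in-memory sample lines are hypothetical stand-ins for reading country_names.txt.

```python
import re

# Same pattern as in the answer above, compiled once from a raw string.
PAIR = re.compile(r"^<td\s*>([a-zA-Z]+)</td\s*>.*<td\s*>([a-zA-Z]+)</td\s*>")

def collect_countries(lines):
    """Return (european, asian) lists from lines that contain two countries."""
    eu, asia = [], []
    for line in lines:
        match = PAIR.match(line)
        if match:
            eu.append(match.group(1))
            asia.append(match.group(2))
    return eu, asia

# Hypothetical sample input; with the real file, pass the open file object instead.
sample = [
    "<table>",
    "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>",
    "<td >France</td> <td>67.0</td> <td >India</td> <td>1353.0</td></tr>",
    "</table>",
]
eu, asia = collect_countries(sample)
print(f"European countries read: {len(eu)}")
print(f"Asian countries read: {len(asia)}")
```

Because the pattern is anchored with ^ and applied with match, a line that starts with leading whitespace would be skipped; strip the line first if the file is indented.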

Upvotes: 3

PythonNerd

Reputation: 293

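e_countries = []
a_countries = []
# Assumes each line looks like "England, Japan" (comma-separated),
# not the HTML row shown in the question.
with open('country_names.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(', ')
        if len(parts) >= 2:  # skip lines without two comma-separated names
            e_countries.append(parts[0])
            a_countries.append(parts[1])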

Upvotes: 0
