Reputation: 143
I have a space-separated txt file like the following:
2004 Temperature for KATHMANDU AIRPORT
Tmax Tmin
1 18.8 2.4
2 19.0 1.1
3 18.3 1.7
4 18.3 1.0
5 17.8 1.3
I want to calculate the mean of Tmax and Tmin separately, but I am having a hard time reading the txt file. I tried the following, based on this link:
import re
list_b = []
list_d = []
with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r"[\d.\d+']+", line)
        list_b.append(float(list_line[1])) #appends second column
        list_d.append(float(list_line[3])) #appends fourth column
print list_b
print list_d
But it gives me this error:
IndexError: list index out of range
What is wrong here?
Upvotes: 1
Views: 13762
Reputation: 1921
Is it possible that your file is tab delimited?
For a tab-delimited file:
with open('TA103019.95.txt', 'r') as f:
    for idx, line in enumerate(f):
        if idx > 1:  # skip the two header lines
            cols = line.split('\t')  # for space-delimited, change '\t' to ' '
            tmax = float(cols[1])
            tmin = float(cols[2])
            # calc mean of this row's Tmax and Tmin
            mean = (tmax + tmin) / 2
            # not sure what you want to do with the result
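One quick way to settle the delimiter question is to look at a raw line with repr(), which shows a tab as \t. The sketch below is only an illustration: the two sample strings stand in for lines read from the file, and the helper name guess_delimiter is made up here.

```python
def guess_delimiter(line):
    """Return a tab if the line contains one, otherwise a space (hypothetical helper)."""
    return '\t' if '\t' in line else ' '

# Sample rows standing in for lines read from the file (assumed, not the real data file).
space_row = "1 18.8 2.4\n"
tab_row = "1\t18.8\t2.4\n"

for row in (space_row, tab_row):
    # repr() makes invisible characters such as \t and \n visible
    print(repr(row), '->', repr(guess_delimiter(row)))
```

If the printed line shows \t between the numbers, split on '\t'; otherwise split on spaces.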
Upvotes: 0
Reputation: 186
As documented, re.findall lists all matches of your regular expression. The expression you defined does not produce enough matches on every line of your file (the header lines yield at most one), so list_line is too short and you get the error when you try to access list_line[1].
A better expression is r"\d+\.\d+", which matches any decimal number with at least one digit before the decimal point, the decimal point itself, and at least one digit after it. The two values on each data line are then at indices 0 and 1, so:
import re
list_b = []
list_d = []
with open('TA103019.95.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r'\d+\.\d+', line)
        if len(list_line) == 2:
            list_b.append(float(list_line[0])) # Tmax (second column)
            list_d.append(float(list_line[1])) # Tmin (third column)
print list_b
print list_d
Upvotes: 1
Reputation: 3385
Full Solution
def readit(file_name, start_line=2):  # start_line: index of the first data line (2 means the 3rd line, counting from 0)
    with open(file_name, 'r') as f:
        data = f.read().split('\n')
    data = [i.split(' ') for i in data[start_line:]]
    for i in range(len(data)):
        row = [sub for sub in data[i] if len(sub) != 0]
        if row:  # skip blank lines, such as the one after a trailing newline
            yield int(row[0]), float(row[1]), float(row[2])
iterator = readit('TA103019.95.txt')
index, tmax, tmin = zip(*iterator)
mean_Tmax = sum(tmax)/len(tmax)
mean_Tmin = sum(tmin)/len(tmin)
print('Mean Tmax: ',mean_Tmax)
print('Mean Tmin: ',mean_Tmin)
>>> ('Mean Tmax: ', 18.439999999999998)
>>> ('Mean Tmin: ', 1.5)
Thanks to Dan D. for a more elegant solution.
Upvotes: 1
Reputation: 2491
import re
list_b = []
list_d = []
with open('TA103019.95.txt', 'r') as f:
    for line in f:
        # regex is corrected to match the decimal values only
        list_line = re.findall(r"\d+\.\d+", line)
        # error condition handled where the values are not found
        if len(list_line) < 2:
            continue
        # indexes are corrected below
        list_b.append(float(list_line[0])) # Tmax (second column)
        list_d.append(float(list_line[1])) # Tmin (third column)
print list_b
print list_d
I have added my answer with some comments in the code itself.
You were getting the "list index out of range" error because your list_line had only a single element (2004, from the first line of the file) and you were trying to access indexes 1 and 3 of list_line.
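To see this concretely, the sketch below runs both patterns over sample lines copied from the question (pasted in as strings, so no file is needed) and collects what re.findall returns for each. It is written for Python 3, and the variable names are illustrative only.

```python
import re

sample_lines = [
    "2004 Temperature for KATHMANDU AIRPORT",
    "Tmax Tmin",
    "1 18.8 2.4",
]

original = r"[\d.\d+']+"   # pattern from the question: a character class, not "digits.digits"
corrected = r"\d+\.\d+"    # one or more digits, a literal dot, one or more digits

original_matches = [re.findall(original, line) for line in sample_lines]
corrected_matches = [re.findall(corrected, line) for line in sample_lines]

for line, old, new in zip(sample_lines, original_matches, corrected_matches):
    print(repr(line), old, new)
```

With the original pattern no line yields four matches, so list_line[3] would fail even on the data rows. With the corrected pattern the header lines yield no matches and each data row yields exactly two.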
Upvotes: 1
Reputation: 22992
A simple way to solve that is to use the split() function.
Of course, you need to drop the first two lines:
import io

with io.open("path/to/file.txt", mode="r", encoding="utf-8") as f:
    next(f)  # skip the title line
    next(f)  # skip the column-name line
    for line in f:
        print(line.split())
You get:
['1', '18.8', '2.4']
['2', '19.0', '1.1']
['3', '18.3', '1.7']
['4', '18.3', '1.0']
['5', '17.8', '1.3']
Quoting the documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
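Building on the split() loop above, the two column means can be accumulated in the same pass. The following is only a sketch, written for Python 3: it reads the sample data from an in-memory string through io.StringIO instead of the real file, and the names column_means and sample are made up for illustration.

```python
import io

def column_means(f, header_lines=2):
    """Return (mean_tmax, mean_tmin) for a whitespace-separated file object."""
    for _ in range(header_lines):
        next(f)                      # skip the title and column-name lines
    tmax, tmin = [], []
    for line in f:
        fields = line.split()        # no argument: split on any run of whitespace
        if len(fields) < 3:
            continue                 # ignore blank or incomplete lines
        tmax.append(float(fields[1]))
        tmin.append(float(fields[2]))
    return sum(tmax) / len(tmax), sum(tmin) / len(tmin)

# Sample data copied from the question, standing in for the real file.
sample = io.StringIO(
    "2004 Temperature for KATHMANDU AIRPORT\n"
    "Tmax Tmin\n"
    "1 18.8 2.4\n"
    "2 19.0 1.1\n"
    "3 18.3 1.7\n"
    "4 18.3 1.0\n"
    "5 17.8 1.3\n"
)
mean_tmax, mean_tmin = column_means(sample)
print(mean_tmax, mean_tmin)
```

With a real file, pass the object returned by io.open(...) in place of sample. The sketch assumes at least one data row is present; an empty file would need its own handling.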
Upvotes: 4
Reputation: 11913
Simplify your life and avoid 're' for this problem.
Perhaps you are reading the header rows by mistake? If the format of the file is fixed, I usually "burn" the header rows with a line read for each before starting the loop, like:
with open(file_name, 'r') as f:
    f.readline() # burn the title line
    f.readline() # burn the column-name line
    for line in f:
        tokens = line.strip().split(' ') # tokenize the row based on spaces
Then you have a list of tokens, which will be strings that you'll need to convert to int or float or whatever, and go from there!
Put in a couple of print statements to see what you are picking up...
Upvotes: 0