Reputation: 3325
Here is my data sample in a txt file:
1322484979.322313000 85.24.168.19 QQlb-j7itDQ
1322484981.070116000 83.233.56.133 Ne8Bb1d5oyc
1322484981.128791000 83.233.56.133 Ne8Bb1d5oyc
1322484981.431075000 83.233.56.133 Ne8Bb1d5oyc
1322484985.210652000 83.233.57.136 QWUiCAE4E7U
The first column is a timestamp, the second is an IP address, and the third is a hash value.
If two or more successive rows have the same IP address and hash value, I need to subtract the first timestamp of the duplicated rows from the last one; in this case, that is 1322484981.431075000 - 1322484981.070116000.
If the result is less than 5, I will keep only the first (earliest) row in the file.
If the result is more than 5, I will keep the first and the last duplicated rows and delete the rows between them.
Since I'm pretty new to Python, this problem is a bit complicated for me. I don't know what kind of function is needed. Can anyone help a little?
Upvotes: 2
Views: 177
Reputation: 80811
In a basic way, it could look like this:
last_time = 0.0
last_ip = None
last_hash = None
with open("data.txt", "r") as data:
    for line in data:
        # Each row is: timestamp, IP address, hash value
        timestamp, ip, hash_value = line.split()
        # Note: each row is compared with the row just before it
        if ip == last_ip and hash_value == last_hash and float(timestamp) - float(last_time) < 5.0:
            print("Remove", line.rstrip())
        else:
            print("Keep", line.rstrip())
        last_time, last_ip, last_hash = timestamp, ip, hash_value
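The loop above compares each row with the one just before it. If you want the exact rule from the question (last timestamp of a run minus the first, keep the first row, and also keep the last row when the gap is large), one possible sketch uses itertools.groupby, which only groups adjacent rows. The function name dedupe and the threshold parameter are my own, not from the question, and I am assuming a gap of exactly 5 counts as "more than 5", since the question does not say.

```python
from itertools import groupby


def dedupe(lines, threshold=5.0):
    """Collapse runs of successive rows that share the same IP and hash.

    Hypothetical helper: the name and the `threshold` parameter are
    assumptions made for this sketch.
    """
    rows = [line.split() for line in lines if line.strip()]
    kept = []
    # groupby merges only adjacent rows with the same key,
    # which matches "successive rows" in the question.
    for _, run in groupby(rows, key=lambda row: (row[1], row[2])):
        run = list(run)
        first, last = run[0], run[-1]
        kept.append(" ".join(first))
        # Assumption: a gap of exactly `threshold` also keeps the last row.
        if len(run) > 1 and float(last[0]) - float(first[0]) >= threshold:
            kept.append(" ".join(last))
    return kept


sample = """\
1322484979.322313000 85.24.168.19 QQlb-j7itDQ
1322484981.070116000 83.233.56.133 Ne8Bb1d5oyc
1322484981.128791000 83.233.56.133 Ne8Bb1d5oyc
1322484981.431075000 83.233.56.133 Ne8Bb1d5oyc
1322484985.210652000 83.233.57.136 QWUiCAE4E7U
"""

for row in dedupe(sample.splitlines()):
    print(row)
```

With the sample data, the three 83.233.56.133 rows span well under 5 seconds, so only the first of them is kept. To run it on a file, pass the open file object instead of sample.splitlines().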
Upvotes: 3