AKJ
AKJ

Reputation: 25

How to remove Duplicates in .txt file

I have a .txt file with the below entries:-

Apples  51824
Oranges  131236
Peaches  6564
Apples  5879
Peaches 69878

I am trying to remove the entire row (when duplicate entries are found) from this file whenever a word (say Apples) matches in that row (keeping in mind that the entry with the highest value stays).

What I presently do:-

1. Open the file in Excel.
2. Go to Data --> Remove Duplicates

The issue with this approach according to me is that I am not sure whether the end result gives me the data with highest values all the time or not.

So, How can it be programmatically (in python, preferably) done?

Upvotes: 2

Views: 1128

Answers (6)

Ivan Cvetković
Ivan Cvetković

Reputation: 174

Use dictionary to remember best value/line pair for each fruit:

results = {}

with open('file.txt') as f:
    for line in f:
        fruit, value = line.split()
        value = int(value)
        if fruit not in results or results[fruit][0] < value:
            results[fruit] = (value, line.strip())

print('\n'.join(item[1] for item in results.values()))

Upvotes: 0

Ahmed Abuthwabah
Ahmed Abuthwabah

Reputation: 299

Here are 2 solutions one in Python and another in Nodejs without using third party libraries:

Python:

import re
import json

with open('data.txt', 'r') as file:
    lines = file.read()

lines = lines.split('\n')
fruit = {}

for line in lines:
    key, value = re.split(r'\s{4}', line)
    if (key not in fruit or int(fruit[key]) < int(value)):
        fruit[key] = value


fruit = json.dumps(fruit)
fruit = re.sub(r'["{}:]', '', fruit)
fruit = re.sub(r', ', '\n', fruit)

with open('fruits.txt', 'w') as file:
    file.write(fruit)

Nodejs:

import fs from 'fs'

const file = fs.readFileSync('data.txt', 'utf8');
const lines = file.split('\n');
let fruit = {}

for (const line of lines) {
    const [key, value] = line.split(/\s{4}/)
    !fruit[key] || +fruit[key] < +value ? fruit[key] = value : null
}

fruit = JSON.stringify(fruit)
    .replace(/["{}]/g, '')
    .replace(/:/g, ' ')
    .replace(/,/g, '\n')

fs.writeFileSync('fruits.txt', fruit)

Upvotes: 2

s3dev
s3dev

Reputation: 9701

Here's a quick solution in just a few lines, and outputs a nice and flat CSV file.

Code:

import pandas as pd

with open('apples.txt') as f:
    text = [[i, int(j)] for i, j in [i.strip().split() for i in f.readlines()]]

(pd.DataFrame(text, columns=['fruit', 'count'])
 .groupby('fruit')
 .agg({'count': 'max'})
 .reset_index()
 .to_csv('apples_out.txt', index=False))

Output:

fruit,count
Apples,51824
Oranges,131236
Peaches,69878

Upvotes: 0

luckyCasualGuy
luckyCasualGuy

Reputation: 691

from pathlib import Path
import pandas as pd
import numpy as np

textFile = Path("./sample1.txt")

text = textFile.read_text()

rows = text.split("\n")
entries = [x.split("  ") for x in rows]

data = { 
    "Fruits": [x[0] for x in entries],
    "Values": [x[1] for x in entries]
    }

df = pd.DataFrame(data)
new_df = df.groupby(["Fruits"]).max()

new_df.reset_index(inplace=True)

np.savetxt("./out.txt", new_df.values, fmt='%s')

Example:

sample1.txt

Apples  51824
Oranges  131236
Peaches  6564
Apples  5879
Peaches  69878

out.txt

Apples 5879
Oranges 131236
Peaches 69878

Upvotes: 0

woblob
woblob

Reputation: 1377

That would probably work


from collections import defaultdict

filename1 = ""
filename2 = ""

words = defaultdict(int)
with open(filename1) as f1:
    for line in f1:
        word, value = line.strip().split() 
        if int(value) > words[word]:
            words[word] = int(value) 

with open(filename2, "w") as f2:
    for line in words.items():
        f2.write(line)

If you have pandas data frame then:

import pandas 
df = pandas.read_csv(filepath)
result = df.groupby('Name').agg({'values': 'max'}) 
print(result)

Upvotes: 0

Daniel Nudelman
Daniel Nudelman

Reputation: 401

The intuitive way is to use dictionaries:

f = open('test.txt', 'r')
lines = f.readlines()

my_dict = {};
for line in lines:
    s_line = line.split()
    if s_line[0] in my_dict and my_dict[s_line[0]] < int(s_line[1]):
        my_dict[s_line[0]] = int(s_line[1])
    else:
        my_dict[s_line[0]] = int(s_line[1])

new_f = open('test_no_duplicates.txt', 'w')
for key in my_dict:
    new_f.write(key + " " + str(my_dict[key]) + "\n")

f.close()
new_f.close()

Upvotes: 0

Related Questions