Reputation: 1

finding the same words in two files and leaving out not repeated ones in python

I have to write a program that correlates smoking with lung cancer risk. The data is in two files. My code currently pairs the values that appear on the same line of each file (e.g. America,23.3 with Spain,77.9 and Italy,24.2 with Russia,60.8). How can I modify it so that it pairs the values for the same country and leaves out the countries that occur in only one file (it shouldn't use Germany, France, China, or Korea, because each of them is in only one file)? Thank you in advance for your help.

smoking file:

Country, Percent Cigarette Smokers Data

America,23.3

Italy,24.2

Russia,23.7

France,14.9

England,17.9

Spain,17

Germany,21.7

second file:

Cases Lung Cancer per 100000 

Spain,77.9

Russia,60.8

Korea,61.3

America,73.3

China,66.8

Vietnam,64.5

Italy,43.9

and my code:

import math


def readFiles(smoking_datafile, cancer_datafile):
    '''
        Reads the data from the provided file objects smoking_datafile
        and cancer_datafile. Returns a list of the data read from each
        in a tuple of the form (smoking_data, cancer_data).
    '''

    # init
    smoking_data = []
    cancer_data = []
    empty_str = ''

    # read past file headers
    smoking_datafile.readline()
    cancer_datafile.readline()

    # read data files
    eof = False

    while not eof:

        # read line of data from each file
        s_line = smoking_datafile.readline()
        c_line = cancer_datafile.readline()

        # check if at end-of-file of both files
        if s_line == empty_str and c_line == empty_str:
            eof = True

        # check if end of smoking data file only
        elif s_line == empty_str:
            raise OSError('Unexpected end-of-file for smoking data file')

        # check if at end of cancer data file only
        elif c_line == empty_str:
            raise OSError('Unexpected end-of-file for cancer data file')

        # append line of data to each list
        else:
            smoking_data.append(s_line.strip().split(','))
            cancer_data.append(c_line.strip().split(','))

    # return list of data from each file
    return (smoking_data, cancer_data)


def calculateCorrelation(smoking_data, cancer_data):
    '''
        Calculates and returns the correlation value for the data
        provided in lists smoking_data and cancer_data
    '''

    # init
    sum_smoking_vals = sum_cancer_vals = 0
    sum_smoking_sqrd = sum_cancer_sqrd = 0
    sum_products = 0

    # calculate intermediate correlation values
    num_values = len(smoking_data)

    for k in range(0, num_values):

        sum_smoking_vals = sum_smoking_vals + float(smoking_data[k][1])
        sum_cancer_vals = sum_cancer_vals + float(cancer_data[k][1])

        sum_smoking_sqrd = sum_smoking_sqrd + \
                           float(smoking_data[k][1]) ** 2
        sum_cancer_sqrd = sum_cancer_sqrd + \
                          float(cancer_data[k][1]) ** 2

        sum_products = sum_products + float(smoking_data[k][1]) * \
                       float(cancer_data[k][1])

    # calculate and return correlation value
    numer = (num_values * sum_products) - \
            (sum_smoking_vals * sum_cancer_vals)

    denom = math.sqrt(abs(
        ((num_values * sum_smoking_sqrd) - (sum_smoking_vals ** 2)) *
        ((num_values * sum_cancer_sqrd) - (sum_cancer_vals ** 2))
        ))

    return numer / denom

Upvotes: 0

Answers (2)

Red

Reputation: 27577

This builds a list of every country that appears in both files, along with its data:

l3 = []
with open('smoking.txt','r') as f1, open('cancer.txt','r') as f2:
    l1, l2 = f1.readlines(), f2.readlines()

for s1 in l1:
    for s2 in l2:
        if s1.split(',')[0] == s2.split(',')[0]:
            cty = s1.split(',')[0]
            smk = s1.split(',')[1].strip()
            cnr = s2.split(',')[1].strip()
            l3.append(f"{cty}: smoking: {smk}, cancer: {cnr}")

print(l3)

Output:

['America: smoking: 23.3, cancer: 73.3', 'Italy: smoking: 24.2, cancer: 43.9', 'Russia: smoking: 23.7, cancer: 60.8', 'Spain: smoking: 17, cancer: 77.9']
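Because the correlation function in the question walks two lists in parallel, one option is to build both lists already aligned by country. The following is only a minimal sketch, assuming each file has a single header line followed by `Country,value` rows. The helper names `parse_rows` and `align_by_country` are made up for illustration, and the sample rows are inlined from the question rather than read from disk:

```python
def parse_rows(lines):
    """Map country -> value string, skipping the header and blank lines."""
    rows = {}
    for line in lines[1:]:  # lines[0] is assumed to be the header
        line = line.strip()
        if line:
            country, value = line.split(',')
            rows[country] = value
    return rows


def align_by_country(smoking_lines, cancer_lines):
    """Return two lists of [country, value] pairs for shared countries only."""
    smoking = parse_rows(smoking_lines)
    cancer = parse_rows(cancer_lines)
    # Keep the order of the smoking file; drop countries missing from either file
    shared = [country for country in smoking if country in cancer]
    smoking_data = [[country, smoking[country]] for country in shared]
    cancer_data = [[country, cancer[country]] for country in shared]
    return smoking_data, cancer_data


# Sample rows copied from the question (hypothetical stand-in for f.readlines())
smoking_lines = [
    "Country, Percent Cigarette Smokers Data\n",
    "America,23.3\n", "Italy,24.2\n", "Russia,23.7\n", "France,14.9\n",
    "England,17.9\n", "Spain,17\n", "Germany,21.7\n",
]
cancer_lines = [
    "Cases Lung Cancer per 100000 \n",
    "Spain,77.9\n", "Russia,60.8\n", "Korea,61.3\n", "America,73.3\n",
    "China,66.8\n", "Vietnam,64.5\n", "Italy,43.9\n",
]

smoking_data, cancer_data = align_by_country(smoking_lines, cancer_lines)
print(smoking_data)
print(cancer_data)
```

Both lists have the same length and the same country at each index, so they could be passed to a function like `calculateCorrelation` from the question without changing it. The dictionary lookup also avoids comparing every line of one file against every line of the other.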

Upvotes: 0

0p3r4t0r

Reputation: 693

Let's just focus on getting the data into a format that is easy to work with. The code below will get you a dictionary of the form ...

smokers_cancer_data = {
    'America': {
        'smokers': 23.3,
        'cancer': 73.3
    },
    'Italy': {
        'smokers': 24.2,
        'cancer': 43.9
    },
    ...
}

Once you have this you can get any values you need and perform your calculations. See the code below.

def read_data(filename: str) -> dict:
    with open(filename, 'r') as file:
        next(file)  # Skip the header
        data = {}
        for line in file:
            cleaned_line = line.rstrip()
            # Skip blank lines
            if cleaned_line:
                data_item = cleaned_line.split(',')
                data[data_item[0]] = float(data_item[1])
    return data


# Load data into python dictionaries
smokers_data = read_data('smokersData.txt')
cancer_data = read_data('lungCancerData.txt')


# Build one dictionary that is easy to work with
smokers_cancer_data = dict()
for (key, value) in smokers_data.items():
    if key in cancer_data:
        smokers_cancer_data[key] = {
            'smokers': smokers_data[key],
            'cancer' : cancer_data[key]  
        }

print(smokers_cancer_data)

For example, to calculate the sums of the smoker and cancer values:

smokers_total = 0
cancer_total = 0
for (key, value) in smokers_cancer_data.items():
    smokers_total += value['smokers']
    cancer_total += value['cancer']
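To tie this back to the correlation in the question, the merged dictionary can be split into two parallel lists of numbers. The sketch below is one way to do it, not the only one: `pearson` is a hypothetical helper that uses the same sums as the question's formula, and the dictionary literal is inlined from the sample data so the block runs without the files.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numer = n * sum_xy - sum_x * sum_y
    # Note: denom is 0 if either list has no variation; not handled here
    denom = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numer / denom


# Merged data for the four countries present in both sample files
smokers_cancer_data = {
    'America': {'smokers': 23.3, 'cancer': 73.3},
    'Italy': {'smokers': 24.2, 'cancer': 43.9},
    'Russia': {'smokers': 23.7, 'cancer': 60.8},
    'Spain': {'smokers': 17.0, 'cancer': 77.9},
}

# Both lists iterate the same dictionary, so index k refers to the same country
smokers = [value['smokers'] for value in smokers_cancer_data.values()]
cancer = [value['cancer'] for value in smokers_cancer_data.values()]

r = pearson(smokers, cancer)
print(round(r, 3))
```

Because each country's two values live in one dictionary entry, they cannot drift out of alignment the way line-by-line reading of two files can.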

Upvotes: 2
