Jasper
Jasper

Reputation: 2231

Pandas, concatenate certain columns if other columns are empty

I've got a CSV file that is supposed to look like this:

ID, years_active, issues
-------------------------------
'Truck1', 8, 'In dire need of a paintjob'
'Car 5', 3,  'To small for large groups'

However, the CSV is somewhat malformed and currently looks like this.

ID, years_active, issues
------------------------
'Truck1', 8, 'In dire need'
'','', 'of a'
'','', 'paintjob'
'Car 5', 3, 'To small for'
'', '', 'large groups'

Now, I am able to identify faulty rows by the lack of an 'ID' and 'years_active' value and would like to append the value of 'issues of that row to the last preceding row that had 'ID' and 'years_active' values.

I am not very experienced with pandas, but came up with the following code:

for index, row in df.iterrows():
        if row['years_active'] == None:
            df.loc[index-1]['issues'] += row['issues']

Yet - the IF condition fails to trigger. Is the thing I am trying to do possible? And if so, does anyone have an idea what I am doing wrong?

Upvotes: 2

Views: 1643

Answers (3)

Bill Bell
Bill Bell

Reputation: 21643

It might be worth mentioning in the context of this question that there is an often overlooked way of processing awkward input by using the StringIO library.

The essential point is that read_csv can read from a StringIO 'file'.

In this case, I arrange to discard single quotes and multiple commas that would confuse read_csv, and I append the second and subsequent lines of input to the first line, to form complete, conventional csv lines form read_csv.

Here is what read_csv receives.

       ID   years_active                      issues
0  Truck1              8  In dire need of a paintjob
1   Car 5              3   To small for large groups

The code is ugly but easy to follow.

import pandas as pd
from io import StringIO

for_pd = StringIO()
with open('jasper.txt') as jasper:
    print (jasper.readline(), file=for_pd)
    line = jasper.readline()
    complete_record = ''
    for line in jasper:
        line = ''.join(line.rstrip().replace(', ', ',').replace("'", ''))
        if line.startswith(','):
            complete_record += line.replace(',,', ',').replace(',', ' ')
        else:
            if complete_record:
                print (complete_record, file=for_pd)
            complete_record = line
if complete_record:
    print (complete_record, file=for_pd)

for_pd.seek(0)

df = pd.read_csv(for_pd)
print (df)

Upvotes: 1

rnso
rnso

Reputation: 24545

Following uses a for loop to find and add strings (dataframe from JonClements' answer):

df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})


ss = ""; ii = 0; ilist = [0]
for i in range(len(df.index)): 
    if i>0 and df.ID[i] != "":
        df.issues[ii] = ss
        ss = df.issues[i]
        ii = i
        ilist.append(ii)
    else: 
        ss += ' '+df.issues[i]
df.issues[ii] = ss
df = df.iloc[ilist]
print(df)

Output:

       ID                       issues years_active
0  Truck1   In dire need of a paintjob            8
3   Car 5    To small for large groups            3

Upvotes: 2

Jon Clements
Jon Clements

Reputation: 142156

Given your sample input:

df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})

You can use:

new_df = df.groupby(df.ID.replace('', method='ffill')).agg({'years_active': 'first', 'issues': ' '.join})

Which'll give you:

        years_active                      issues
ID                                              
Car 5              3   To small for large groups
Truck1             8  In dire need of a paintjob

So what we're doing here is forward filling the non-blank IDs into subsequent blank IDs and using those to group the related rows. We then aggregate to take the first occurrence of the years_active and join together the issues columns in the order they appear to create a single result.

Upvotes: 3

Related Questions