Reputation: 341
I have a pandas DataFrame consisting of strings, e.g. 'P1', 'P2', 'P3', ..., null.
When I try to concatenate this data frame with another, all of the strings get replaced with 'NaN'.
See my code below:
import operator
import pandas as pd

descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)
descriptions['desc'] = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f1=pd.DataFrame(descriptions['desc'])
bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)
bugPrior['priority'] = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
f2=pd.DataFrame(bugPrior['priority'])
df = pd.concat([f1,f2])
print(df.head())
The output is as follows:
desc priority
0 Usability issue with external editors (1GE6IRL) NaN
1 API - VCM event notification (1G8G6RR) NaN
2 Would like a way to take a write lock on a tea... NaN
3 getter/setter code generation drops "F" in "..... NaN
4 Create Help Index Fails with seemingly incorre... NaN
Any ideas as to how I might stop this from happening?
Ultimately, my goal is to have everything in a single data frame so that I can remove all rows with "null" values. It would also help later on in the code.
Thanks.
Upvotes: 1
Views: 797
Reputation: 862681
I think it is best not to create DataFrames from the columns and to work with the Series directly:
descriptions = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/short_desc.json')
descriptions = descriptions.reset_index(drop=True)
# extract the description text as a Series, f1
f1 = descriptions.short_desc.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f1.head())
bugPrior = pd.read_json('https://raw.githubusercontent.com/ansymo/msr2013-bug_dataset/master/data/v02/eclipse/priority.json')
bugPrior = bugPrior.reset_index(drop=True)
# extract the priority label as a Series, f2
f2 = bugPrior.priority.apply(operator.itemgetter(0)).apply(operator.itemgetter('what'))
print(f2.head())
Then use the same solution as in cᴏʟᴅsᴘᴇᴇᴅ's answer:
df = pd.concat([f1,f2], axis=1).dropna().reset_index(drop=True)
print (df.head())
short_desc priority
0 Create Help Index Fails with seemingly incorre... P3
1 Internal compiler error when compiling switch ... P3
2 Default text sizes in org.eclipse.jface.resour... P3
3 [Presentations] [ViewMgmt] Holding mouse down ... P3
4 Parsing of function declarations in stdio.h is... P2
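The column is labelled short_desc rather than desc in the output above because, when Series are concatenated with axis=1, pandas uses each Series' name as the column label. A minimal sketch with made-up data (s1, s2 and their values are placeholders, not the Eclipse dataset) shows this, along with one way to rename the column afterwards:

```python
import pandas as pd

# Hypothetical stand-ins for f1 and f2; the names mimic the original columns.
s1 = pd.Series(["first bug", "second bug"], name="short_desc")
s2 = pd.Series(["P3", "P1"], name="priority")

# With axis=1, each Series becomes a column labelled with its name.
df = pd.concat([s1, s2], axis=1)
print(list(df.columns))

# Optional: rename the column if "desc" is preferred.
renamed = df.rename(columns={"short_desc": "desc"})
print(list(renamed.columns))
```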
Upvotes: 2
Reputation: 402523
Assuming you want to concatenate those columns horizontally, you'll need to pass axis=1 to pd.concat, because by default, concatenation is vertical.
df = pd.concat([f1,f2], axis=1)
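To see why the default produces NaN, here is a small sketch with toy frames (the data is invented for illustration; only the column names desc and priority come from the question). Stacking vertically puts the rows of one frame under the other, and since neither frame has the other's column, pandas fills the gaps with NaN:

```python
import pandas as pd

# Toy stand-ins for f1 and f2 (hypothetical data, not the Eclipse dataset).
f1 = pd.DataFrame({"desc": ["first bug", "second bug", "third bug"]})
f2 = pd.DataFrame({"priority": ["P3", "P1", "P2"]})

# Default axis=0: rows are stacked, so the result has 6 rows and
# the first 3 rows have no value for "priority".
stacked = pd.concat([f1, f2])
print(stacked)

# axis=1: rows are aligned on the index and columns sit side by side.
side_by_side = pd.concat([f1, f2], axis=1)
print(side_by_side)
```

This also explains why df.head() in the question showed only NaN in the priority column: the first rows all came from f1.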
To drop those NaN rows, you should be able to use df.dropna. Call df.reset_index after.
df = pd.concat([f1, f2], axis=1)
df = df.dropna().reset_index(drop=True)
print(df.head(10))
desc priority
0 Create Help Index Fails with seemingly incorre... P3
1 Internal compiler error when compiling switch ... P3
2 Default text sizes in org.eclipse.jface.resour... P3
3 [Presentations] [ViewMgmt] Holding mouse down ... P3
4 Parsing of function declarations in stdio.h is... P2
5 CCE in RenameResourceAction while renaming ele... P3
6 Option to prevent cursor from moving off end o... P3
7 Tasks section in the user doc is very stale P3
8 Importing existing project with different case... P3
9 Workspace in use --> choose new workspace but ... P3
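As a rough illustration of what the dropna and reset_index steps do, here is a sketch on a small made-up frame (the values are placeholders). dropna removes any row containing a missing value but keeps the original index labels, which is why reset_index(drop=True) is called afterwards to renumber from 0:

```python
import pandas as pd

# Hypothetical frame with missing values in both columns.
df = pd.DataFrame({
    "desc": ["first bug", None, "third bug", "fourth bug"],
    "priority": ["P3", "P1", None, "P2"],
})

# Rows 1 and 2 each contain a missing value and are removed;
# the surviving rows keep their old labels 0 and 3.
dropped = df.dropna()
print(list(dropped.index))

# drop=True discards the old labels instead of adding them as a column.
cleaned = dropped.reset_index(drop=True)
print(list(cleaned.index))
```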
Printing out df.priority.unique(), we see there are 5 unique priorities:
print(df.priority.unique())
array(['P3', 'P2', 'P4', 'P1', 'P5'], dtype=object)
Upvotes: 2