Reputation: 167
I got this messy dataset from a csv-file that contains multiple entries in the same cell. This is how it looks:
file = ('messy.csv')
df = pd.read_csv(file)
df.head()
Folders Files
aa; bb; aa.src aa.xml ; bb.src bb.war ;
cc; cc.pom cc.py cc.js ;
dd; ee; ff; dd.ts dd.js ; ee.py ; ff.xml ff.js ;
In the Folders column the values are separated with a semicolon ";". In the Files column the values are separated with a space and a semicolon " ;". Files that belongs to the same Folder are only separated with a space. I need help to reshape this into a more manageable dataframe, or into a JSON-dict/list. I haven't found many examples in which there are multiple values in the same cell that I can take help from.
Sure, a "manageable" format is kinda ambiguous but anything is better than this...
Something like this maybe:
Folders Files 1 Files 2 Files 3
aa aa.src aa.xml NaN
bb bb.src bb.war NaN
cc cc.pom cc.py cc.js
dd dd.ts dd.js NaN
ee ee.py NaN NaN
ff ff.xml ff.js NaN
Or if there are better ideas I am open for suggestions. After I've reshaped it, it's gonna be converted into a JSON-format.
Upvotes: 0
Views: 56
Reputation: 994
Turning it into a json/dict
Ok so probably not the most efficient solution but it works:
import pandas as pd
# Recreating the dataframe
df = pd.DataFrame({'Folders':["aa; bb;", "cc", "dd; ee; ff;"], 'Files':['aa.src aa.xml ; bb.src bb.war ;', 'cc.pom cc.py cc.js ;', 'dd.ts dd.js ; ee.py ; ff.xml ff.js ;']})
#Split df according to ; and removing leading ;
df = df.apply(lambda x: x.str.rstrip(';').str.split(';'))
print(df)
So now your dataframe looks like this:
Folders Files
0 [aa, bb] [aa.src aa.xml , bb.src bb.war ]
1 [cc] [cc.pom cc.py cc.js ]
2 [dd, ee, ff] [dd.ts dd.js , ee.py , ff.xml ff.js ]
Then I loop through the dataframe to build the dict:
# Creating the dict by looping through the dataframe and number of elements of folders
df_dict=dict()
for index, row in df.iterrows():
for i, key in enumerate(row['Folders']):
df_dict[key.strip()] = row['Files'][i].strip().split(' ')
print(df_dict)
And this is the output:
{'aa': ['aa.src', 'aa.xml'], 'bb': ['bb.src', 'bb.war'], 'cc'
: ['cc.pom', 'cc.py', 'cc.js'], 'dd': ['dd.ts', 'dd.js'], 'ee
': ['ee.py'], 'ff': ['ff.xml', 'ff.js']}
If you can encounter twice the same key, I suggest to use this version of the code who check if the key already exists:
import pandas as pd
# Recreating the dataframe
df = pd.DataFrame({'Folders':["aa; bb;", "cc", "dd; ee; ff;", 'aa'], 'Files':['aa.src aa.xml ; bb.src bb.war ;', 'cc.pom cc.py cc.js ;', 'dd.ts dd.js ; ee.py ; ff.xml ff.js ;', 'aa.tst']})
#Split df according to ; and removing leading ;
df = df.apply(lambda x: x.str.rstrip(';').str.split(';'))
print(df)
df_dict=dict()
for index, row in df.iterrows():
for i, key in enumerate(row['Folders']):
if key.strip() in df_dict:
df_dict[key.strip()] += row['Files'][i].strip().split(' ')
else: df_dict[key.strip()] = row['Files'][i].strip().split(' ')
print(df_dict)
Upvotes: 1