Reputation: 181
Suppose I have the following dataframe:
import pandas as pd
tempDic = {0: {0: 'class([1,0,0,0],"Small-molecule metabolism ").', 1: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 2: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 3: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 4: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 5: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 6: 'class([1,1,0,0],"Degradation ").', 7: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 8: 'function(tb536,[1,1,1,0],\'galE2\',"UDP-glucose 4-epimerase").', 9: 'function(tb620,[1,1,1,0],\'galK\',"galactokinase").', 10: 'function(tb619,[1,1,1,0],\'galT\',"galactose-1-phosphate uridylyltransferase C-term").', 11: 'class([1,1,1,0],"Carbon compounds ").', 12: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 13: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 14: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 15: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 16: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 17: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 18: 'class([1,1,1,0],"xyz ").'}}
df = pd.DataFrame(tempDic)
print(df)
0
0 class([1,0,0,0],"Small-molecule metabolism ").
1 function(tb186,[1,1,1,0],'bglS',"beta-glucosid...
2 function(tb2202,[1,1,1,0],'cbhK',"carbohydrate...
3 function(tb727,[1,1,1,0],'fucA',"L-fuculose ph...
4 function(tb1731,[1,1,1,0],'gabD1',"succinate-s...
5 function(tb234,[1,1,1,0],'gabD2',"succinate-se...
6 class([1,1,0,0],"Degradation ").
7 function(tb501,[1,1,1,0],'galE1',"UDP-glucose ...
8 function(tb536,[1,1,1,0],'galE2',"UDP-glucose ...
9 function(tb620,[1,1,1,0],'galK',"galactokinase").
10 function(tb619,[1,1,1,0],'galT',"galactose-1-p...
11 class([1,1,1,0],"Carbon compounds ").
12 function(tb186,[1,1,1,0],'bglS',"beta-glucosid...
13 function(tb2202,[1,1,1,0],'cbhK',"carbohydrate...
14 function(tb727,[1,1,1,0],'fucA',"L-fuculose ph...
15 function(tb1731,[1,1,1,0],'gabD1',"succinate-s...
16 function(tb234,[1,1,1,0],'gabD2',"succinate-se...
17 function(tb501,[1,1,1,0],'galE1',"UDP-glucose ...
18 class([1,1,1,0],"xyz ").
What I need is a strategy that will give me a result like this:
Class Count
Small-molecule metabolism 5
Degradation 4
Carbon compounds 6
xyz 0
Each row that starts with "class" contains the name of the class in double quotes, for example, "Small-molecule metabolism" in the first row. This row is then followed by rows starting with "function". We just need to count the "function" rows that follow each "class" row and put that count next to the class name. A class that is not followed by any "function" rows should get the value 0, meaning that the class has zero functions.
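To make the counting rule concrete, here is a minimal plain-Python sketch of it, without pandas. The helper name `count_functions` and the shortened sample list are assumptions made for illustration only; the sample rows are copied from the dataframe above.

```python
import re

def count_functions(lines):
    """Return [(class_name, function_count), ...] in order of appearance."""
    result = []
    for line in lines:
        if line.startswith('class'):
            # The class name is the text between the double quotes
            name = re.search(r'"([^"]*)"', line).group(1).strip()
            result.append([name, 0])
        elif line.startswith('function') and result:
            # Attribute the function row to the most recent class row
            result[-1][1] += 1
    return [tuple(item) for item in result]

# Shortened sample (hypothetical subset of the rows shown above)
sample = [
    'class([1,1,0,0],"Degradation ").',
    'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").',
    'function(tb536,[1,1,1,0],\'galE2\',"UDP-glucose 4-epimerase").',
    'class([1,1,1,0],"xyz ").',
]
print(count_functions(sample))
```

Any "function" rows that appear before the first "class" row are ignored in this sketch.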
Upvotes: 2
Views: 58
Reputation: 1413
Try this:
import re
import pandas as pd
from itertools import compress
tempDic = {0: {0: 'class([1,0,0,0],"Small-molecule metabolism ").', 1: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 2: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 3: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 4: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 5: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 6: 'class([1,1,0,0],"Degradation ").', 7: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 8: 'function(tb536,[1,1,1,0],\'galE2\',"UDP-glucose 4-epimerase").', 9: 'function(tb620,[1,1,1,0],\'galK\',"galactokinase").', 10: 'function(tb619,[1,1,1,0],\'galT\',"galactose-1-phosphate uridylyltransferase C-term").', 11: 'class([1,1,1,0],"Carbon compounds ").', 12: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 13: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 14: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 15: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 16: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 17: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 18: 'class([1,1,1,0],"xyz ").'}}
df = pd.DataFrame(tempDic)
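df_final=pd.DataFrame()
is_class = df[0].str.startswith('class')  # only rows that begin with 'class'
# the first double-quoted string on each class row is the class name
df_final['class'] = [m[0] for m in compress([re.findall('"([^"]*)"', s) for s in df[0]], is_class)]
# positions of the class rows, plus len(df) so the last class is closed as well
bounds = pd.Series(df.index[is_class].tolist() + [len(df)])
# gap between consecutive class rows, minus 1 for the class row itself
df_final['count'] = bounds.diff().dropna().reset_index(drop=True).sub(1)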
output:
df_final
Out[165]:
class count
0 Small-molecule metabolism 5.0
1 Degradation 4.0
2 Carbon compounds 6.0
3 xyz 0.0
Upvotes: 0
Reputation: 862611
Use Series.str.startswith to build a mask for the class rows, get the values between the double quotes with Series.str.extract, forward fill the missing values, and then use GroupBy.size and subtract 1 (for the class row itself):
df['Class'] = df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False)
df['Class'] = df['Class'].ffill()
s = df.groupby('Class', sort=False).size().sub(1).reset_index(name='Count')
print (s)
Class Count
0 Small-molecule metabolism 5
1 Degradation 4
2 Carbon compounds 6
3 xyz 0
Details of steps:
print(df.loc[df[0].str.startswith('class'), 0])
0 class([1,0,0,0],"Small-molecule metabolism ").
6 class([1,1,0,0],"Degradation ").
11 class([1,1,1,0],"Carbon compounds ").
18 class([1,1,1,0],"xyz ").
Name: 0, dtype: object
print (df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False))
0 Small-molecule metabolism
6 Degradation
11 Carbon compounds
18 xyz
Name: 0, dtype: object
df['Class'] = df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False)
print (df['Class'])
0 Small-molecule metabolism
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 Degradation
7 NaN
8 NaN
9 NaN
10 NaN
11 Carbon compounds
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 xyz
Name: Class, dtype: object
df['Class'] = df['Class'].ffill()
print (df['Class'])
0 Small-molecule metabolism
1 Small-molecule metabolism
2 Small-molecule metabolism
3 Small-molecule metabolism
4 Small-molecule metabolism
5 Small-molecule metabolism
6 Degradation
7 Degradation
8 Degradation
9 Degradation
10 Degradation
11 Carbon compounds
12 Carbon compounds
13 Carbon compounds
14 Carbon compounds
15 Carbon compounds
16 Carbon compounds
17 Carbon compounds
18 xyz
Name: Class, dtype: object
print (df.groupby('Class', sort=False).size())
Class
Small-molecule metabolism 6
Degradation 5
Carbon compounds 7
xyz 1
dtype: int64
df1 = df.groupby('Class', sort=False).size().sub(1).reset_index(name='Count')
print (df1)
Class Count
0 Small-molecule metabolism 5
1 Degradation 4
2 Carbon compounds 6
3 xyz 0
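One caveat: grouping by the class name merges blocks if the same name appears more than once. A minimal sketch of a variant that groups by block position instead, using the cumulative sum of the class mask, is below. The sample rows are made up for illustration (with "A" deliberately repeated), and the sketch assumes the first row is a class row.

```python
import pandas as pd

# Made-up sample in the same format; "A" deliberately appears twice
rows = [
    'class([1,0,0,0],"A ").',
    'function(tb1,[1,1,1,0],\'x1\',"first").',
    'function(tb2,[1,1,1,0],\'x2\',"second").',
    'class([1,1,0,0],"B ").',
    'class([1,1,1,0],"A ").',
    'function(tb3,[1,1,1,0],\'x3\',"third").',
]
df = pd.DataFrame({0: rows})

is_class = df[0].str.startswith('class')
# Every class row starts a new block, so the block ids are 1, 1, 1, 2, 3, 3
block = is_class.cumsum()

out = pd.DataFrame({
    # The name comes from the class row that opens each block
    'Class': df.loc[is_class, 0].str.extract('"(.+)"', expand=False).str.strip().to_numpy(),
    # Number of function rows inside each block
    'Count': df[0].str.startswith('function').groupby(block).sum().to_numpy(),
})
print(out)
```

Here the two "A" blocks stay separate, and counting the function rows directly removes the need to subtract 1.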
Upvotes: 2
Reputation: 6564
Here you go:
import re
import pandas as pd
tempDic = {0: {0: 'class([1,0,0,0],"Small-molecule metabolism ").', 1: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 2: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 3: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 4: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 5: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 6: 'class([1,1,0,0],"Degradation ").', 7: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 8: 'function(tb536,[1,1,1,0],\'galE2\',"UDP-glucose 4-epimerase").', 9: 'function(tb620,[1,1,1,0],\'galK\',"galactokinase").', 10: 'function(tb619,[1,1,1,0],\'galT\',"galactose-1-phosphate uridylyltransferase C-term").', 11: 'class([1,1,1,0],"Carbon compounds ").', 12: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 13: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 14: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 15: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 16: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 17: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 18: 'class([1,1,1,0],"xyz ").'}}
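df = pd.DataFrame(tempDic)
df.columns = ['text']
is_class = df.text.str.startswith('class', na=False)  # True on rows starting with 'class'
# Extract the value between the quotes on class rows; all other rows become NaN
df['class'] = df.loc[is_class, 'text'].apply(lambda x: re.findall(r"['\"](.*?)['\"]", x)[0])
df['class'] = df['class'].ffill()  # carry each class name down to its function rows
df.text.str.startswith('function', na=False).groupby(df['class'], sort=False).sum()  # function rows per class, 0 if none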
Upvotes: 0