Reputation: 187
I'm manipulating a JSON file from which I have run this code to get the following dataframe:
import pandas as pd
topics = df.set_index('username').popular_board_data.str.extractall(r'name":"([^,]*)')
total = df.set_index('username').popular_board_data.str.extractall(r'totalCount\":([^,}]*)')
data = []
for username in df.username.unique():
for topic in zip(topics[0][username], total[0][username]):
data.append([username, topic])
df_topic = pd.DataFrame(data, columns='username,topic'.split(','))
username topic
0 lukl (Hardware", 80)
1 lukl (Marketplace", 31)
2 lukl (Atari 5200", 27)
3 lukl (Atari 8-Bit Computers", 9)
4 lukl (Modern Gaming", 3)
Now I need to split the information from the column 'topic' into two different columns:
This is the expected result:
username topic _topic _total
0 lukl (Hardware", 80) Hardware 80
1 lukl (Marketplace", 31) Marketplace 31
2 lukl (Atari 5200", 27) Atari 5200 27
3 lukl (Atari 8", 9) Atari 8 9
4 lukl (Modern", 3) Modern 3
I though about doing it with this code:
df_top = df_topic.copy()
df_top['_topic'] = df_topic['topic'].str.split('(').str[1].str.split('",').str[0]
df_top['_total'] = df_topic['topic'].str.split('",').str[1].str.split(')').str[0]
df_top
But I'm getting this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Upvotes: 4
Views: 1605
Reputation: 828
You can use regular expressions like:
df['_topic'] = df['topic'].str.extract(r'([a-zA-Z]+)')
df['_total'] = df['topic'].str.extract(r'(\d+)')
username topic _topic _total
0 lukl (Hardware", 80) Hardware 80
1 lukl (Marketplace", 31) Marketplace 31
2 lukl (Atari 5200", 27) Atari 5200
3 lukl (Atari 8-Bit Computers", 9) Atari 8
4 lukl (Modern Gaming", 3) Modern 3
Upvotes: 0
Reputation: 862481
I think there are tuples, so use DataFrame
constructor only:
df_topic[['_topic', '_total']]=pd.DataFrame(df_topic['topic'].values.tolist(),
index=df_topic.index)
Better solution with your previous answer data with concat
and DataFrame.reset_index
:
df = [{"username": "last",
"popular_board_data": "{\"boards\":[{\"postCount\":\"75\",\"topicCount\":\"5\",\"name\":\"Hardware\",\"url\",\"totalCount\":80},{\"postCount\":\"20\",\"topicCount\":\"11\",\"name\":\"Marketplace\",\"url\",\"totalCount\":31},{\"postCount\":\"26\",\"topicCount\":\"1\",\"name\":\"Atari 5200\",\"url\",\"totalCount\":27},{\"postCount\":\"9\",\"topicCount\":0,\"name\":\"Atari 8\",\"url\"\"totalCount\":9}"
},
{"username": "truk",
"popular_board_data": "{\"boards\":[{\"postCount\":\"351\",\"topicCount\":\"11\",\"name\":\"Atari 2600\",\"url\",\"totalCount\":362},{\"postCount\":\"333\",\"topicCount\":\"22\",\"name\":\"Hardware\",\"url\",\"totalCount\":355},{\"postCount\":\"194\",\"topicCount\":\"8\",\"name\":\"Marketplace\",\"url\",\"totalCount\":202}"
}]
df = pd.DataFrame(df)
#added " for remove it from output
topics = df.set_index('username').popular_board_data.str.extractall(r'name":"([^,]*)"')
total = df.set_index('username').popular_board_data.str.extractall(r'totalCount\":([^,}]*)')
df1 = pd.concat([topics[0], total[0]], axis=1, keys=['_topic', '_total'])
df1 = df1.reset_index(level=1, drop=True).reset_index()
print (df1)
username _topic _total
0 last Hardware 80
1 last Marketplace 31
2 last Atari 5200 27
3 last Atari 8 9
4 truk Atari 2600 362
5 truk Hardware 355
6 truk Marketplace 202
Upvotes: 2
Reputation: 3739
I am taking topic as string if not string then convert it in string
df = pd.DataFrame(data={"username":['luk1','luk1','luk1'],
'topic':[ '(Hardware, 80)','(Marketplace, 31)', '(Atari 5200, 27)']})
df['_topic'] = df['topic'].apply(lambda x:str(x).split(",")[0][1:])
df['_total'] = df['topic'].apply(lambda x:str(x).split(",")[1][:-1])
username topic _topic _total
0 luk1 (Hardware, 80) Hardware 80
1 luk1 (Marketplace, 31) Marketplace 31
2 luk1 (Atari 5200, 27) Atari 5200 27
Upvotes: 0