Reputation: 83
I'm trying to create new columns by the fish species name and have the integer as the value, keeping the indexing to do a dataframe join afterwards.
import pandas as pd
df = pd.read_csv("fishCounts.csv",index_col=0)
countsdf = df[["Fish Count"]].copy()
countsdf.head()
Fish Count
0 38 Sand Bass, 16 Sculpin, 10 Blacksmith
1 138 Sculpin, 28 Sand Bass
2 150 Sculpin Released, 102 Sculpin, 40 Sanddab
3 156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3 ...
4 161 Sculpin
countsdf.columns = ["fish"]
countsdf.fish = countsdf.fish.str.split(", ", expand=False)
countsdf.head()
fish
0 [38 Sand Bass, 16 Sculpin, 10 Blacksmith]
1 [138 Sculpin, 28 Sand Bass]
2 [150 Sculpin Released, 102 Sculpin, 40 Sanddab]
3 [156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3...
4 [161 Sculpin]
Here's where I'm not sure where to go. Iterate through the dataframe rows? Make a list of dictionaries? Could I have imported the data differently to make this easier?
Edit: This is what I'm trying to get to.
Sand Bass Sculpin Blacksmith Sculpin Released Sanddab Black Croaker
0 38 16 10
1 28 138
2 102 150 40
3 29 156 5
4 161
Upvotes: 3
Views: 114
Reputation: 23099
We can use str.split
and str.extract
with stack
:
s = df['Fish Count'].str.split(',',expand=True).stack()
s.str.extract('(\d+)(\D+)')
yields -
0 1
0 0 38 Sand Bass
1 16 Sculpin
2 10 Blacksmith
1 0 138 Sculpin
1 28 Sand Bass
2 0 150 Sculpin Released
1 102 Sculpin
2 40 Sanddab
3 0 156 Sculpin
1 29 Sand Bass
2 5 Black Croaker
3 3 ...
4 0 161 Sculpin
then it's up to you in regards to the format you want/need.
i.e
s.str.extract('(\d+)(\D+)').groupby(level=[1]).agg(list)
0 1
0 [38, 138, 150, 156, 161] [ Sand Bass, Sculpin, Sculpin Released, Scu...
1 [16, 28, 102, 29] [ Sculpin, Sand Bass, Sculpin, Sand Bass]
2 [10, 40, 5] [ Blacksmith, Sanddab, Black Croaker]
3 [3] [ ...]
or
s.str.extract('(\d+)(\D+)').unstack(1)
0 1
0 1 2 3 0 1 2 3
0 38 16 10 NaN Sand Bass Sculpin Blacksmith NaN
1 138 28 NaN NaN Sculpin Sand Bass NaN NaN
2 150 102 40 NaN Sculpin Released Sculpin Sanddab NaN
3 156 29 5 3 Sculpin Sand Bass Black Croaker ...
4 161 NaN NaN NaN Sculpin NaN NaN NaN
or
s.str.extract('(\d+)(\D+)').values
array([['38', ' Sand Bass'],
['16', ' Sculpin'],
['10', ' Blacksmith'],
['138', ' Sculpin'],
['28', ' Sand Bass'],
['150', ' Sculpin Released'],
['102', ' Sculpin'],
['40', ' Sanddab'],
['156', ' Sculpin'],
['29', ' Sand Bass'],
['5', ' Black Croaker'],
['3', ' ...'],
['161', ' Sculpin']], dtype=object)
which you can turn into a dict.
# actually i'd use fish : num -
# sorry closed my ide keys can only be unique in a dict.
{num : fish for num, fish in s.str.extract('(\d+)(\D+)').values}
{'38': ' Sand Bass',
'16': ' Sculpin',
'10': ' Blacksmith',
'138': ' Sculpin',
'28': ' Sand Bass',
'150': ' Sculpin Released',
'102': ' Sculpin',
'40': ' Sanddab',
'156': ' Sculpin',
'29': ' Sand Bass',
'5': ' Black Croaker',
'3': ' ...',
'161': ' Sculpin'}
Upvotes: 2
Reputation: 83
Using @Manakin's answer to get to this multi-indexed dataframe:
0 1
0 0 38 Sand Bass
1 16 Sculpin
2 10 Blacksmith
1 0 138 Sculpin
1 28 Sand Bass
2 0 150 Sculpin Released
1 102 Sculpin
2 40 Sanddab
3 0 156 Sculpin
1 29 Sand Bass
2 5 Black Croaker
4 0 161 Sculpin
I then renamed columns, stripped leading and ending white-space for 'species', switched column order, and set index names.
s.columns = ['num','species']
s.species = s.species.str.strip()
s = s.reindex(['species','num'],axis=1)
s.index.names = ['a','b']
s.head()
species num
a b
0 0 Sand Bass 38
1 Sculpin 16
2 Blacksmith 10
1 0 Sculpin 138
1 Sand Bass 28
Then I flattened and reset indices, and dropped b index.
s_flat = s.reset_index()
s_reindexed = s_flat.set_index(['a','species'])
s_reindexed = s_reindexed.drop(columns='b')
s_reindexed.head()
num
a species
0 Sand Bass 38
Sculpin 16
Blacksmith 10
1 Sculpin 138
Sand Bass 28
Finally I unstacked and dropped the columnar multi-index level. I had a Null column I had to remove as well
s_reindexed = s_reindexed.unstack(1)
s_reindexed.columns = s_reindexed.columns.droplevel(0)
s_reset = s_reindexed.drop(columns=np.nan)
s_reset .head()
species Albacore Barracuda Barracuda Released Bat Ray Released Black Croaker Black Seabass Released Blacksmith Blue Perch Bluefin Tuna Bocaccio ...
a
0 NaN NaN NaN NaN NaN NaN 10 NaN NaN NaN ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
3 NaN NaN NaN NaN 5 NaN NaN 3 NaN NaN ...
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Upvotes: 0
Reputation: 26676
something similar to @Manakin
Turn Fish Count
int list
df['Fish Count']=df['Fish Count'].str.split(',')
Explode to separate each fish with its id
df2=df.explode('Fish Count')
Create dictionary. Here, I use list comprehension to derive key and value after splitting values in Fish Count
by white space after digit
{i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}
Outcome
{'38': 'Sand Bass',
' 16': 'Sculpin',
' 10': 'Blacksmith',
'138': 'Sculpin',
' 28': 'Sand Bass',
'150': 'Sculpin Released',
' 102': 'Sculpin',
' 40': 'Sanddab',
'156': 'Sculpin',
' 29': 'Sand Bass',
' 5': 'Black Croaker',
'161': 'Sculpin'}
Can print if needed
print(pd.DataFrame.from_dict({i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}, orient='index'))
0
38 Sand Bass
16 Sculpin
10 Blacksmith
138 Sculpin
28 Sand Bass
150 Sculpin Released
102 Sculpin
40 Sanddab
156 Sculpin
29 Sand Bass
5 Black Croaker
161 Sculpin
Upvotes: 2
Reputation: 5183
First you need to explode the lists you made and then you can use extract with regex twice, once to match numbers and then match text.
With the data
data = '38 Sand Bass, 16 Sculpin, 10 Blacksmith\n138 Sculpin, 28 Sand Bass\n150 Sculpin Released, 102 Sculpin, 40 Sanddab\n156 Sculpin, 29 Sand Bass, 5 Black Croaker\n161 Sculpin'
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])
Do
countsdf = df['Fish Count'].str.split(', ')
countsdf = countsdf.explode('Fish Count').rename('fish').to_frame()
countsdf['count'] = countsdf.fish.str.extract('([0-9]+)')
countsdf['species'] = countsdf.fish.str.extract('([a-zA-Z]+[ a-zA-Z]*)')
countsdf.drop('fish', axis=1, inplace=True)
Output
count species
0 38 Sand Bass
1 16 Sculpin
2 10 Blacksmith
3 138 Sculpin
4 28 Sand Bass
5 150 Sculpin Released
6 102 Sculpin
7 40 Sanddab
8 156 Sculpin
9 29 Sand Bass
10 5 Black Croaker
11 161 Sculpin
Upvotes: 1