Reputation: 1432
This is from an R guy.
I have this mess in a Pandas column: data['crew']
.
array(["[{'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production', 'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor', 'profile_path': None}, {'credit_id': '56407fa89251417055000b58', 'department': 'Sound', 'gender': 0, 'id': 6745, 'job': 'Music Editor', 'name': 'Richard Henderson', 'profile_path': None}, {'credit_id': '5789212392514135d60025fd', 'department': 'Production', 'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production', 'name': 'Jeffrey Stott', 'profile_path': None}, {'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up', 'gender': 0, 'id': 23783, 'job': 'Makeup Artist', 'name': 'Heather Plott', 'profile_path': None}
It goes on for quite some time. Each new dict starts with a credit_id
field. One sell can hold several dicts in an array.
Assume I want the names of all Casting
directors, as shown in the first entry. I need to check check the job
entry in every dict and, if it's Casting
, grab what's in the name
field and store it in my data frame in data['crew']
.
I tried several strategies, then backed off and went for something simple. Running the following shut me down, so I can't even access a simple field. How can I get this done in Pandas.
for row in data.head().iterrows():
if row['crew'].job == 'Casting':
print(row['crew'])
EDIT: Error Message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-138-aa6183fdf7ac> in <module>()
1 for row in data.head().iterrows():
----> 2 if row['crew'].job == 'Casting':
3 print(row['crew'])
TypeError: tuple indices must be integers or slices, not str
EDIT: Code used to get the array of dict (strings?) in the first place.
def convert_JSON(data_as_string):
try:
dict_representation = ast.literal_eval(data_as_string)
return dict_representation
except ValueError:
return []
data["crew"] = data["crew"].map(lambda x: sorted([d['name'] if d['job'] == 'Casting' else '' for d in convert_JSON(x)])).map(lambda x: ','.join(map(str, x))
Upvotes: 0
Views: 46
Reputation: 30971
To create a DataFrame from your sample data, write:
df = pd.DataFrame(data=[
{ 'credit_id': '54d5356ec3a3683ba0000039', 'department': 'Production',
'gender': 1, 'id': 494, 'job': 'Casting', 'name': 'Terri Taylor',
'profile_path': None},
{ 'credit_id': '56407fa89251417055000b58', 'department': 'Sound',
'gender': 0, 'id': 6745, 'job': 'Music Editor',
'name': 'Richard Henderson', 'profile_path': None},
{ 'credit_id': '5789212392514135d60025fd', 'department': 'Production',
'gender': 2, 'id': 9250, 'job': 'Executive In Charge Of Production',
'name': 'Jeffrey Stott', 'profile_path': None},
{ 'credit_id': '57892074c3a36835fa002886', 'department': 'Costume & Make-Up',
'gender': 0, 'id': 23783, 'job': 'Makeup Artist',
'name': 'Heather Plott', 'profile_path': None}])
Then you can get your data with a single instruction:
df[df.job == 'Casting'].name
The result is:
0 Terri Taylor
Name: name, dtype: object
The above result is Pandas Series object with names found.
In this case, 0
is the index value for the record found and
Terri Taylor
is the name of (the only in your data) Casting Director.
If you want just a list (not Series), write:
df[df.job == 'Casting'].name.tolist()
The result is ['Terri Taylor']
- just a list.
I think, both my solutions should be quicker than "ordinary" loop
based on iterrows()
.
Checking the execution time, you may try also yet another solution:
df.query("job == 'Casting'").name.tolist()
==========
And as far as your code is concerned:
iterrows()
returns each time a pair containing:
So your loop should look something like:
for row in df.iterrows():
if row[1].job == 'Casting':
print(row[1]['name'])
You can not write row[1].name
because it refers to the index value
(here we have a collision with default attributes of the named tuple).
Upvotes: 1