Reputation: 761
I have data encoded as strings in one DataFrame column:
id data
0 a 2;0;4208;1;790
1 b 2;0;768;1;47
2 c 2;0;92;1;6
3 d 1;0;341
4 e 3;0;1;2;6;4;132
5 f 3;0;1;1;6;3;492
The data represents counts of how many times certain events happened in our system. There can be 256 different events (each has a numerical id assigned from the range 0-255). Since usually only a few events happen in one measurement period, it doesn't make sense to store all the zeros. That's why the data is encoded as follows: the first number tells how many events happened during the measurement period, then each pair contains an event_id and its counter.
For example:
"3;0;1;1;6;3;492" means:
I need to decode the data into separate columns. The expected result is a DataFrame which looks like this:
id data_0 data_1 data_2 data_3 data_4
0 a 4208.0 790.0 0.0 0.0 0.0
1 b 768.0 47.0 0.0 0.0 0.0
2 c 92.0 6.0 0.0 0.0 0.0
3 d 341.0 0.0 0.0 0.0 0.0
4 e 1.0 0.0 6.0 0.0 132.0
5 f 1.0 6.0 0.0 492.0 0.0
I came up with the following function to do it:
def split_data(data: pd.Series):
    tmp = data.str.split(';', expand=True).astype('Int32').fillna(-1)
    tmp = tmp.apply(
        lambda row: {'{0}_{1}'.format(data.name, row[i*2-1]): row[i*2] for i in range(1, row[0]+1)},
        axis='columns',
        result_type='expand').fillna(0)
    return tmp
df = pd.concat([df, split_data(df.pop('data'))], axis=1)
The problem is that I have millions of lines to process and it takes A LOT of time. As I don't have that much experience with pandas, I hope someone can help me find a more efficient way of performing this task.
Ok, so I took all three answers and performed some benchmarking :). Starting conditions: I already have a DataFrame (this will be important!). As expected, all of them were waaaaay faster than my code. For example, for 15 rows with 1000 repeats in timeit:
Seems like Schalton's code wins!
However... for 1500 rows with 50 repeats:
I decided to check once more, this time only one attempt but for 150 000 rows:
An interesting thing happens: as the size of the DataFrame gets bigger, all versions except Shubham's take much longer! The two fastest are Schalton's and Shubham's versions. This is where the starting point matters! I already have an existing DataFrame, so I have to convert it to a dictionary. The dictionary itself is processed really fast; the conversion, however, takes time. Shubham's solution is more or less independent of size! Schalton's works very well for small data sets, but due to the conversion to dict it gets much slower for a large amount of data. Another comparison, this time 150000 rows with 30 repeats:
However for 15 rows with 30000 repeats:
In the end, the choice between Schalton's version and Shubham's depends on the use case:
As mentioned above, I have data sets of around 1 million rows and more, thus I will go with Shubham's answer.
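For reference, the harness was nothing fancy, just a timeit loop along these lines (decode_fn is a placeholder for each answer's implementation, not a real function name):
import timeit
import pandas as pd

def run_benchmark(decode_fn, source_df: pd.DataFrame, repeats: int) -> float:
    # time `repeats` runs of one decoding implementation on a fresh copy of the data
    return timeit.timeit(lambda: decode_fn(source_df.copy()), number=repeats)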
Upvotes: 1
Views: 107
Reputation: 71689
pairs = df['data'].str.extractall(r'(?<!^)(\d+);(\d+)')
pairs = pairs.droplevel(1).pivot(columns=0, values=1).fillna(0)
df[['id']].join(pairs.add_prefix('data_'))
Extract all pairs using a regex pattern:
        0     1
  match
0 0     0  4208
  1     1   790
1 0     0   768
  1     1    47
2 0     0    92
  1     1     6
3 0     0   341
4 0     0     1
  1     2     6
  2     4   132
5 0     0     1
  1     1     6
  2     3   492
Pivot the pairs to reshape into the desired format:
0     0    1  2    3    4
0  4208  790  0    0    0
1   768   47  0    0    0
2    92    6  0    0    0
3   341    0  0    0    0
4     1    0  6    0  132
5     1    6  0  492    0
Join the reshaped pairs dataframe back with the id column:
id data_0 data_1 data_2 data_3 data_4
0 a 4208 790 0 0 0
1 b 768 47 0 0 0
2 c 92 6 0 0 0
3 d 341 0 0 0 0
4 e 1 0 6 0 132
5 f 1 6 0 492 0
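Note: str.extractall captures the groups as strings, so the pivoted counts end up with object dtype. If you need real integer columns, one extra cast (not part of the snippet above, just an optional addition) takes care of it:
# the captured groups are strings; cast the pivoted counts to integers before joining
pairs = pairs.astype(int)
df[['id']].join(pairs.add_prefix('data_'))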
Upvotes: 1
Reputation: 14238
A much more efficient method is to construct dicts from your data. Do you see how the alternating values in the split string are keys and values? Then apply pd.Series and fillna(0) to get a dataframe with all the required columns for the data. Then you can concat.
Code:
df_data = df['data'].apply(
    lambda x: dict(zip(x.split(';')[1::2], x.split(';')[2::2]))).apply(pd.Series).fillna(0)
df_data.columns = df_data.columns.map('data_{}'.format)
df = pd.concat([df.drop('data', axis=1), df_data], axis=1)
output:
id data_0 data_1 data_2 data_4 data_3
0 a 4208 790 0 0 0
1 b 768 47 0 0 0
2 c 92 6 0 0 0
3 d 341 0 0 0 0
4 e 1 0 6 132 0
5 f 1 6 0 0 492
If you need sorted columns you can just do:
df = df[sorted(df.columns)]
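If that's still slow on a really big frame, a variant of the same idea (just a sketch, not benchmarked here) is to skip apply(pd.Series) and let the DataFrame constructor expand the dicts directly:
# build the dicts with a plain comprehension and expand them in one constructor call
records = [dict(zip(x.split(';')[1::2], map(int, x.split(';')[2::2]))) for x in df['data']]
df_data = pd.DataFrame(records, index=df.index).fillna(0)
df_data.columns = df_data.columns.map('data_{}'.format)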
Upvotes: 1
Reputation: 3104
I'd avoid processing this in pandas. Assuming you have the data in some other format, I'd parse it into a list of dictionaries and then load it into pandas.
import pandas as pd
from typing import Dict

data = {
    "a": "2;0;4208;1;790",
    "b": "2;0;768;1;47",
    "c": "2;0;92;1;6",
    "d": "1;0;341",
    "e": "3;0;1;2;6;4;132",
    "f": "3;0;1;1;6;3;492"
}
def get_event_counts(event_str: str, delim: str = ";") -> Dict[str, int]:
    """
    given an event string return a dictionary of events
    """
    EVENT_COUNT_INDEX = 0
    split_event = event_str.split(delim)
    event_count = int(split_event[EVENT_COUNT_INDEX])
    events = {
        split_event[index*2+1]: int(split_event[index*2+2]) for index in range(event_count)
    }
    return events
data_records = [{"id": k, **get_event_counts(v)} for k,v in data.items()]
print(pd.DataFrame(data_records))
id 0 1 2 4 3
0 a 4208 790.0 NaN NaN NaN
1 b 768 47.0 NaN NaN NaN
2 c 92 6.0 NaN NaN NaN
3 d 341 NaN NaN NaN NaN
4 e 1 NaN 6.0 132.0 NaN
5 f 1 6.0 NaN NaN 492.0
If you're starting from your current df as the input, you could try this:
def process_starting_dataframe(starting_dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    Create a new dataframe from the original input with two columns "id" and "data"
    """
    data_dict = starting_dataframe.T.to_dict()
    data_records = [{"id": i['id'], **get_event_counts(i['data'])} for i in data_dict.values()]
    return pd.DataFrame(data_records)
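For example, to get from there to the layout in the question you would still fill the gaps and prefix the columns, roughly:
# hypothetical follow-up: fill missing events with 0 and prefix the event columns
result = process_starting_dataframe(df)
result = result.fillna(0).rename(columns=lambda c: c if c == "id" else f"data_{c}")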
Upvotes: 1