Reputation: 25
I want to transform a dataset (or create a new one) so that a text column is split into sequences of a pre-defined length, padded where necessary, with each sequence keeping the label of its source row. The example below should demonstrate what I mean.
I was able to manually create a new dataframe based on ngrams. This is obviously computationally expensive and creates many columns with repetitive words.
text labels
0 from dbl visual com david b lewis subject comp... 5
1 from johan blade stack urc tue nl johan wevers... 11
2 from mzhao magnus acs ohio state edu min zhao ... 6
3 from lhawkins annie wellesley edu r lee hawkin... 14
4 from seanmcd ac dal ca subject powerpc ruminat... 4
For example, with a sequence length of 4, the dataframe above should become something like this:
text labels
0 from dbl visual com 5
1 david b lewis subject 5
2 comp windows x frequently 5
3 asked questions <PAD> <PAD> 5
4 from johan blade stack 11
5 urc tue nl johan 11
6 wevers subject re <PAD> 11
7 from mzhao magnus acs 6
8 ohio state edu min 6
9 zhao subject composite <PAD> 6
As explained, I was able to create a new dataframe based on ngrams. In theory, I could then drop the overlapping rows again, keeping only every n-th one.
import pandas as pd
from nltk.util import ngrams  # assumption: the original snippet does not show where ngrams comes from

df = pd.read_csv('data.csv')
rows = []
for idx, content in df.iterrows():
    name_words = (i.lower() for i in content.iloc[0].split())
    ngramlis = list(ngrams(name_words, 20))
    # DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
    rows.extend({'words': ng, 'labels': content.iloc[1]} for ng in ngramlis)
longform = pd.DataFrame(rows, columns=['words', 'labels'])
# Join each n-gram tuple back into a space-separated string
longform['text_new'] = longform['words'].apply(' '.join)
This is really bad code, which is why I am quite confident that someone can come up with a better solution.
Thanks in advance!
Upvotes: 0
Views: 617
Reputation: 29742
Use pandas.DataFrame.explode.
Divide the words into evenly sized (and padded) chunks, then explode them into rows:
def chunks(l, n):
    """Yield successive n-sized chunks from l.
    Source: https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n_items = len(l)
    if n_items % n:
        n_pads = n - n_items % n
    else:
        n_pads = 0
    l = l + ['<PAD>' for _ in range(n_pads)]
    for i in range(0, len(l), n):
        yield l[i:i + n]

df['text'] = df['text'].str.split().apply(lambda x: list(chunks(x, 4)))
df = df.explode('text').reset_index(drop=True)
df['text'] = df['text'].apply(' '.join)
print(df)
Output:
text labels
0 from dbl visual com 5
1 david b lewis subject 5
2 comp <PAD> <PAD> <PAD> 5
3 from johan blade stack 11
4 urc tue nl johan 11
5 wevers <PAD> <PAD> <PAD> 11
6 from mzhao magnus acs 6
7 ohio state edu min 6
8 zhao <PAD> <PAD> <PAD> 6
9 from lhawkins annie wellesley 14
10 edu r lee hawkin 14
11 from seanmcd ac dal 4
12 ca subject powerpc ruminat 4
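For anyone who wants to try this without the original data.csv, here is a minimal self-contained sketch of the same chunk, pad and explode idea. The names pad_chunks, sample and result are made up for illustration, and the two-row frame is just a stand-in for the question's data, so treat it as a starting point rather than a drop-in solution.

```python
import pandas as pd


def pad_chunks(words, size, pad="<PAD>"):
    """Split a list of words into lists of exactly `size` items, padding the last one."""
    remainder = len(words) % size
    if remainder:
        words = words + [pad] * (size - remainder)
    return [words[i:i + size] for i in range(0, len(words), size)]


# Hypothetical sample data standing in for the question's data.csv.
sample = pd.DataFrame({
    "text": ["from dbl visual com david b lewis subject comp",
             "from johan blade stack"],
    "labels": [5, 11],
})

# Each cell becomes a list of chunks; explode then turns every chunk into its own row
# and repeats the label of the source row.
chunked = sample["text"].str.split().apply(lambda words: pad_chunks(words, 4))
result = sample.assign(text=chunked).explode("text").reset_index(drop=True)

# Join each chunk (a list of words) back into a single string.
result["text"] = result["text"].str.join(" ")
print(result)
```

The sequence length and the pad token are ordinary arguments here, so switching to length 20 or a different padding symbol only means changing the call to pad_chunks.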
Upvotes: 1