tumbleweed
tumbleweed

Reputation: 4630

How to transform several lists of words to a pandas dataframe?

I have file .txt that contains a list of words like this:

5.91686268506 exclusively, catering, provides, arms, georgia, formal, purchase, choose
5.91560417296 hugh, senlis
5.91527936181 italians
5.91470429433 soil, cultivation, fertile
5.91468087491 increases, moderation
....
5.91440227412 farmers, descendants

I would like to transform such data into a pandas table that I expect to show into a html/bootstrap template as follows (*):

COL_A         COL_B
5.91686268506 exclusively, catering, provides, arms, georgia, formal, purchase, choose
5.91560417296 hugh, senlis
5.91527936181 italians
5.91470429433 soil, cultivation, fertile
5.91468087491 increases, moderation
....
5.91440227412 farmers, descendants

So I tried the following with pandas:

import pandas as pd
df = pd.read_csv('file.csv', 
                 sep = ' ', names=['Col_A', 'Col_B'])
df.head(20)

However, my table doesnt have the above desired sructure:

                                                                                                                                COL_A   COL_B
6.281426    engaged,    chance,     makes,  meeting,    nations,    things,     believe,    tries,  believing,  knocked,    admits,     awkward
6.277438    sweden  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
6.271190    artificial,     ammonium    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
6.259790    boats,  prefix  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
6.230612    targets,    tactical,   wing,   missile,    squadrons   NaN     NaN     NaN     NaN     NaN     NaN     NaN

Any idea of how to get the data as the (*) tabular format?

Upvotes: 2

Views: 65

Answers (1)

akuiper
akuiper

Reputation: 214927

Because you have space between words and if you specify space as delimiter, it will naturally separate them. To get what you need, you can try to set the sep as a regular expression (?<!,), ?<! is a negative look behind syntax, which means separate on space only when it is not preceded by a comma and it should work for your case:

pd.read_csv("~/test.csv", sep = "(?<!,) ", names=['weight', 'topics'])

#     weight    topics
#0  5.916863    exclusively, catering, provides, arms, georgia...
#1  5.915604    hugh, senlis
#2  5.915279    italians
#3  5.914704    soil, cultivation, fertile
#4  5.914681    increases, moderation
#5  5.914402    farmers, descendants

Upvotes: 6

Related Questions