user13672551

Reputation: 25

Python: Split string into two columns by more than one separator

I am importing data from a CSV file and want to split the column 'topThemes' into an array/dataframe with two columns.
In the first column I want the name of the theme (e.g. Biology), and in the second column its associated score (e.g. 62).
When I import the column it is stored in this format:

Biology: 62\n
Economics: 12\n
Physics: 4\n
Chemistry: 8\n
and so on.

My current code and the error are shown below.

Code:

df = pd.read_csv(r'myfilelocation')

split = [line.split(': ') for line in df['topThemes'].split('\n')]

Error:

AttributeError("'Series' object has no attribute 'split'")

CSV file being imported:

My csv file

How I want it to look:

Ideal format

Thanks for any help / responses.

Upvotes: 2

Views: 121

Answers (1)

Terry Spotts

Reputation: 4035

Use the sep parameter of read_csv() to specify the delimiter and the names parameter to specify the column names:

df = pd.read_csv(r'myfilelocation', sep=':', names=['topThemes', 'score'])

Documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Oh, I see your source CSV file now...
There's probably a cleaner way to do this in fewer steps, but I think this produces your requested output:

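import pandas as pd

# Read only the column that holds the newline-separated "theme: score" text
data = pd.read_csv(r'myfilelocation', usecols=['topThemes'])
# Split each cell on newlines, then stack so every entry gets its own row
data = pd.DataFrame(data['topThemes'].str.split('\n').values.tolist()).stack().to_frame(name='raw')

df = pd.DataFrame()
# Split each entry on the colon into a theme part and a score part
df[['topTheme', 'score']] = data['raw'].apply(lambda x: pd.Series(str(x).split(":")))
# Drop rows without a score (e.g. empty strings left by a trailing newline)
df.dropna(inplace=True)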
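
For reference, the AttributeError in the question comes from calling .split directly on the Series; pandas exposes string methods through the .str accessor instead. A more compact variant of the steps above might look like the sketch below. It assumes every non-empty entry has the form "theme: score" with an integer score, as in the question's sample. The helper name split_themes and the in-memory sample Series are illustrative stand-ins, not part of the original CSV.

```python
import pandas as pd

def split_themes(series):
    """Turn newline-separated 'theme: score' strings into a two-column frame.

    Hypothetical helper; assumes every non-empty entry contains a colon
    followed by an integer score.
    """
    # One list of entries per cell, then one row per entry
    entries = series.str.split('\n').explode().str.strip()
    # Discard missing values and empty strings left by trailing newlines
    entries = entries[entries.notna() & (entries != '')]
    # Split each entry once on the colon; expand=True yields separate columns
    out = entries.str.split(':', n=1, expand=True)
    out.columns = ['topTheme', 'score']
    out['topTheme'] = out['topTheme'].str.strip()
    out['score'] = out['score'].str.strip().astype(int)
    return out.reset_index(drop=True)

# Stand-in for df['topThemes'] as read from the CSV (values from the question)
sample = pd.Series(['Biology: 62\nEconomics: 12\n', 'Physics: 4\nChemistry: 8\n'])
result = split_themes(sample)
print(result)
```

With a real file, the same helper could be applied to the column returned by read_csv, e.g. split_themes(pd.read_csv(r'myfilelocation')['topThemes']), provided the column follows the format assumed above.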

Upvotes: 1
