Bravo
Bravo

Reputation: 677

Reading string data separated by spaces in Pandas

I have a two-column data in a text file, eg as follows.

Balkrishna Industries Ltd. Auto Ancillaries 3.54
Aurobindo Pharma Ltd. Pharmaceuticals 3.36
NIIT Technologies Ltd. Software 3.31
Sonata Software Ltd. Software 3.21

When I tried to read this in Pandas, I get an error as the space is a delimiter, and the company names are not restricted to a single column. How do I modify my code to separate this data into two columns, one for the name and one for the number?

import numpy as np
import pandas as pd

data = pd.read_csv('file.txt', sep=" ", header=None)
data.columns = ["Name", "Fraction"]

print(data)

Upvotes: 3

Views: 1940

Answers (4)

Lante Dellarovere
Lante Dellarovere

Reputation: 1858

Use a "char-space-digit" separator:

import pandas as pd


df = pd.read_csv("mycsv.txt", sep="\w\s\d", engine="python", names=["Name", "Fraction"])
print(df)

                                         Name  Fraction
0  Balkrishna Industries Ltd. Auto Ancillarie      0.54
1        Aurobindo Pharma Ltd. Pharmaceutical      0.36
2              NIIT Technologies Ltd. Softwar      0.31
3                Sonata Software Ltd. Softwar      0.21

Upvotes: 1

Andy L.
Andy L.

Reputation: 25259

Just read it in as one-column dataframe as this sample:

df:
                                               name
0  Balkrishna Industries Ltd. Auto Ancillaries 3.54
1        Aurobindo Pharma Ltd. Pharmaceuticals 3.36
2              NIIT Technologies Ltd. Software 3.31
3                Sonata Software Ltd. Software 3.21

After that just call str.rpartition on df.name and drop the blank column as follows:

df.name.str.rpartition().drop(1, 1).set_axis(["Name", "Fraction"], axis=1, inplace=False)

Out[1594]:
                                          Name Fraction
0  Balkrishna Industries Ltd. Auto Ancillaries     3.54
1        Aurobindo Pharma Ltd. Pharmaceuticals     3.36
2              NIIT Technologies Ltd. Software     3.31
3                Sonata Software Ltd. Software     3.21

Upvotes: 0

Chris Adams
Chris Adams

Reputation: 18647

Another approach, read the file in as one column (use a sep character that doesn't exist in the file - such as |).

Then use Series.str.rsplit, with n=1 and expand=True arguments, to split the string, on whitespace, from the right, with only 1 partition, returned as a DataFrame with 2 columns:

df = pd.read_csv('file.txt', sep='|', header=None)

df = df[0].str.rsplit(' ', n=1, expand=True)
df.columns = ["Name", "Fraction"]

[out]

                                          Name     Fraction
0  Balkrishna Industries Ltd. Auto Ancillaries         3.54
1        Aurobindo Pharma Ltd. Pharmaceuticals         3.36
2              NIIT Technologies Ltd. Software         3.31
3                Sonata Software Ltd. Software         3.21

Upvotes: 2

Rakesh
Rakesh

Reputation: 82785

Using Regex Lookbehind & Lookahead sep="(?<=\w) (?=\d)"

Ex:

import pandas as pd

df = pd.read_csv(filename, sep="(?<=\w) (?=\d)", names=["Name", "Fraction"])
print(df)

Output:

                                          Name  Fraction
0  Balkrishna Industries Ltd. Auto Ancillaries      3.54
1        Aurobindo Pharma Ltd. Pharmaceuticals      3.36
2              NIIT Technologies Ltd. Software      3.31
3                Sonata Software Ltd. Software      3.21

Upvotes: 7

Related Questions