Reputation: 677
I have a two-column data in a text file, eg as follows.
Balkrishna Industries Ltd. Auto Ancillaries 3.54
Aurobindo Pharma Ltd. Pharmaceuticals 3.36
NIIT Technologies Ltd. Software 3.31
Sonata Software Ltd. Software 3.21
When I tried to read this in Pandas, I get an error as the space is a delimiter, and the company names are not restricted to a single column. How do I modify my code to separate this data into two columns, one for the name and one for the number?
import numpy as np
import pandas as pd
data = pd.read_csv('file.txt', sep=" ", header=None)
data.columns = ["Name", "Fraction"]
print(data)
Upvotes: 3
Views: 1940
Reputation: 1858
Use a "char-space-digit" separator:
import pandas as pd
df = pd.read_csv("mycsv.txt", sep="\w\s\d", engine="python", names=["Name", "Fraction"])
print(df)
Name Fraction
0 Balkrishna Industries Ltd. Auto Ancillarie 0.54
1 Aurobindo Pharma Ltd. Pharmaceutical 0.36
2 NIIT Technologies Ltd. Softwar 0.31
3 Sonata Software Ltd. Softwar 0.21
Upvotes: 1
Reputation: 25259
Just read it in as one-column dataframe as this sample:
df:
name
0 Balkrishna Industries Ltd. Auto Ancillaries 3.54
1 Aurobindo Pharma Ltd. Pharmaceuticals 3.36
2 NIIT Technologies Ltd. Software 3.31
3 Sonata Software Ltd. Software 3.21
After that just call str.rpartition
on df.name
and drop the blank column as follows:
df.name.str.rpartition().drop(1, 1).set_axis(["Name", "Fraction"], axis=1, inplace=False)
Out[1594]:
Name Fraction
0 Balkrishna Industries Ltd. Auto Ancillaries 3.54
1 Aurobindo Pharma Ltd. Pharmaceuticals 3.36
2 NIIT Technologies Ltd. Software 3.31
3 Sonata Software Ltd. Software 3.21
Upvotes: 0
Reputation: 18647
Another approach, read the file in as one column (use a sep
character that doesn't exist in the file - such as |
).
Then use Series.str.rsplit
, with n=1
and expand=True
arguments, to split the string, on whitespace, from the right, with only 1 partition, returned as a DataFrame
with 2 columns:
df = pd.read_csv('file.txt', sep='|', header=None)
df = df[0].str.rsplit(' ', n=1, expand=True)
df.columns = ["Name", "Fraction"]
[out]
Name Fraction
0 Balkrishna Industries Ltd. Auto Ancillaries 3.54
1 Aurobindo Pharma Ltd. Pharmaceuticals 3.36
2 NIIT Technologies Ltd. Software 3.31
3 Sonata Software Ltd. Software 3.21
Upvotes: 2
Reputation: 82785
Using Regex Lookbehind & Lookahead sep="(?<=\w) (?=\d)"
Ex:
import pandas as pd
df = pd.read_csv(filename, sep="(?<=\w) (?=\d)", names=["Name", "Fraction"])
print(df)
Output:
Name Fraction
0 Balkrishna Industries Ltd. Auto Ancillaries 3.54
1 Aurobindo Pharma Ltd. Pharmaceuticals 3.36
2 NIIT Technologies Ltd. Software 3.31
3 Sonata Software Ltd. Software 3.21
Upvotes: 7