Reputation: 57
I have a dataframe with no column headers containing strings of information. I would like to select part of the values in the cells based on a substring in each of the cells, and include these elements in a list. The examples I've found listing an answer reference a single (named) column in a dataframe, and can easily apply loc/iloc to get the str attribute of the dataframe
The dataframe looks like below:
0 A[3,5] A[6,7] C[3,9]
1 B[2,9] A[2,1] B[3,7]
2 B[5,6]
From the dataframe, I would like to extract all coordinates included in brackets which are listed as A in a list (as a string). The resulting list should be:
[3,5, 6,7, 2,1]
I have read data from a csv and included it in a dataframe. I have tried two approached: first, constructing a dataframe with the same dimensions and filling it with 1s where the letter A is present.
Secondly, I've tried to find the location of the occurrence of string "A" in the dataframe, and extracting the part between brackets in one go. In both approaches, I'm trying to use loc/iloc to reference the columns of the dataframe, but get an AttributeError: 'DataFrame' object has no attribute 'str'
. Am I thinking in the right way, or is there a more efficient manner to solve my problem?
Update: I have been able to stack the dataframe into a series and am only left with the values where "A" is contained. Now I'm looking to isolate the coordinates based on the substring between brackets. The last line of the code below yields a TypeError: 'Series' objects are mutable, thus they cannot be hashed
. How can I retrieve the substring enclosed in brackets?
df = pd.read_csv("FILE.csv", header = None)
df = df.fillna('')
s = df.stack()
s = s[s.str.contains("A")]
s = s[s.str.find("["):s.str.rfind("]")]
Upvotes: 2
Views: 167
Reputation: 862641
Error means there is no one column, but more columns. So if all data are in multiple columns use DataFrame.stack
in first step and then processing data by Series.str.replace
:
df = pd.read_csv("FILE.csv", header = None)
s = df.stack()
L = s[s.str.contains("A")].str.replace('[A\[\]]', '').tolist()
print (L)
['3,5', '6,7', '2,1']
Last if want integers in pairs is possible use list comprehension:
L1 = [[int(y) for y in x.split(',')] for x in L]
print (L1)
[[3, 5], [6, 7], [2, 1]]
Or solution by @Vishnudev with pd.eval
what is more safer like eval
:
L2 = s[s.str.contains("A")].str.replace('[A]', '').map(pd.eval).explode().tolist()
print (L2)
[3, 5, 6, 7, 2, 1]
Another idea with Series.str.extractall
:
L2 = s[s.str.contains("A")].str.extractall('(\d+)')[0].astype(int).tolist()
print (L2)
[3, 5, 6, 7, 2, 1]
EDIT:
In your solution is necessary remove fillna
for possible removed missing values by stack
:
print (df)
0 1 2
0 A[3,5] A[6,7] C[3,9]
1 B[2,9] A[2,1] B[3,7]
2 B[5,6] NaN NaN
s = df.stack()
print (s)
0 0 A[3,5]
1 A[6,7]
2 C[3,9]
1 0 B[2,9]
1 A[2,1]
2 B[3,7]
2 0 B[5,6]
dtype: object
s = s[s.str.contains("A")]
print (s)
0 0 A[3,5]
1 A[6,7]
1 1 A[2,1]
dtype: object
Last for values between []
is better use Series.str.extract
:
s = s.str.extract(r"\[(.*?)\]", expand=False)
print (s)
0 0 3,5
1 6,7
1 1 2,1
dtype: object
Your solution is possibl with list comprehension:
a = [x[x.find("[")+1:x.rfind("]")] for x in s]
print (a)
['3,5', '6,7', '2,1']
Upvotes: 5