Reputation: 65
I have a bunch of txt files that I need to compile into a single master file. I use read_csv
to extract the information inside. There are some rows to drop, and I was wondering whether it's possible to use the skiprows
feature without specifying the index numbers of the rows I want to drop, and instead tell it which rows to drop according to their content/value. Here's what the data looks like, to illustrate my point.
Index Column 1 Column 2
0 Rows to drop Rows to drop
1 Rows to drop Rows to drop
2 Rows to drop Rows to drop
3 Rows to keep Rows to keep
4 Rows to keep Rows to keep
5 Rows to keep Rows to keep
6 Rows to keep Rows to keep
7 Rows to drop Rows to drop
8 Rows to drop Rows to drop
9 Rows to keep Rows to keep
10 Rows to drop Rows to drop
11 Rows to keep Rows to keep
12 Rows to keep Rows to keep
13 Rows to drop Rows to drop
14 Rows to drop Rows to drop
15 Rows to drop Rows to drop
What is the most effective way to do this?
Upvotes: 2
Views: 12150
Reputation: 8508
Is this what you want to achieve?
import pandas as pd
df = pd.DataFrame({'A': ['row 1', 'row 2', 'drop row', 'row 4', 'row 5',
                         'drop row', 'row 6', 'row 7', 'drop row', 'row 9']})
# Keep only the rows whose value in column A is not 'drop row'
df1 = df[df['A'] != 'drop row']
print(df)
print(df1)
Original Dataframe:
A
0 row 1
1 row 2
2 drop row
3 row 4
4 row 5
5 drop row
6 row 6
7 row 7
8 drop row
9 row 9
New DataFrame with rows dropped:
A
0 row 1
1 row 2
3 row 4
4 row 5
6 row 6
7 row 7
9 row 9
While read_csv cannot skip rows based on their content, it can skip rows based on their line index via skiprows. Here are some options:
df = pd.read_csv('xyz.csv', skiprows=2)
# this skips the first 2 lines of the file
df = pd.read_csv('xyz.csv', skiprows=[0, 2, 5])
# this skips the 1st, 3rd, and 6th lines of the file
# (line numbers are 0-indexed, so 0 is the 1st line)
# You can also skip by a rule on the line number.
# The example below skips line 0 and every 5th line after it (0, 5, 10, ...)
def check_row(a):
    # True means "skip this line"
    return a % 5 == 0
df = pd.read_csv('xyz.txt', skiprows=check_row)
More details of this can be found in this link about skip rows
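If the unwanted rows really have to be excluded before parsing, one workaround is to filter the raw text lines by content and hand only the remaining lines to read_csv. The following is a minimal sketch, not a built-in pandas feature: the helper name read_csv_skipping, the predicate, and the sample data are all made up for illustration.

```python
import io
import pandas as pd

def read_csv_skipping(lines, should_skip, **kwargs):
    """Hypothetical helper: drop raw text lines for which should_skip(line)
    is True, then parse whatever is left with pd.read_csv."""
    kept = [line for line in lines if not should_skip(line)]
    return pd.read_csv(io.StringIO("".join(kept)), **kwargs)

# Made-up sample standing in for the contents of one txt file.
sample = io.StringIO(
    "A,B\n"
    "drop row,1\n"
    "row 1,2\n"
    "drop row,3\n"
    "row 2,4\n"
)

# Skip every line whose text starts with 'drop row'.
df_filtered = read_csv_skipping(sample, lambda line: line.startswith("drop row"))
print(df_filtered)
```

With a real file, the same helper could be called as read_csv_skipping(open_file, predicate) inside a with open(...) block, since iterating over an open text file also yields one line at a time. Note that the predicate sees the raw line, so it has to match the text as it appears in the file, before any column splitting.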
Upvotes: 4
Reputation: 414
No. skiprows will not allow you to drop based on the row content/value.
Based on Pandas Documentation:
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
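To make the quoted behaviour concrete, here is a small sketch showing that the callable only ever receives integer line numbers, never the text of the line. The sample data and the names skip and seen are invented for this illustration.

```python
import io
import pandas as pd

# Made-up data: line 0 is the header, lines 2 and 4 are the unwanted ones.
data = "A\nrow 1\ndrop row\nrow 3\ndrop row\n"

seen = []  # records every argument pandas passes to the callable

def skip(i):
    seen.append(i)
    # Only the line number is available here, so the decision
    # has to be based on position, not on content.
    return i in [2, 4]

df_idx = pd.read_csv(io.StringIO(data), skiprows=skip)
print(df_idx)
print(seen)
```

Because the callable has no access to the row text, it can only drop 'drop row' lines when their positions are already known.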
Upvotes: 1
Reputation: 354
Since you cannot do that using skiprows, the most efficient way I can think of is to read the file first and then filter the rows:
df = pd.read_csv(filePath)
df = df.loc[df['column1']=="Rows to keep"]
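Extending this to the original goal of compiling several files into one master file, a rough sketch might look like the following. The function name build_master and the in-memory sample sources are assumptions for illustration; with real data, sources would more likely be a list of file paths (for example from glob.glob('*.txt')), and read_csv may need a sep argument depending on how the txt files are delimited.

```python
import io
import pandas as pd

def build_master(sources, column, keep_value):
    """Hypothetical helper: read each source, keep only the rows where
    `column` equals `keep_value`, and stack the results into one DataFrame."""
    frames = []
    for src in sources:
        df = pd.read_csv(src)
        frames.append(df.loc[df[column] == keep_value])
    # ignore_index gives the master file a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)

# In-memory stand-ins for two txt files (made-up data).
sources = [
    io.StringIO("column1,column2\nRows to drop,a\nRows to keep,b\n"),
    io.StringIO("column1,column2\nRows to keep,c\nRows to drop,d\nRows to keep,e\n"),
]

master = build_master(sources, "column1", "Rows to keep")
print(master)
```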
Upvotes: 1