Reputation: 2989
I have data in the following format:
1 "hi"
2 "hello"
3 "abc"
4-"def"
5(-hjs
6 "there"
abc" "def"
7 "there1"
A tab separates 1 and "hi". Another tab separates 2 and "hello" whereas between 3 and "abc" there is no such separation. Similarly for 4-"def" and 5(-hjs.
I want to delete all those lines where a tab does not separate a number and a string. I want my output to be of the following form.
1 "hi"
2 "hello"
6 "there"
7 "there1"
I tried to keep only those lines which contain numbers using grep '^ *[0-9]' (although it deletes abc, it is not able to delete the rest). However, it deletes all the lines. Is it possible to delete only the specified lines using a Linux command or Python?
I tried to do it in Python by checking whether there is a split or not, but split does not work for patterns of the form "abc" def"
I am using a tab ('\t') for tabulation, so how do I incorporate that? Also, can you please explain it a bit?
Upvotes: 0
Views: 98
Reputation: 41456
Using awk:
awk '/^[0-9]+\t/' file
This prints only the lines that start with one or more digits ([0-9]+) followed by a tab (\t).
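If awk is not at hand, roughly the same check can be sketched in Python without a regex, which also sidesteps the split problem mentioned in the question. This is only a sketch: keep_line is a hypothetical name, and isdigit() is slightly broader than [0-9] because it also accepts non-ASCII digit characters.

```python
def keep_line(line):
    """Return True if the text before the first tab is made up only of digits."""
    # partition() always returns three parts; when the line has no tab,
    # the separator part is an empty string.
    number, tab, _rest = line.partition("\t")
    return tab == "\t" and number.isdigit()


lines = ['1\t"hi"', '3 "abc"', '4-"def"', 'abc"\t"def"', '7\t"there1"']
kept = [line for line in lines if keep_line(line)]
print(kept)
```

Only the lines where a tab directly follows the leading number are kept; a line such as abc" followed by a tab is rejected because the part before the tab is not numeric.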
Upvotes: 1
Reputation: 43447
Use regular expressions:
s = """
1 "hi"
2 "hello"
3 "abc"
4-"def"
5(-hjs
6 "there"
abc" "def"
7 "there1"
"""
import re
for line in s.splitlines():
    if not line:
        continue  # skip empty lines
    if re.match(r'^\d\t\S+', line):
        print(line)  # the parentheses work in both Python 2 and Python 3
Output:
>>>
1 "hi"
2 "hello"
6 "there"
7 "there1"
Explanation:
The regular expression pattern tries to match the line.
^ : This means the start of the string (or line)
\d : This means match a single digit character (use \d+ if the numbers can have more than one digit)
\t : This means match a tab character
\S+ : This means match a non-white-space character at least once
You could change the regular expression to something like this: r'^\d\s{4,}\S+'
That adds a \s{4,}, which means a white-space character at least 4 times (4 being a common width when tabs are expanded to spaces).
You could also combine them into a regular expression that can handle situations where tabs are converted to white-space: r'^\d(\t|\s{4,})\S+'
This adds a group which will look for \t OR \s{4,}, which covers all your bases.
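As a small sketch of the combined pattern in use, the snippet below compiles it once and filters a list of lines. Two things here are my own assumptions rather than part of the pattern above: the helper name filter_lines is hypothetical, and \d has been widened to \d+ so that numbers with more than one digit also pass.

```python
import re

# Assumed variant of the combined pattern: \d+ instead of \d so that
# multi-digit numbers such as 10 are accepted as well.
PATTERN = re.compile(r'^\d+(\t|\s{4,})\S+')


def filter_lines(lines):
    """Keep lines where a number is followed by a tab or by 4+ white-space characters."""
    return [line for line in lines if PATTERN.match(line)]


sample = [
    '1\t"hi"',        # tab separator: kept
    '2    "hello"',   # four spaces: kept
    '3 "abc"',        # single space: dropped
    '4-"def"',        # no separator: dropped
    '5(-hjs',         # no separator: dropped
    'abc" "def"',     # does not start with a number: dropped
    '10\t"there"',    # multi-digit number with a tab: kept
]
print(filter_lines(sample))
```

Compiling the pattern once keeps the loop cheap if the input is large, but for a handful of lines re.match with the raw string works just as well.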
Upvotes: 2
Reputation: 36272
If your grep version supports the Perl regular expression syntax, you can use it like:
grep -P '^\d+\t+\S+' infile
It matches from the beginning of the line (^), a number (\d+) followed by one or more tabs (\t+) followed by one or more non-space characters (\S+).
It yields:
1 "hi"
2 "hello"
6 "there"
7 "there1"
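Since the question also asks about Python, here is a rough equivalent of that grep command that streams a file line by line. It is a sketch under assumptions: grep_tab_lines is a hypothetical name, the file is read with the default text encoding, and Python's \d also accepts non-ASCII digits, so it is not a byte-for-byte match of grep -P.

```python
import re

# Same pattern as the grep -P command: digits, one or more tabs, non-space text.
PATTERN = re.compile(r'^\d+\t+\S+')


def grep_tab_lines(path):
    """Yield the lines of the file at `path` that match PATTERN, without the trailing newline."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if PATTERN.match(line):
                yield line
```

It can be used as `for line in grep_tab_lines("infile"): print(line)`, where infile is the same input file as in the grep command.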
Upvotes: 2
Reputation: 6861
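Try
grep '^[0-9]\+\s\{4\}' file
(provided that you use 4 spaces for tabulation, as in the example you pasted). Using \+ instead of * requires at least one digit at the start of the line; both \+ and \s are GNU grep extensions.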
Upvotes: 1