Jannat Arora

Reputation: 2989

print particular lines

I have data in the following format:

     1     "hi"
     2     "hello"
     3 "abc"
     4-"def"
     5(-hjs
     6     "there" 
     abc"    "def"
     7     "there1"

A tab separates 1 and "hi", and another tab separates 2 and "hello", whereas there is no tab between 3 and "abc". The same is true of 4-"def" and 5(-hjs.

I want to delete every line in which a tab does not separate a number from a string. The output should look like this:

     1     "hi"
     2     "hello"
     6     "there" 
     7     "there1"

I tried to keep only the lines that start with a number using grep '^ *[0-9]' (it removes the abc line, but not the rest). However, it deletes all the lines. Is it possible to delete only the specified lines using a Linux command or Python?

I also tried to do it in Python by checking whether the line can be split, but split does not work for patterns of the form "abc" def"

I am using a tab ('\t') as the separator, so how do I incorporate that? Could you also please explain the solution a bit?

Upvotes: 0

Views: 98

Answers (4)

Jotne

Reputation: 41456

Using awk:

awk '/^[0-9]+\t/' file

This prints only the lines that start with one or more digits ([0-9]+) followed by a tab (\t).
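Since the question also asks about Python and mentions trouble with split, here is a minimal sketch of roughly the same test without a regular expression. It swaps in str.partition, which cuts the line at the first tab only. The function name starts_with_number_then_tab is made up for this illustration, and, like the awk command, it assumes the number sits at the very start of the line.

```python
def starts_with_number_then_tab(line):
    """Return True if the text before the first tab is a run of ASCII digits."""
    # partition() splits at the first tab; sep is '' when the line has no tab.
    head, sep, _rest = line.partition('\t')
    return sep == '\t' and head.isascii() and head.isdigit()


# Sample lines modelled on the question, written with explicit \t characters.
lines = ['1\t"hi"', '3 "abc"', '4-"def"', 'abc"\t"def"', '7\t"there1"']

for line in lines:
    if starts_with_number_then_tab(line):
        print(line)
```

Only the lines beginning with 1 and 7 are printed; abc"<tab>"def" contains a tab but is rejected because the part before it is not a number.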

Upvotes: 1

Inbar Rose

Reputation: 43447

Use regular expressions:

s = """
1     "hi"
2     "hello"
3 "abc"
4-"def"
5(-hjs
6     "there" 
abc"    "def"
7     "there1"
"""

import re

for line in s.splitlines():
    if not line:
        continue # skip empty lines
    if re.match(r'^\d\t\S+', line):
        print(line)

Output:

>>> 
1     "hi"
2     "hello"
6     "there" 
7     "there1"

Explanation:

The regular expression pattern tries to match the line.

  • ^ : This means the start of the string (or line)
  • \d : This means match a single digit character (use \d+ if the number can have more than one digit)
  • \t : This means match a tab character.
  • \S+ : This means match a non-white-space character at least once

If the tabs in your file have been converted to spaces, you could change the regular expression to something like this: r'^\d\s{4,}\S+'

Here \s{4,} takes the place of \t and matches four or more consecutive white-space characters (four spaces being a common width when tabs are expanded).

You could also combine the two into a regular expression that handles both cases: r'^\d(\t|\s{4,})\S+' The group looks for either \t or \s{4,}, which covers all your bases.
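To make the difference between the three patterns concrete, here is a small sketch that runs each of them against three sample lines. The sample lines and the helper name matches are assumptions made for this illustration: one line uses a real tab, one has the tab expanded to five spaces, and one has only a single space.

```python
import re

# The three patterns quoted in this answer.
TAB_ONLY = re.compile(r'^\d\t\S+')
SPACES_ONLY = re.compile(r'^\d\s{4,}\S+')
EITHER = re.compile(r'^\d(\t|\s{4,})\S+')

# Hypothetical sample lines, keyed by the kind of separator they contain.
samples = {
    "tab": '1\t"hi"',
    "spaces": '2     "hello"',
    "single_space": '3 "abc"',
}


def matches(pattern):
    """Return the names of the sample lines that the pattern accepts."""
    return [name for name, line in samples.items() if pattern.match(line)]


print(matches(TAB_ONLY))     # ['tab']
print(matches(SPACES_ONLY))  # ['spaces']
print(matches(EITHER))       # ['tab', 'spaces']
```

None of the patterns accepts the single-space line, which is the behaviour the question asks for.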

Upvotes: 2

Birei

Reputation: 36272

If your version of grep supports Perl regular expression syntax (the -P option), you can use it like this:

grep -P '^\d+\t+\S+' infile

It matches, from the beginning of the line (^), a number (\d+) followed by one or more tabs (\t+) followed by one or more non-space characters (\S+).

It yields:

1   "hi"
2   "hello"
6   "there" 
7   "there1"
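For the Python half of the question, the same pattern can be used with the re module. The following is only a sketch: the function name keep_tabbed_lines and the in-memory sample list are assumptions for illustration, and the last sample line is an extra one (not in the question) added to show that \d+ also accepts multi-digit numbers.

```python
import re

# Same pattern as the grep -P command: digits, one or more tabs,
# then at least one non-whitespace character.
PATTERN = re.compile(r'^\d+\t+\S+')


def keep_tabbed_lines(lines):
    """Return only the lines where a tab separates the number from the text."""
    return [line for line in lines if PATTERN.match(line)]


# Sample data modelled on the question, with explicit \t characters.
data = [
    '1\t"hi"',
    '2\t"hello"',
    '3 "abc"',
    '4-"def"',
    '5(-hjs',
    '6\t"there" ',
    'abc"\t"def"',
    '7\t"there1"',
    '12\t"multi-digit"',  # hypothetical extra line, not in the question
]

for line in keep_tabbed_lines(data):
    print(line)
```

To filter a real file, the list could be replaced by the lines of an open file object, with the trailing newline stripped before printing.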

Upvotes: 2

Bartosz Marcinkowski

Reputation: 6861

Try

grep '^[0-9][0-9]*\s\{4\}' file

(provided that you use 4 spaces for tabulation, as in the example you pasted).

Upvotes: 1
