Reputation: 122052

Remove lines with a tab in them

How can I remove lines that contain a tab?

I've a file that looks like this:

0   absinth
Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

1   acidophilus milk
Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

2   adobo
Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.

The desired output has the lines with tabs removed, i.e.:

Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.

I could do the following in Python to achieve the same result:

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
  for line in fin:
    if '\t' in line:
      continue
    else:
      fout.write(line)

But I have millions of lines and it's not that efficient. So I tried this instead: keep only the first tab-separated field with cut, then drop the lines that are left with a single character:

$ cut -f1 WIKI_WN_food | awk 'length>1' | less

What is a more Pythonic way to get the desired output?

Is there a more efficient way than the cut + awk piping I've shown above?

Upvotes: 0

Answers (6)

repzero

Reputation: 8412

You can do this with sed:

sed '/\t/d' 'my_file'

This looks for "\t" (a tab; GNU sed recognizes this escape, other implementations may not) and deletes the lines that contain it.

Upvotes: 1

chapelo

Reputation: 2562

See whether using filter gives you an advantage:

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    # '\t' is a real tab; the raw string r'\t' would only match a backslash followed by "t"
    fout.write(''.join(filter(lambda l: '\t' not in l, fin)))

Check that the condition '\t' not in l works with your file. You may need to test for a run of spaces instead of \t, perhaps with a regex. I had to insert the tabs into my file.txt by hand for the code to work. That is why I tried a regex substitution instead:

import re

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    fout.write(re.sub(r'^\d+\s{2,}[^\n]+', '', fin.read(), count=0, flags=re.M))

Only now I get an empty line instead of the line you want to eliminate.

Got it: the regex needs a \n at the end so that the line break is removed as well:

    fout.write(re.sub(r'^\d+\s{2,}[^\n]+\n', '', fin.read(), count=0, flags=re.M))
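A small sketch of why the trailing \n matters, using an in-memory string instead of a file. The sample text is made up for illustration; the two patterns are the ones shown above. Note that \s{2,} assumes at least two whitespace characters after the number, so a file that uses a single tab there would need \s+ instead.

```python
import re

# Hypothetical sample: numbered header lines use three spaces after the number.
sample = (
    "0   absinth\n"
    "Bohemian-style absinth\n"
    "\n"
    "1   acidophilus milk\n"
    "Sweet acidophilus milk\n"
)

# Without \n in the pattern, only the text of the line is removed,
# so its line break survives as an empty line.
without_newline = re.sub(r'^\d+\s{2,}[^\n]+', '', sample, flags=re.M)

# With \n in the pattern, the whole line disappears, line break included.
with_newline = re.sub(r'^\d+\s{2,}[^\n]+\n', '', sample, flags=re.M)

print(repr(without_newline))
print(repr(with_newline))
```

The first result keeps an empty line where each numbered line used to be; the second does not.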

Upvotes: 0

gboffi

Reputation: 25023

Your code is OK. You could try to optimize it by looking only at the beginning of the string:

if '\t' not in l[:5]: fout.write(l)

where the length of the substring depends on the maximum record number. It could make a difference with long lines that don't match, who knows...
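Another variant that may be worth timing is to work on bytes rather than text, so that no line is decoded or re-encoded. This is only a sketch and was not part of the measurements in this answer; the function name drop_tab_lines is made up for illustration.

```python
def drop_tab_lines(src_path, dst_path):
    """Copy src_path to dst_path, skipping every line that contains a tab."""
    # Binary mode avoids decoding each line to str; b'\t' is the tab byte.
    with open(src_path, 'rb') as fin, open(dst_path, 'wb') as fout:
        # The generator streams line by line, so the file is never held in memory.
        fout.writelines(line for line in fin if b'\t' not in line)
```

It would be called as drop_tab_lines('file.txt', 'file2.txt'). Whether it actually beats the text-mode loop depends on the Python version and the data, so measure before relying on it.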

Further, you may want to test mawk, grep, etc., as in

# Edit: the following won't work, it also strips blank lines
# mawk -F"\t" "NF==1"  original > stripped
# $'\t' (bash, zsh, ksh) is a literal tab; with -F, "\t" would match a backslash followed by "t"
grep -vF $'\t'       original > stripped
sed -e "/\t/d"       original > stripped

to see whether they are faster than a Python solution.

Testing

On my system, with a file obtained by repeatedly duplicating yours (size 1,418,973,184 bytes), I get approximate times as follows: grep 1.6s, sed 6.4s, python 4.6s. The Python run time does not depend measurably on whether the whole string or only a substring is searched.

Addendum

I tested Jidder's awk solution (as given in a comment to the OP) using mawk; my approximate timing is 3.2s. Here, for what it's worth, the winner is grep -vF.

Testing transcript

The run times vary by a couple of tenths of a second between executions. Here I report only one timing for each command, so for close results one can't make a clear decision.

On the other hand, different tools gave results much further apart than the experimental error, and I think that we can draw some conclusions...

% ls -l original 
-rw-r--r-- 1 boffi boffi 1418973184 Dec  8 21:33 original
% cat doit.py
from sys import stdout
with open('original', 'r') as fin:
  for line in fin:
    if '\t' in line: continue
    else: stdout.write(line)
% time wc -l original 
15731133 original

real    0m0.407s
user    0m0.184s
sys     0m0.220s
% time python doit.py | wc -l
12584034

real    0m5.334s
user    0m4.880s
sys     0m1.428s
% time grep -vF "       "  original | wc -l
12584035

real    0m1.527s
user    0m1.112s
sys     0m1.400s
% time grep -v "        "  original | wc -l
12584035

real    0m1.556s
user    0m1.120s
sys     0m1.436s
% time sed -e "/\t/d"  original | wc -l
12584034

real    0m6.481s
user    0m6.104s
sys     0m1.404s
% time mawk '!/\t/'  original | wc -l
12584035

real    0m3.059s
user    0m2.608s
sys     0m1.488s
% time gawk '!/\t/'  original | wc -l
12584035

real    0m9.148s
user    0m8.680s
sys     0m1.468s
%

My example file has a truncated last line, hence the off-by-one difference in line counts between python and sed on one side, and all the other tools on the other.
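The off-by-one can be reproduced in miniature. wc -l counts newline characters, so a final line without a trailing newline is not counted. Tools that pass the last line through unchanged therefore report one line fewer than tools that terminate every output line. A sketch with made-up data:

```python
# Hypothetical data: three tab-free lines, the last one truncated (no trailing newline).
data = "alpha\nbeta\ngamma"

lines = data.splitlines(keepends=True)

# Pass-through, as the Python loop does: the last line stays unterminated.
passthrough = "".join(lines)

# Re-terminated: every output line ends with a newline, as grep and awk print it.
terminated = "".join(line.rstrip("\n") + "\n" for line in lines)

# wc -l counts newline characters, not lines.
print(passthrough.count("\n"))  # 2
print(terminated.count("\n"))   # 3
```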

Upvotes: 2

Denis Korchuganov

Reputation: 163

Try grep with a Perl-style regular expression (the -P option of GNU grep):

grep -vP "\t" file.in > file.out

Upvotes: 0

Ed Morton

Reputation: 203522

# $'\t' passes a literal tab; a plain '\t' is not a tab escape in grep's default syntax
grep -v $'\t' file

Upvotes: 0

unixmiah

Reputation: 3145

You can try it with tr:

tr -d " \t" < tabbed-file.txt > sanitized-file.txt

Note that this deletes the space and tab characters themselves, not the lines that contain them. From man tr:

tr - translate or delete characters

To remove all leading whitespace, including tabs, up to the first word, use sed:

echo " This is a test" | sed -e 's/^[ \t]*//'
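For comparison, here is a rough Python rendering of what these two commands do to a line. Neither removes the line itself, which is what the question asks for. The first sample line is hypothetical.

```python
import re

line = "0\tabsinth and more\n"  # hypothetical input line containing a tab

# Like tr -d " \t": delete every space and tab character, keep the rest of the line.
no_blanks = line.translate(str.maketrans("", "", " \t"))

# Like sed 's/^[ \t]*//': strip spaces and tabs only at the start of the line.
no_leading = re.sub(r"^[ \t]*", "", " \tThis is a test\n")

print(repr(no_blanks))   # '0absinthandmore\n'
print(repr(no_leading))  # 'This is a test\n'
```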

Upvotes: -1
