Reputation: 122052
How to remove lines with tab in them?
I've a file that looks like this:
0 absinth
Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.
1 acidophilus milk
Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk
2 adobo
Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.
The desired output has the lines containing tabs removed, i.e.:
Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.
Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk
Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.
I could do the following in python to achieve the same results:
with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    for line in fin:
        if '\t' in line:
            continue
        else:
            fout.write(line)
But I have millions of lines and it's not that efficient. So I tried this to remove the 2nd column with cut and then remove lines with a single character:
$ cut -f1 WIKI_WN_food | awk 'length>1' | less
What is a more pythonic way to get the desired output?
Is there a more efficient way than the cut + awk piping I've shown above?
Upvotes: 0
Views: 135
Reputation: 8412
You can do this with sed:
sed '/\t/d' 'my_file'
This looks for "\t" and deletes the lines that contain it (the \t escape is understood by GNU sed).
Upvotes: 1
Reputation: 2562
Try whether using filter gives you an advantage:
with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    # keep only the lines that contain no tab character
    fout.write(''.join(filter(lambda l: '\t' not in l, fin.readlines())))
Test whether the condition '\t' not in l works with your file (note that it must be '\t', not the raw string r'\t', which is a backslash followed by t rather than a tab). You may need to test for a run of spaces instead of \t, perhaps with a regex. I had to hand-code the \t into my file.txt for the code to work, which is why I tried a regex substitution instead:
import re
with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
    fout.write(re.sub(r'^\d+\s{2,}[^\n]+', '', fin.read(), count=0, flags=re.M))
Only now I get an empty line in place of each line you want to eliminate.
GOT IT: the regex needs a \n at the end to work:
    fout.write(re.sub(r'^\d+\s{2,}[^\n]+\n', '', fin.read(), count=0, flags=re.M))
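Both variants above read the whole file into memory (readlines() and read()), which may matter with millions of lines. A minimal streaming sketch of the same filter follows; it assumes the separator is a real tab character, and the names drop_tab_lines and filter_file are made up for illustration:

```python
import io


def drop_tab_lines(lines):
    """Lazily yield only the lines that contain no tab character."""
    return (line for line in lines if '\t' not in line)


def filter_file(src_path, dst_path):
    # Iterating over the file object reads one line at a time,
    # so memory use does not grow with the size of the file.
    with open(src_path, 'r') as fin, open(dst_path, 'w') as fout:
        fout.writelines(drop_tab_lines(fin))


if __name__ == '__main__':
    # Small in-memory demonstration instead of a real file.
    sample = io.StringIO(
        "0\tabsinth\nBohemian-style absinth\n1\tacidophilus milk\n"
    )
    print(''.join(drop_tab_lines(sample)), end='')
```

Usage would be filter_file('file.txt', 'file2.txt'). Whether this is faster than the loop in the question is not something measured here.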
Upvotes: 0
Reputation: 25023
Your code is OK; you could try to optimize it by looking only at the beginning of the string:
    if '\t' not in l[:5]: fout.write(l)
where the length of the substring depends on the maximum record number. It could make a difference with long strings that don't match, who knows...
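As a sketch of the same idea without creating a slice, str.find accepts start and end bounds. The PREFIX_LEN value and the function name are assumptions for illustration; pick the bound from your largest record number. Note that this changes the behaviour slightly: a line whose only tab appears after the prefix is kept.

```python
PREFIX_LEN = 5  # hypothetical bound: record numbers up to 9999 plus the tab


def has_leading_tab(line, prefix_len=PREFIX_LEN):
    """Return True if a tab occurs within the first prefix_len characters."""
    # str.find(sub, start, end) searches only line[start:end]
    # and does not copy the substring.
    return line.find('\t', 0, prefix_len) != -1


if __name__ == '__main__':
    for line in ["12\tadobo\n", "Adobo\n"]:
        print(repr(line), has_leading_tab(line))
```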
Further, you may want to test mawk, grep etc., as in
# Edit: the following won't work, it also strips blank lines
# mawk -F"\t" "NF==1" original > stripped
# with -F the pattern is taken literally, so pass a real tab character
grep -vF "$(printf '\t')" original > stripped
sed -e "/\t/d" original > stripped
to see if it's faster than a python solution.
On my system, with a file obtained by repeatedly duplicating yours (size 1,418,973,184 bytes), I get approximate times as follows: grep 1.6s, sed 6.4s, python 4.6s. The Python run time does not depend measurably on whether the whole string or only a substring is searched.
I tested Jidder's awk solution (as given in a comment to the OP) using mawk; my approximate timing is 3.2s. Here, for what it's worth, the winner is grep -vF.
The run times vary by a few tenths of a second between executions. Here I'm going to report only one timing run for each command, so for close results one can't make a clear decision.
On the other hand, different tools gave results much further apart than the experimental error, and I think that we can draw some conclusions...
% ls -l original
-rw-r--r-- 1 boffi boffi 1418973184 Dec 8 21:33 original
% cat doit.py
from sys import stdout
with open('original', 'r') as fin:
    for line in fin:
        if '\t' in line: continue
        else: stdout.write(line)
% time wc -l original
15731133 original
real 0m0.407s
user 0m0.184s
sys 0m0.220s
% time python doit.py | wc -l
12584034
real 0m5.334s
user 0m4.880s
sys 0m1.428s
% time grep -vF "	" original | wc -l
12584035
real 0m1.527s
user 0m1.112s
sys 0m1.400s
% time grep -v "	" original | wc -l
12584035
real 0m1.556s
user 0m1.120s
sys 0m1.436s
% time sed -e "/\t/d" original | wc -l
12584034
real 0m6.481s
user 0m6.104s
sys 0m1.404s
% time mawk '!/\t/' original | wc -l
12584035
real 0m3.059s
user 0m2.608s
sys 0m1.488s
% time gawk '!/\t/' original | wc -l
12584035
real 0m9.148s
user 0m8.680s
sys 0m1.468s
%
My example file has a truncated last line, hence the off-by-one difference in line counts between python and sed on one side, and all the other tools on the other.
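One plausible reading of that off-by-one, sketched below on made-up data: wc -l counts newline characters, grep and awk terminate the final output line with a newline, while Python and GNU sed write it back as it was read. The sketch assumes the truncated last line contains no tab and therefore survives the filter.

```python
# Made-up sample whose last line is cut off, i.e. has no final newline.
data = b"0\tabsinth\nfirst kept line\nsecond kept line, cut off"

# Keep the lines without a tab, preserving their original line endings.
kept = [line for line in data.splitlines(keepends=True) if b'\t' not in line]

# Python (and GNU sed) write the last line back unchanged.
as_written = b''.join(kept)

# grep and awk end every output line with a newline.
newline_terminated = b''.join(
    line if line.endswith(b'\n') else line + b'\n' for line in kept
)

# wc -l counts newline characters, not lines.
print(as_written.count(b'\n'))          # 1
print(newline_terminated.count(b'\n'))  # 2
```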
Upvotes: 2
Reputation: 163
Try grep with a Perl-style regular expression:
grep -vP "\t" file.in > file.out
Upvotes: 0
Reputation: 3145
You can try it with tr (note that this deletes the space and tab characters themselves, not the lines that contain them):
tr -d " \t" < tabbed-file.txt > sanitized-file.txt
man tr
tr - translate or delete characters
--
You can also try it with sed. To remove all leading whitespace, including tabs, up to the first word, issue:
echo " This is a test" | sed -e 's/^[ \t]*//'
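To make the difference between deleting characters and deleting lines concrete, here is a small Python sketch on made-up sample text. str.translate plays the role of tr -d, and the second expression is the line filter the question asks for:

```python
text = "0\tabsinth\nBohemian-style absinth\n"

# Rough equivalent of tr -d " \t": delete the characters, keep every line.
chars_deleted = text.translate(str.maketrans('', '', ' \t'))

# What the question asks for: drop whole lines that contain a tab.
lines_dropped = ''.join(
    line for line in text.splitlines(keepends=True) if '\t' not in line
)

print(repr(chars_deleted))   # '0absinth\nBohemian-styleabsinth\n'
print(repr(lines_dropped))   # 'Bohemian-style absinth\n'
```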
Upvotes: -1