Nathan Pk

Reputation: 749

Each word on a separate line

I have a sentence like

This is for example

I want to write this to a file such that each word in this sentence is written to a separate line.

How can I do this in shell scripting?

Upvotes: 23

Views: 32745

Answers (8)

glenn jackman

Reputation: 246764

Nobody has suggested bash's builtin read command:

s='This is for example'
read -ra words <<< "$s"
printf '%s\n' "${words[@]}"
This
is
for
example

The data is fully quoted at all times so it's not subject to filename expansion.

The current value of $IFS will control the splitting. The default value is space-tab-newline: IFS=$' \t\n'
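
To illustrate the IFS point, here is a minimal sketch that splits on commas instead of whitespace. The comma-separated input is made up for the example, and the IFS assignment is placed in front of read so that it should apply to that one command only (bash assumed):

```shell
# Made-up input: the same words as in the question, but comma-separated.
s='This,is,for,example'

# IFS is set for this read only, so the shell's global IFS is left untouched.
IFS=',' read -ra words <<< "$s"

# Print one array element per line.
printf '%s\n' "${words[@]}"
```

This prints the same four lines as above, one word per line.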

Upvotes: 0

anupamkrishna

Reputation: 76

Use the fmt command

>> echo "This is for example" | fmt -w1 > textfile.txt ; cat textfile.txt
This
is
for
example

For a full description of fmt and its options, check out its man page.

Upvotes: 3

Pryftan

Reputation: 218

N.B. I wrote this in a few drafts simplifying the regexp so if there's any inconsistency that's probably why.

Do you care about punctuation marks? With some invocations a 'word' like (etc) would come out exactly like that, parentheses included, or the word would be 'parentheses.' rather than 'parentheses'. If you're parsing a file with proper sentences that could be a problem, especially if you want to sort by word or get a count for each word.

There are ways to deal with this but there are some caveats and certainly there's room for improvement. These have to do with numbers, dashes (in numbers) and decimal points/dots (in numbers). Perhaps having an exact set of rules would help resolve this but the examples below can give you some things to work on. I have made some contrived input examples to demonstrate these flaws (or whatever you wish to call them).

$ echo "This is an example sentence with punctuation marks and digits i.e. , . ; \! 7 8 9" | grep -o -E '\<[A-Za-z0-9.]*\>'
This
is
an
example
sentence
with
punctuation
marks
and
digits
i.e
7
8
9

As you can see, i.e. turns out to be just i.e and the other punctuation marks are not shown. Okay, but this leaves out things like version numbers in the form major.minor.revision-release, e.g. 0.0.1-1; can this be shown too? Yes:

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[-A-Za-z0-9.]*\>'
The
current
version
is
0.0.1-1
The
previous
version
was
current
from
2017-2018

Observe that the full stop ending each sentence is not part of the output. What happens if you add a space between the years and the dash? You won't have the dash but each year will be on its own line:

$ echo "2017 - 2018" | grep -o -E '\<[-A-Za-z0-9.]*\>'
2017
2018

The question then becomes whether you want a - by itself to be counted; by the very nature of separating words you won't have the years as a single string if there are spaces. Because it's not a word by itself I would think not.

I am sure these could be simplified further. In addition if you don't want any punctuation or numbers at all you could change it to:

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z]*\>'
The
current
version
is
The
previous
version
was
current
from

If you wanted to have the numbers:

$ echo "The current version is 0.0.1-1. The previous version was current from 2017-2018."|grep -o -E '\<[A-Za-z0-9]*\>'
The
current
version
is
0
0
1
1
The
previous
version
was
current
from
2017
2018

As for 'words' with both letters and numbers that's another thing that might or might not be of consideration but demonstrating the above:

$ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z0-9]*\>'
The
current
version
is
0
0
1
1
test1

Outputs them. But the following does not (because it doesn't consider numbers at all):

$ echo "The current version is 0.0.1-1. test1."|grep -o -E '\<[A-Za-z]*\>'
The
current
version
is

It's quite easy to disregard punctuation marks but in some cases there might be a need or desire for them. In the case of e.g., I suppose you could use, say, sed to change lines like e.g back to e.g. but that would be a personal preference, I guess.

I can summarise how it works but only just; I’m far too tired to even consider much:

How does it work?

I will only explain the invocation grep -o -E '\<[-A-Za-z0-9.]*\>' but much of it is the same in the others (the vertical bar/pipe symbol in extended grep allows for more than one pattern):

The -o option is for only printing matches rather than the entire line. The -E is for extended grep (could just as well have used egrep). As for the regexp itself:

The \< and \> are word boundaries (beginning and ending respectively - you can specify only one if you want); I believe the -w option is the same as specifying both but maybe the invocation is a bit different (I don't actually know).

The '\<[-A-Za-z0-9.]*\>' says: dashes, upper and lower case letters, digits and dots, zero or more times, between word boundaries. As for why it then turns e.g. into e.g: \> only matches directly after a word character (a letter, digit or underscore), so a match can never end on a dot and the trailing one is dropped.
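
The boundary behaviour described above can be checked with a short sketch. It assumes GNU grep (where \< and \> are supported); extract_words is a hypothetical helper name and the input sentence is made up, only the pattern comes from this answer:

```shell
# extract_words is a made-up wrapper around the grep invocation explained above.
# It reads text on standard input and prints one matched word per line.
extract_words() {
    grep -o -E '\<[-A-Za-z0-9.]*\>'
}

# Contrived input with a trailing-dot abbreviation, a version number and parentheses.
printf '%s\n' 'See e.g. version 0.0.1-1 (etc).' | extract_words
```

With GNU grep this should print See, e.g, version, 0.0.1-1 and etc on separate lines: the dots and dash inside 0.0.1-1 survive because the match still ends on a digit, while the trailing dot of e.g. and the parentheses around etc are left out.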

Bonus script for word frequency count

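#!/bin/bash

# Print a word frequency table for each file named on the command line.
if [ $# -eq 0 ]; then
    echo "Usage: $(basename "${0}") <FILE> [FILE...]"
    exit 1
fi

for file do
    if [ -e "${file}" ]
    then
        echo "** ${file}: "
        grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}" | sort | uniq -c | sort -rn
    else
        # Report the file that is actually missing, not the first argument.
        echo >&2 "${file}: file not found"
        continue
    fi
done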

Example:

$ cat example 
The current version is 0.0.1-1 but the previous version was non-existent.

This sentence contains an abbreviation i.e. e.g. (so actually two abbreviations).

This sentence has no numbers and no punctuation  
$ ./wordfreq example 
** example: 
   2 version
   2 sentence
   2 no
   2 This
   1 was
   1 two
   1 the
   1 so
   1 punctuation
   1 previous
   1 numbers
   1 non-existent
   1 is
   1 i.e
   1 has
   1 e.g
   1 current
   1 contains
   1 but
   1 and
   1 an
   1 actually
   1 abbreviations
   1 abbreviation
   1 The
   1 0.0.1-1

N.B. I didn't transliterate upper case to lower case so the words 'The' and 'the' show up as different words. If you wanted them to be all lower case you could change the grep invocation in the script to be piped to tr before sorting:

    grep -o -E '\<[-A-Za-z0-9.]*\>' "${file}" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn

Oh and since you asked if you want to write it to a file you can just add to the command line (this is for the raw invocation):

> output_file

For the script you would use it like:

$ ./wordfreq file1 file2 file3 > output_file

Upvotes: 4

Gilles Quénot

Reputation: 185015

Try using:

string="This is for example"

printf '%s\n' $string > filename.txt

or taking advantage of word-splitting

string="This is for example"

for word in $string; do
    echo "$word"
done > filename.txt
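
One caveat with the unquoted $string: besides word splitting, the shell also performs filename expansion, so a word such as * would be replaced by the names of the files in the current directory. A minimal sketch of one way around that, assuming bash or another POSIX shell; split_words is a hypothetical helper name, not something from this answer:

```shell
# split_words is a made-up helper. Its body runs in a subshell (note the
# parentheses) so that set -f does not leak into the calling shell.
split_words() (
    set -f              # disable filename expansion (globbing)
    printf '%s\n' $1    # unquoted on purpose: word splitting separates the words
)

# The * is written out literally instead of being expanded to file names.
split_words 'This is * for example' > filename.txt
```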

Upvotes: 9

Sepero

Reputation: 4677

$ echo "This is for example" | xargs -n1
This
is
for
example

Upvotes: 26

koola

Reputation: 1734

Try using:

str="This is for example"
echo -e ${str// /\\n} > file.out

Output

> cat file.out 
This
is
for
example
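
Note that echo -e also interprets any backslash sequences that happen to be in the data itself, and the unquoted expansion is open to filename expansion. A small sketch of a variant that avoids both by substituting a real newline and printing with printf; one_per_line and nl are names made up for this example (bash assumed):

```shell
# one_per_line is a hypothetical helper: it prints its first argument with
# every space replaced by a newline.
one_per_line() {
    local nl=$'\n'                  # a literal newline character
    printf '%s\n' "${1// /$nl}"     # quoted, so no globbing and no escape processing
}

one_per_line "This is for example" > file.out
```

Like the original, this produces an empty line for every extra space when words are separated by more than one.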

Upvotes: 2

sampson-chen

Reputation: 47267

A couple ways to go about it, choose your favorite!

echo "This is for example" | tr ' ' '\n' > example.txt

or simply do this to avoid using echo unnecessarily:

tr ' ' '\n' <<< "This is for example" > example.txt

The <<< notation is a here-string: it feeds the string to the command's standard input.

Or, use sed instead of tr (the \n in the replacement is understood by GNU sed; other sed implementations may not support it):

sed "s/ /\n/g" <<< "This is for example" > example.txt
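
If the input may contain runs of spaces or tabs, translating each space individually leaves empty lines in the output. A hedged sketch using tr's -s (squeeze) option together with the [:space:] character class; squeeze_split is a made-up helper name and the input is contrived:

```shell
# squeeze_split is a hypothetical helper that reads text on standard input.
squeeze_split() {
    # Turn every whitespace character into a newline, then squeeze runs of
    # newlines into one so that no empty lines appear between words.
    tr -s '[:space:]' '\n'
}

printf 'This  is\tfor   example\n' | squeeze_split > example.txt
```

Leading whitespace in the input would still yield one empty first line, so this is a sketch rather than a complete tokenizer.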

For still more alternatives, check others' answers =)

Upvotes: 35

Jonathan Leffler

Reputation: 753585

example="This is for example"
printf "%s\n" $example

Upvotes: 6
