mava

Reputation: 2854

Split long string into array and keep delimiter

I've got a strange edge-case.
I have a long string which contains \n (newline characters).
So the string looks something like:

text="loremipsum\nDollor sit atmet \n aliquyam erat, 
sed diam\naliquyam erat \n sed diam"

I need to split the string into an array, but keep the newline characters uninterpreted, so the array/output looks like:

"loremipsum\n"
"Dollor sit atmet \n"
"aliquyam erat, sed diam\n"
"aliquyam erat \n"
"sed diam"

I couldn't find a way to split the string and preserve the \n characters.
If I use IFS=$"\n" the \n characters are deleted,
but if I use IFS="\n" it gets split and deletes all occurrences of n.
I tried it like:

IFS=$"\n" read -d '' -a arr <<<"$text"

How can I solve this?

Clarification/Update

The text is dynamic and can be very long (3000+ chars),
so creating the array like: declare -a arr=([0]=$'loremipsum\n'... is not an option.
The \n characters (0x5c + 0x6e in ASCII) should all be treated the same;
they should not be replaced with an actual newline.
The \n characters must be preserved,
because the program which gets the output looks for these in plaintext.
The \n characters can be at any position in a sentence,
also in a word like:
lor\nem or with spaces: Lorem \n ipsum
So the \n characters must be at the end of the elements inside the array, like shown above.
The text must only be split at \n, not at spaces etc.

Upvotes: 1

Views: 98

Answers (2)

David C. Rankin

Reputation: 84521

You can use process substitution and echo, e.g.

text="loremipsum\nDollor sit atmet \n aliquyam erat, sed diam\naliquyam erat \n sed diam"
readarray arr < <(echo -e "$text")

You can also use printf in the process substitution, e.g.

< <(printf '%b' "$text")

Since the -t option is not given to readarray, the '\n' is included as part of each array element.
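To illustrate the effect of -t, here is a minimal sketch, assuming bash 4 or later (where readarray exists). The variable names sample, with_nl and stripped are made up for the illustration.

```shell
#!/usr/bin/env bash
# Hypothetical two-line input; printf '%s' passes it through unchanged.
sample=$'first\nsecond\n'

# Without -t each element keeps its trailing newline.
readarray with_nl < <(printf '%s' "$sample")

# With -t the trailing newline is removed from each element.
readarray -t stripped < <(printf '%s' "$sample")

# The first elements differ in length by one: the newline counts as a character.
echo "${#with_nl[0]} ${#stripped[0]}"   # prints: 6 5
```

Note that this applies to real newline characters only; readarray does not look at a literal backslash followed by n.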

Example Use/Output

Adding a declare -p arr to output the array, you would have:

text="loremipsum\nDollor sit atmet \n aliquyam erat, sed diam\naliquyam erat \n sed diam"
readarray arr < <(echo -e "$text")
declare -p arr
declare -a arr=([0]=$'loremipsum\n' [1]=$'Dollor sit atmet \n' [2]=$' aliquyam erat, sed diam\n' [3]=$'aliquyam erat \n' [4]=$' sed diam\n')

If you want to trim a leading whitespace character, you can use the parameter expansion ${element#[[:space:]]}. Up to you.
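If elements may start with more than one whitespace character, one possible way to trim the whole leading run is sketched below. The array contents are taken from the output above; the names trimmed and lead are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sample elements as produced by readarray without -t.
arr=($'loremipsum\n' $' aliquyam erat, sed diam\n' $'   sed diam\n')

trimmed=()
for element in "${arr[@]}"; do
    # Cut everything from the first non-whitespace character onward,
    # which leaves only the leading whitespace run (possibly empty).
    lead=${element%%[![:space:]]*}
    # Remove that run from the front; the trailing newline is untouched.
    trimmed+=("${element#"$lead"}")
done

declare -p trimmed
```

Quoting "$lead" inside the expansion makes bash treat it as literal text rather than as a pattern.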

Upvotes: 3

markp-fuso

Reputation: 33984

My understanding from the sample (input/output) data given:

  • there is one actual newline character in text (between erat, and sed diam); it is to be removed and, assuming there is no (space) after erat, replaced with a (space)
  • there are 4 literal strings of \ + n; the array is to be split after these literals, and the literal \ + n remains in the text stored in the array
  • a leading space is to be removed from the array values
  • I'm assuming the final results should not include the double quotes (i.e., the OP included the double quotes in the desired output only to delimit the array values for display purposes)
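Under that reading, the steps can be sketched with bash parameter expansions alone, without word splitting or external commands. This is only an illustration of the assumptions listed above; the names rest and piece are made up for it.

```shell
#!/usr/bin/env bash
text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"

# Replace the real newline character with a space.
text=${text//$'\n'/ }

arr=()
rest=$text
# Loop while a literal backslash + n is still present.
while [[ $rest == *'\n'* ]]; do
    piece=${rest%%'\n'*}    # text before the first literal \n
    rest=${rest#*'\n'}      # text after it
    piece=${piece# }        # drop one leading space, if any
    arr+=("${piece}\\n")    # store the piece with the literal \n re-attached
done

# Whatever is left has no delimiter; keep it if it is non-empty.
rest=${rest# }
if [[ -n $rest ]]; then
    arr+=("$rest")
fi

# Show each element between double quotes.
printf '"%s"\n' "${arr[@]}"
```

Because nothing is left unquoted for the shell to split or glob, characters such as * or ? in the text are stored unchanged.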

One idea:

text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"

# convert actual newline character to a (space)

text=${text//$'\n'/ }

# add an actual newline character after the literal `\` + `n`

text=${text//\\n/\\n$'\n'}

# print our value, remove leading (space), and load into array

IFS=$'\n' arr=( $(printf "%s\n" "${text}." | sed 's/^ //g') )

# display array

typeset -p arr
declare -a arr=([0]="loremipsum\\n" [1]="Dollor sit atmet \\n" [2]="aliquyam erat, sed diam\\n" [3]="aliquyam erat \\n" [4]="sed diam.")

# loop through array and display individual strings; add double quotes as delimiters for display purposes

for i in "${!arr[@]}"
do
   echo "\"${arr[${i}]}\""
done

"loremipsum\n"
"Dollor sit atmet \n"
"aliquyam erat, sed diam\n"
"aliquyam erat \n"
"sed diam."

Upvotes: 2
