Reputation: 1621

copying first string into second line

I have a text file in this format:

abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375 Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375 aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375 abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625

Here I call the first string before the first space as word (for example abacısı)

The string which starts with after first space and ends with integer is definition (for example Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875)

I want to do this: If a line includes more than one definition (first line has one, second line has two, third line has three), apply newline and put the first string (word) into the beginning of the new line. Expected output:

abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625

I have almost 1.500.000 lines in my text file and the number of definition is not certain for each line. It can be 1 to 5

Upvotes: 1

Answers (9)

Casimir et Hippolyte

Reputation: 89557

With perl:

perl -a -F'[^]:]\K\h' -ne 'chomp(@F);$p=shift(@F);print "$p ",shift(@F),"\n" while(@F);' yourfile.txt

With bash:

while read -r line
do
    pre=${line%% *}
    echo "$line" | sed 's/\([0-9]\) /\1\n'$pre' /g'
done < "yourfile.txt"

This script read the file line by line. For each line, the prefix is extracted with a parameter expansion (all until the first space) and spaces preceded by a digit are replaced with a newline and the prefix using sed.

edit: as tripleee suggested it, it's much faster to do all with sed:

sed -i.bak ':a;s/^\(\([^ ]*\).*[0-9]\) /\1\n\2 /;ta' yourfile.txt

Upvotes: 2

tripleee

Reputation: 189397

I would approach this with one of the excellent Awk answers here; but I'm posting a Python solution to point to some oddities and problems with the currently accepted answer:

It reads the entire input file into memory before processing it. This is harmless for small inputs, but the OP mentions that the real-world input is kind of big.
It needlessly uses re when simple whitespace tokenization appears to be sufficient.

I would also prefer a tool which prints to standard output, so that I can redirect it where I want it from the shell; but to keep this compatible with the earlier solution, this hard-codes output.txt as the destination file.

with open('input.txt', 'r') as input:
  with open('output.txt', 'w') as output:
    for line in input:
      tokens = line.rstrip().split()
      word = tokens[0]
      for idx in xrange(1, len(tokens), 3):
          print(word, ' ', ' '.join(tokens[idx:idx+3]), file=output)

If you really, really wanted to do this in pure Bash, I suppose you could:

while read -r word analyses; do
    set -- $analyses
    while [ $# -gt 0 ]; do
        printf "%s %s %s %s\n" "$word" "$1" "$2" "$3"
        shift; shift; shift
    done
done <input.txt >output.txt

Upvotes: 1

anishsane

Reputation: 20980

Using perl:

$ perl -nE 'm/([^ ]*) (.*)/; my $word=$1; $_=$2; say $word . " " . $_ for / *(.*?[0-9]+\.[0-9]+)/g;' < input.log

Output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625

Explanation:

Split the line to separate first field as word.
Then split the remaining line using the regex .*?[0-9]+\.[0-9]+.
Print word concatenated with every match of above regex.

Upvotes: 1

Varun

Reputation: 477

Please find the following bash code

    #!/bin/bash
    # read.sh
    while read variable
    do
            for i in "$variable"
            do
                    var=`echo "$i" |wc -w`
                    array_1=( $i )
                    counter=0
                    for((j=1 ; j < $var ; j++))
                    do
                            if [ $counter = 0 ]  #1
                            then
                                    echo -ne ${array_1[0]}' '
                            fi #1
                            echo -ne ${array_1[$j]}' '
                            counter=$(expr $counter + 1)
                            if [ $counter = 3 ] #2
                            then
                                    counter=0
                                    echo
                            fi #2
                    done
            done
    done

I have tested and it is working. To test On bash shell prompt give the following command

     $ ./read.sh < input.txt > output.txt

where read.sh is script , input.txt is input file and output.txt is where output is generated

Upvotes: 0

Adam Katz

Reputation: 16138

I am assuming the following format:

word definitionkey : definitionvalue [definitionkey : definitionvalue …]

None of those elements may contain a space and they are always delimited by a single space.

The following code should work:

awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }' file

Explanation (this is the same code but with comments and more spaces):

awk '
  # match any line
  {
    # iterate over each "key : value"
    for (i=2; i<=NF; i+=3)
      print $1, $i, $(i+1), $(i+2)  # prints each "word key : value"
  }
' file

awk has some tricks that you may not be familiar with. It works on a line-by-line basis. Each stanza has an optional conditional before it (awk 'NF >=4 {…}' would make sense here since we'll have an error given fewer than four fields). NF is the number of fields and a dollar sign ($) indicates we want the value of the given field, so $1 is the value of the first field, $NF is the value of the last field, and $(i+1) is the value of the third field (assuming i=2). print will default to using spaces between its arguments and adds a line break at the end (otherwise, we'd need printf "%s %s %s %s\n", $1, $i, $(i+1), $(i+2), which is a bit harder to read).

Upvotes: 2

Benjamin W.

Reputation: 52142

Bash and grep:

#!/bin/bash

while IFS=' ' read -r in1 in2 in3 in4; do
    if [[ -n $in4 ]]; then
        prepend="$in1"
        echo "$in1 $in2 $in3 $in4"
    else
        echo "$prepend $in1 $in2 $in3"
    fi
done < <(grep -o '[[:alnum:]][^:]\+ : [[:digit:].]\+' "$1")

The output of grep -o is putting all definitions on a separate line, but definitions originating from the same line are missing the "word" at the beginning:

abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625

The for loop now loops over this, using a space as the input file separator. If in4 is a zero length string, we're on a line where the "word" is missing, so we prepend it.

The script takes the input file name as its argument, and saving output to an output file can be done with simple redirection:

./script inputfile > outputfile

Upvotes: 1

andreas-hofmann

Reputation: 2518

Small python script does the job. Input is expected in input.txt, output gotes to output.txt.

import re

rf = re.compile('([^\s]+\s).+')
r = re.compile('([^\s]+\s\:\s\d+\.\d+)')

with open("input.txt", "r") as f:
    text = f.read()

with open("output.txt", "w") as f:
    for l in text.split('\n'):
        offset = 0
        first = ""
        match = re.search(rf, l[offset:])
        if match:
            first = match.group(1)
            offset = len(first)
        while True:
            match =  re.search(r, l[offset:])
            if not match:
                break
            s = match.group(1)
            offset += len(s)
            f.write(first + " " + s + "\n")

Upvotes: 3

repzero

Reputation: 8412

here is a sed in action

sed -r '/^indirger(ken|di)/{s/([0-9]+[.][0-9]+ )(indirge)/\1\n\2/g}' my_file

output

indirgerdi indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]+YDH[Past] : 22.2626953125 
indirge[Verb]+[Pos]+Hr[Aor]+YDH[Past]+[A3sg] : 18.720703125
indirgerken indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]-Yken[Adv+While] : 19.6201171875

Upvotes: -1

glenn jackman

Reputation: 246807

Assuming there are always 4 space-separated words for each definition:

awk '{for (i=1; i<NF; i+=4) print $i, $(i+1), $(i+2), $(i+3)}' file

Or if the split should occur after that floating point number

perl -pe 's/\b\d+\.\d+\K\s+(?=\S)/\n/g' file

(This is the perl equivalent of Avinash's answer)

Upvotes: 1

copying first string into second line

Answers (9)

Related Questions