Reputation: 223

Extracting a list of files and creating a new file containing this list [Part 2]

I asked a similar question earlier:

Extracting a list of files and creating a new file containing this list

However, this time it is more challenging:

I am currently dealing with a folder containing about 1000 files and I have to extract some filenames from this folder and create another file (configuration file) containing these filenames.

The folder has filenames in the following format:

1_Apple_A1_someword.txt 
1_Apple_A2_someword.txt
2_Apple_A1_someword.txt 
2_Apple_A2_someword.txt 
3_Apple_A1_someword.txt 
3_Apple_A2_someword.txt

and so on up until

1000_Apple_A1_someword.txt
1000_Apple_A2_someword.txt

I want to create another file that holds a 'label' (a Unix variable) for each pair of files. The value of each label is the names of its two files, separated by a tab. The label itself is part of the filename (everything up to and including the word "Apple"). For example:

1_Apple=1_Apple_A1_someword.txt 1_Apple_A2_someword.txt
2_Apple=2_Apple_A1_someword.txt 2_Apple_A2_someword.txt
3_Apple=3_Apple_A1_someword.txt 3_Apple_A2_someword.txt

and so on...till

1000_Apple=1000_Apple_A1_someword.txt 1000_Apple_A2_someword.txt

Could you tell me a one-line Unix command that does this? Maybe using "awk" and "sed".

Upvotes: 0

Answers (7)

Vijay

Reputation: 67291

> ls -1 | perl -F_ -ane 'chomp;if($_=~m/Apple_A/){$X{$F[0]."_".$F[1]}=$X{$F[0]."_".$F[1]}." ".$_;}END{foreach (keys %X){print $_."=".$X{$_}."\n"}}'
3_Apple= 3_Apple_A1_someword.txt 3_Apple_A2_someword.txt
2_Apple= 2_Apple_A1_someword.txt 2_Apple_A2_someword.txt
1_Apple= 1_Apple_A1_someword.txt 1_Apple_A2_someword.txt

Upvotes: 1

par181

Reputation: 401

num=1
while [ "$num" -le 1000 ]
do
    # Print the label, then its two file names
    echo "${num}_Apple=${num}_Apple_A1_someword.txt ${num}_Apple_A2_someword.txt"
    num=$((num + 1))
done

Output:

1_Apple=1_Apple_A1_someword.txt 1_Apple_A2_someword.txt
2_Apple=2_Apple_A1_someword.txt 2_Apple_A2_someword.txt
3_Apple=3_Apple_A1_someword.txt 3_Apple_A2_someword.txt
4_Apple=4_Apple_A1_someword.txt 4_Apple_A2_someword.txt
5_Apple=5_Apple_A1_someword.txt 5_Apple_A2_someword.txt
...........

If the number 1000 is not fixed, you can get the value from the file itself. Sort numerically, so that 1000 comes after 999:

num=$(sort -n file | tail -1 | awk -F"_" '{print $1}')
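To see why the kind of sort matters when deriving the upper bound, here is a small sketch. The listing is generated in place and only runs up to 12, which is an assumption made to keep the example short; the variable names are hypothetical.

```shell
# Hypothetical listing with prefixes 1..12, two files per prefix.
listing=$(
    i=1
    while [ "$i" -le 12 ]; do
        printf '%s_Apple_A1_someword.txt\n%s_Apple_A2_someword.txt\n' "$i" "$i"
        i=$((i + 1))
    done
)

# Byte-wise (lexical) sort: the last line does not carry the largest number.
lexical_max=$(printf '%s\n' "$listing" | LC_ALL=C sort | tail -n 1 | awk -F_ '{print $1}')

# Numeric sort on the leading number: the last line carries the largest number.
numeric_max=$(printf '%s\n' "$listing" | sort -n | tail -n 1 | awk -F_ '{print $1}')

printf 'lexical: %s, numeric: %s\n' "$lexical_max" "$numeric_max"
# prints: lexical: 9, numeric: 12
```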

Thanks

Upvotes: 0

Janito Vaqueiro Ferreira Filho

Reputation: 5072

Using a sed script:

#!/bin/sed -nf

: loop
H
s/\([^_]*_[^_]*\)_.*/\1/g

t clear_flag
: clear_flag

$! {
    N
    s/^\([^_]*_[^\n]*\)\n\(\1[^\n]*\)$/\2/
    t loop
}

x
s/^\n//
s/\([^_]*_[^_]*\)_/\1=\1_/
s/\n/ /gp

s/.*//
x
D

I'll try to explain everything. First, we have a loop that joins together all files that start with the same prefix. Based on your examples, I defined the prefix as the part of the name that ends just before the second underscore. A loop is defined by a label, created with the ":" command. Here, we labeled our loop "loop". Further below, when necessary, we "jump" back to the start of the loop with the "t" (test) command.

The first command ("H") appends the line to the hold space (an auxiliary buffer). sed automatically prefixes the line with a newline ('\n') before appending it.
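The leading newline can be made visible with a small experiment. This is only a sketch on a single sample name: it appends the line to the empty hold space, swaps the buffers, and prints the result with sed's "l" command, which shows a newline as \n and marks the end of the buffer with $.

```shell
# Append one line to the (initially empty) hold space, swap it into the
# pattern space, and print it unambiguously with the "l" command.
held=$(printf '1_Apple_A1_someword.txt\n' | sed -n 'H;x;l')

printf '%s\n' "$held"
# prints: \n1_Apple_A1_someword.txt$
```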

The second command extracts the prefix. We do that by capturing a sequence of characters that aren't underscores ([^_]*), then an underscore, then more characters that aren't underscores. Because this pattern is between backslashed parentheses (\( and \)), sed will capture the input that matches this pattern and save it into an auxiliary variable, named \1 (because it is the first capture on that line). Then we skip an underscore followed by a sequence of any characters. The replacement is what we captured, so in reality we just removed everything after and including the second underscore.
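The prefix extraction can be tried in isolation. A minimal sketch, using one sample name from the question (the variable names are hypothetical):

```shell
# Sample name taken from the question's listing.
name='1_Apple_A1_someword.txt'

# Keep everything before the second underscore; drop the rest.
prefix=$(printf '%s\n' "$name" | sed 's/\([^_]*_[^_]*\)_.*/\1/')

printf '%s\n' "$prefix"
# prints: 1_Apple
```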

We now use a workaround to clear sed's internal flag, which records whether a substitution has succeeded since the last "t" command or since the start of the script. The test command ("t") branches (jumps) to a label if a substitution command has succeeded, and then clears the internal flag. This is necessary for our second "t" command further below. Whether it branches or not, execution continues right after the "clear_flag" label.

Now we use the "{" command to start a group of commands. However, we have an address prefix before it, which sed uses to determine whether it should run these commands. In our case, the group is only executed if the line just read wasn't the last line of input (the dollar symbol "$" represents the last input line, and the "!" represents negation).

The first command in the group will append the next line from the input into the current pattern space (ie. working buffer). The previous line and the new line are separated by a newline character (\n).
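What the pattern space looks like right after "N" can be checked with a small sketch. It assumes the previous step has already reduced the first line to its prefix, so the input below is the prefix followed by the next file name:

```shell
# Join two input lines with N and print the pattern space with "l",
# which shows the embedded newline as \n and the end of the buffer as $.
joined=$(printf '1_Apple\n1_Apple_A2_someword.txt\n' | sed -n 'N;l')

printf '%s\n' "$joined"
# prints: 1_Apple\n1_Apple_A2_someword.txt$
```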

The second command in the group checks whether the newly read line starts with our prefix and, if so, removes the isolated prefix (i.e. the previous line). Because we removed the second underscore from the prefix we kept on the previous line, and because we appended a new line, the isolated prefix now ends before the newline character. Therefore the captured pattern now reads characters that aren't newlines ([^\n]*) after the underscore. After we capture the isolated prefix, we skip the newline character separating the previous and the new line, then we start another capture (stored in \2, because it is the second capture on this line). This capture will (hopefully) match the second line. "Hopefully" because we require that the match starts exactly with what was matched in the first capture (that's why the first thing in the second capture is the back-reference to the first capture, i.e. \1). After that, we match a sequence of characters that aren't newlines, and after the second capture we expect the end of the pattern space.

If this last substitution command succeeds, we have discovered that the newly read line also has the same prefix, so we must now jump back to the start of the loop. That's the function of the "t" command. It will test if any substitution commands succeeded since the last "t" command, and if so, branch to a given label. In our case, we branch (jump) back to the "loop" label. Now we can see why we needed the previous "t" workaround. Without it, the first substitute command might succeed while the one we're actually interested in might fail, and "t" would still branch back to the "loop" label.

If it leaves the loop, it means that the newly read line doesn't have the same prefix. Therefore we can now print what was matched before.

We start by swapping the contents of the pattern space with the contents of the hold space using the exchange ("x") command. Now our pattern space contains all files that had the same prefix, and our hold space contains the current prefix on an isolated line, followed by a line with the first file that doesn't share that prefix.

Since we previously appended all of the file names to the hold space, the file names are separated by newlines, and since the first file name was also appended, the first byte in the current pattern space is a newline character. To remove it, we simply replace it with nothing.

Now we have to generate the format of the assignment. That's why we have a familiar substitute command: we're again extracting the prefix, except that now we've removed the .* in order to keep the rest of the line intact. The replacement includes the captured prefix, an equals sign, and then what the match consumed from the first file name in the pattern space: its prefix and its underscore.

We're almost ready to print out the line, but the file names are still separated by newline characters. Therefore we substitute all newlines with spaces (the g flag tells sed to repeat the substitution on the pattern space as many times as it can). Since the line is now ready, we add the p flag to tell sed to print it.
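The two formatting substitutions can be tried on their own. This sketch feeds in one hypothetical group of two file names and uses N to put both into the pattern space, which stands in for the state the script is in at this point:

```shell
# One group of file names, newline-separated, as the script holds them.
group='1_Apple_A1_someword.txt
1_Apple_A2_someword.txt'

# Insert "prefix=" in front of the first name, then turn newlines into spaces.
line=$(printf '%s\n' "$group" | sed -n 'N;s/\([^_]*_[^_]*\)_/\1=\1_/;s/\n/ /gp')

printf '%s\n' "$line"
# prints: 1_Apple=1_Apple_A1_someword.txt 1_Apple_A2_someword.txt
```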

The last steps prepare the script to start again for the next prefix. The hold space must be empty so it can be used to store the file names that have the new prefix. That's why we have a command to replace every character in the pattern space with nothing, followed by an exchange command.

The hold space is ready. Now we must prepare the pattern space. It must contain only the first line of the file name with the new prefix. To be in that state, all we have to do is remove the old prefix, which is stored in the first line. We could do something like s/.*\n// to replace all characters except for the characters of the last line (which contains the file name with the new prefix), but the D command will do that and force the script to start executing again without reading another line, so it saves us some typing.

Although the script might be a little cryptic and the description overwhelming, once you understand what happens, it starts to become simple(r) =)

Something that must be mentioned: the input must be sorted (or at least the files with the same prefixes must be grouped together).
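One way to get such an ordering is to sort the listing before feeding it to the script. A sketch, assuming the names follow the question's number_Apple_A1/A2 pattern; the sample list below is hypothetical and stands in for the output of ls:

```shell
# Hypothetical unsorted listing.
list='2_Apple_A2_someword.txt
10_Apple_A1_someword.txt
1_Apple_A2_someword.txt
2_Apple_A1_someword.txt
10_Apple_A2_someword.txt
1_Apple_A1_someword.txt'

# Sort numerically on the first underscore-separated field,
# then on the third field (A1 before A2) within each prefix.
sorted=$(printf '%s\n' "$list" | sort -t_ -k1,1n -k3,3)

printf '%s\n' "$sorted"
```

With this ordering, files sharing a prefix are adjacent and the groups appear in numeric order (1, 2, 10) rather than lexical order.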

Hope this helps!

Upvotes: 1

potong

Reputation: 58483

This might work for you (GNU sed):

sed '$!N;s/^\(\(.*\)_.*_.*\)\n/\2=\1 /' file

Upvotes: 1

Beta

Reputation: 99134

Using sed:

sed 'N;s/\n/ /;s/\([^_]*_Apple\)/\1=\1/'

Upvotes: 0

Gilles Quénot

Reputation: 185530

Using a short awk one-liner:

awk -F'_' '{if (NR % 2) {printf("%s_%s=%s ", $1, $2, $0)} else {print}}' FILE

Upvotes: 0

choroba

Reputation: 241988

Using Perl:

perl -pe 'if ($. % 2) { /([0-9]+_Apple)/ and print "$1="; s/\s+$/ /; }'

On odd lines, match the ...Apple, output it with =, and replace the whitespace at the end of line with a single space.

Note: Unix variables cannot have names starting with numbers.
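A quick way to see this in a POSIX shell is sketched below. The "f_" prefix used as a workaround is purely an assumption for illustration, not something the question asked for:

```shell
# A name starting with a digit is not parsed as an assignment;
# the shell tries to run the whole word as a command and fails.
if sh -c '1_Apple=value' 2>/dev/null; then
    digit_first=accepted
else
    digit_first=rejected
fi

# Hypothetical workaround: prepend a letter so the label is a valid name.
f_1_Apple='1_Apple_A1_someword.txt 1_Apple_A2_someword.txt'

printf '%s\n' "$digit_first"
printf '%s\n' "$f_1_Apple"
```

If the generated file is only read by another program as plain key=value configuration, rather than sourced by a shell, the digit-first labels may be fine as they are.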

Upvotes: 0
