Learner

Reputation: 757

How can I replace text in 10000 lines?

I have over 10000 such files, and I am trying to generate code for them from a template.

my strings are like this

"MLKT_3C_AAAU_01A" 
"MLKT_3C_AALI_01A"
"MLKT_3C_AALJ_01A" 
"MLKT_3C_AALK_01A"
"MLKT_4H_AAAK_01A"

I am trying to convert them to this

names(MLKT_3C_AAAU_01A)[2] <- '3C_AAAU_01A' df<- full_join(df,MLKT_3C_AAAU_01A, by = 'V1')
names(MLKT_3C_AALI_01A)[2] <- '3C_AALI_01A' df<- full_join(df,MLKT_3C_AALI_01A, by = 'V1')
names(MLKT_3C_AALJ_01A)[2] <- '3C_AALJ_01A' df<- full_join(df,MLKT_3C_AALJ_01A, by = 'V1')
names(MLKT_3C_AALK_01A)[2] <- '3C_AALK_01A' df<- full_join(df,MLKT_3C_AALK_01A, by = 'V1')
names(MLKT_4H_AAAK_01A)[2] <- '4H_AAAK_01A' df<- full_join(df,MLKT_4H_AAAK_01A, by = 'V1')

The best way I have come across so far is to use a text editor and edit them one by one. Is there a way in bash to take the strings above and convert them to the example I provided?

Before I start, I remove the quotation marks from each line:

sed 's/\"//g' example.txt > exampleout.txt

First I try to add names( at the beginning of each line. Let's say the file that holds those strings, one per line, is called exampleout.txt. This gives me names( three times instead of once:

awk '$0="names("$0' exampleout.txt > myout.txt

Then I try to paste )[2] <- '' df<- full_join(df,, by = 'V1') at the end of each line using the following

sed -e 's/$/)[2] <- '' df<- full_join(df,, by = 'V1') /' myout.txt > myout2.txt

so it led me to this

names(MLKT_3C_AAAU_01A )[2] <-  df<- full_join(df,, by = V1) 
names(MLKT_3C_AALI_01A)[2] <-  df<- full_join(df,, by = V1) 
names(MLKT_3C_AALJ_01A )[2] <-  df<- full_join(df,, by = V1) 
names(MLKT_3C_AALK_01A)[2] <-  df<- full_join(df,, by = V1) 
names(MLKT_4H_AAAK_01A)[2] <-  df<- full_join(df,, by = V1) 

Upvotes: 0

Views: 113

Answers (4)

TrebledJ

Reputation: 8987

You can actually do it all in one command. The script below is similar to sed, only I've chosen to use perl to exploit non-greedy matching (.*?_(.*)) to separate the first underscored field.

perl -pe "s/^\"(.*?_(.*))\"$/names(\1)[2] <- '\2' df <- full_join(df, \1, by = 'V1')/" example.txt

Here, I've captured two strings.

  1. Everything inside the double-quotes, and
  2. Everything after the first underscore.

For instance, in "MLKT_3C_AAAU_01A", the first capture would be MLKT_3C_AAAU_01A and the second capture would be 3C_AAAU_01A.

Afterwards, the appropriate substitutions are made.
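
To see what the two groups capture on one sample line, here is a minimal sketch. It swaps the non-greedy `.*?_` for `[^_]*_` so that plain `sed -E` can be used instead of perl; the `first=`/`second=` labels and the variable names are only for illustration.

```shell
# Sample line taken from the question.
line='"MLKT_3C_AAAU_01A"'

# [^_]*_ stands in for the non-greedy .*?_ ; group 1 is everything inside
# the double quotes, group 2 is everything after the first underscore.
captures=$(printf '%s\n' "$line" | sed -E 's/^"([^_]*_(.*))"$/first=\1 second=\2/')
printf '%s\n' "$captures"
```

This should print `first=MLKT_3C_AAAU_01A second=3C_AAAU_01A`. Note that a line with trailing whitespace after the closing quote would not match the `"$` anchor and would pass through unchanged.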


If the field preceding the first underscore is a constant (e.g. MLKT), you could use sed, replacing the non-greedy match with the constant.

sed -E "s/^\"(MLKT_(.*))\"$/names(\1)[2] <- '\2' df <- full_join(df, \1, by = 'V1')/" test.txt

Note the use of the -E flag (for extended regexes/easier group-capturing) and the use of double quotes (for using single-quotes as part of the replacement).

Upvotes: 3

Ed Morton

Reputation: 203229

$ awk -F'"' '{
    x=$2; sub(/^[^_]+_/,"",x)
    printf "names(%s)[2] <- \047%s\047 df<- full_join(df,%s, by = \047V1\047)\n", $2, x, $2
}' file
names(MLKT_3C_AAAU_01A)[2] <- '3C_AAAU_01A' df<- full_join(df,MLKT_3C_AAAU_01A, by = 'V1')
names(MLKT_3C_AALI_01A)[2] <- '3C_AALI_01A' df<- full_join(df,MLKT_3C_AALI_01A, by = 'V1')
names(MLKT_3C_AALJ_01A)[2] <- '3C_AALJ_01A' df<- full_join(df,MLKT_3C_AALJ_01A, by = 'V1')
names(MLKT_3C_AALK_01A)[2] <- '3C_AALK_01A' df<- full_join(df,MLKT_3C_AALK_01A, by = 'V1')
names(MLKT_4H_AAAK_01A)[2] <- '4H_AAAK_01A' df<- full_join(df,MLKT_4H_AAAK_01A, by = 'V1')

Upvotes: 0

tripleee

Reputation: 189327

Replacing a regex match with something is easily done with sed.

sed 's/^"\(MLKT_\([^"]*\)\)"$/things with \1 and even \2 in it/' file >newfile

The expression \1 in the replacement text corresponds to the first parenthesized group in the regular expression, and \2 corresponds to the second. So if you matched MLKT_1234 then \1 will be the entire string, and \2 will be 1234.

If you need single quotes in the replacement, you have to unwrap them somehow. Perhaps the simplest mechanical replacement is to express each literal single quote as '\''. That is a closing single quote for the single-quoted string you are in, then a literal unquoted but backslashed single quote, and then an opening single quote to continue single-quoting the text which follows.
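
As a sketch of that quoting applied to the replacement the question asks for (assuming every line has the form "MLKT_..." with nothing after the closing double quote), each literal single quote in the output is written as `'\''`:

```shell
# One sample line from the question, piped through sed.
out=$(printf '%s\n' '"MLKT_3C_AAAU_01A"' |
    sed 's/^"\(MLKT_\([^"]*\)\)"$/names(\1)[2] <- '\''\2'\'' df<- full_join(df,\1, by = '\''V1'\'')/')
printf '%s\n' "$out"
```

This should print `names(MLKT_3C_AAAU_01A)[2] <- '3C_AAAU_01A' df<- full_join(df,MLKT_3C_AAAU_01A, by = 'V1')`.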

For any nontrivial replacements, though, perhaps you want to investigate Awk, which is somewhat more human-readable.

awk '{ # replace double quotes with nothing
    sub(/^"/, ""); sub(/"$/, "");
    # Now you can use $0 to refer to the remaining string
    # You can replace single quotes with \047
    print "names(" $0 ")[2] <- \047" \
        substr($0, 6) "\047 df<- full_join(df," \
        randomstring ", by = \047V1\047)" }' file >newfile

If randomstring comes from a second file, there's a common Awk pattern for joining values from two files (google for NR==FNR).
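
A minimal sketch of the NR==FNR pattern follows. It assumes a hypothetical second file, map.txt, whose lines pair each name with a value; the file names, the two-column layout, and the NA fallback are all invented for illustration and do not come from the question.

```shell
# Hypothetical inputs, created in a temporary directory for illustration only.
tmp=$(mktemp -d)
printf '%s\n' 'MLKT_3C_AAAU_01A extra1' 'MLKT_3C_AALI_01A extra2' > "$tmp/map.txt"
printf '%s\n' '"MLKT_3C_AAAU_01A"' '"MLKT_3C_AALI_01A"' > "$tmp/names.txt"

joined=$(awk '
    # First file: NR == FNR is true only while reading it; store key -> value.
    NR == FNR { map[$1] = $2; next }
    # Second file: strip the double quotes, then look the name up in the map.
    { gsub(/"/, ""); print $0, (($0 in map) ? map[$0] : "NA") }
' "$tmp/map.txt" "$tmp/names.txt")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

The looked-up value could then take the place of randomstring in the print statement above.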

Upvotes: 2

RavinderSingh13

Reputation: 133458

Could you please try the following?

awk -v s1="'" '
match($0,/[a-zA-Z][^"]*/){
  val=substr($0,RSTART,RLENGTH)
  split(val,array,"_")
  print "names(" val ")[2] <- " s1 array[2]"_"array[3]"_"array[4] s1 " df<- full_join(df," val", by = " s1 "V1" s1")"
}'  Input_file

Output will be as follows.

names(MLKT_3C_AAAU_01A)[2] <- '3C_AAAU_01A' df<- full_join(df,MLKT_3C_AAAU_01A, by = 'V1')
names(MLKT_3C_AALI_01A)[2] <- '3C_AALI_01A' df<- full_join(df,MLKT_3C_AALI_01A, by = 'V1')
names(MLKT_3C_AALJ_01A)[2] <- '3C_AALJ_01A' df<- full_join(df,MLKT_3C_AALJ_01A, by = 'V1')
names(MLKT_3C_AALK_01A)[2] <- '3C_AALK_01A' df<- full_join(df,MLKT_3C_AALK_01A, by = 'V1')
names(MLKT_4H_AAAK_01A)[2] <- '4H_AAAK_01A' df<- full_join(df,MLKT_4H_AAAK_01A, by = 'V1')

Upvotes: 2
