antimuon

Reputation: 262

Extract part of a file path between patterns using awk

I am trying to extract data from a list of file paths, because I want to create a log of files that have completed loading. The problem is that the file paths are inconsistent, so I need to extract the part of each path that lies between two regex patterns.

For example, say I want to pull out two pieces of information: the data between /system/.../ and the data between /data/.../sales/

/user/project-x/system/ibm/nyc/data/customers/sales/yyyymmdd
/user/project-x/system/mysql/data/regional/sales/yyyymmdd
/user/project-x/system/mysql/london/data/customers/sales/yyyymmdd
/user/project-x/system/oracle/data/tokyo-customers/Sales/yyyymmdd

So when I run the awk script I would be left with...

ibm      customers
mysql    regional
mysql    customers
oracle   tokyo-customers

Is there any way to do that type of file path splitting?

Upvotes: 2

Views: 1695

Answers (2)

Jean-François Fabre

Reputation: 140178

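In your sample, the system name is always the fifth slash-separated field and the data name is always the third field from the end, so there is no need for regexes. Field separation does the trick:

awk -F/ '{print $5,$(NF-2)}' test.txt

(where test.txt is your input file)

This tells awk to use the slash as the field separator and to print field 5 and the field two positions before the last one. Because every path starts with a slash, field 1 is empty and user is field 2.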
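To check which field number holds which path component, a small sketch like the following prints every field of one sample path together with its index. The path is taken from the question; the variable name numbered is only for illustration. Note that the leading slash makes field 1 empty.

```shell
# Sample path copied from the question
path='/user/project-x/system/ibm/nyc/data/customers/sales/yyyymmdd'

# Print "index=value" for every slash-separated field
numbered=$(printf '%s\n' "$path" |
  awk -F/ '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }')

printf '%s\n' "$numbered"
```

Running this against a few representative lines of the input shows quickly whether a component really sits at the same position in every path.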

But to answer your question as asked, with a lookup by field name, do this (it is more complicated, though):

awk -F/ '{a="???";b="???";for (i=1;i<=NF;i++) {if (tolower($i)=="system") a= $(i+1); if (($i=="data") && (tolower($(i+2))=="sales")) b = $(i+1)}; print a,b}' test.txt

This splits the fields as before, then scans them: when a field is system, it saves the next field; when a field is data and the field two positions later is sales, it saves the field in between. This works even if the fields are not at fixed positions. If a pattern is nowhere to be found, ??? is displayed in its place.

I have included lowercase conversion because Sales appears in mixed case in one of the paths.
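For readability, the same lookup can be laid out over several lines. This is only a sketch of the idea above: the function name extract_fields and the variable names sys and data are made up for illustration, and it reads the paths from standard input rather than from test.txt.

```shell
# Hypothetical helper: reads paths on stdin, prints "system data" per line
extract_fields() {
  awk -F/ '
  {
    sys = "???"; data = "???"    # placeholders used when a marker is missing
    for (i = 1; i <= NF; i++) {
      if (tolower($i) == "system")
        sys = $(i + 1)           # field right after "system"
      if (tolower($i) == "data" && tolower($(i + 2)) == "sales")
        data = $(i + 1)          # field between "data" and "sales"
    }
    print sys, data
  }'
}

# Two sample paths from the question
printf '%s\n' \
  '/user/project-x/system/ibm/nyc/data/customers/sales/yyyymmdd' \
  '/user/project-x/system/oracle/data/tokyo-customers/Sales/yyyymmdd' |
  extract_fields
```

Reading a field past the last one, such as $(i + 2) near the end of the line, simply yields an empty string in awk, so no extra bounds check is needed.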

Upvotes: 4

heemayl

Reputation: 42017

With sed:

sed -E 's_.*/system/([^/]+).*/data/([^/]+)/[Ss]ales/.*_\1 \2_'
  • .*/system/([^/]+).* matches the portion after /system/ up to the next / and puts it in captured group 1

  • /data/([^/]+)/[Ss]ales/ matches the portion between /data/ and /sales/ (or /Sales/) and puts it in captured group 2

  • In the replacement, the captured groups are used, separated by a space

Example:

$ cat file.txt
/user/project-x/system/ibm/nyc/data/customers/sales/yyyymmdd
/user/project-x/system/mysql/data/regional/sales/yyyymmdd
/user/project-x/system/mysql/london/data/customers/sales/yyyymmdd
/user/project-x/system/oracle/data/tokyo-customers/Sales/yyyymmdd

$ sed -E 's_.*/system/([^/]+).*/data/([^/]+)/[Ss]ales/.*_\1 \2_' file.txt
ibm customers
mysql regional
mysql customers
oracle tokyo-customers
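One caveat: with a plain s command, a line that does not match the pattern is printed unchanged. If that is not wanted, a possible variant (a sketch, not part of the command above) adds -n and the p flag so that only lines where the substitution succeeded are printed. The second input line below, /some/other/path, is made up for illustration.

```shell
# -n suppresses automatic printing; the trailing p prints only substituted lines
result=$(printf '%s\n' \
  '/user/project-x/system/ibm/nyc/data/customers/sales/yyyymmdd' \
  '/some/other/path' |
  sed -nE 's_.*/system/([^/]+).*/data/([^/]+)/[Ss]ales/.*_\1 \2_p')

printf '%s\n' "$result"
```

Whether non-matching paths should be dropped or passed through depends on how the log is meant to be used, so treat this as an option rather than a fix.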

Upvotes: 1
