Parse a file under Linux

Question

I'm trying to compute the popularity of news articles based on Twitter data. However, while retrieving the tweets I forgot to escape the delimiter characters, so I ended up with an unusable file.

Here is a line from the file:

1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews

The '$,$' pattern occurs not only as the field delimiter but also inside the tweet text, where I want to remove it. A correct line would be:

1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80000 up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews

I tried to use cut and sed but I'm not getting the results I want. What would be a good strategy to solve this?

janos · Accepted Answer

If we can assume that there are never extra separators in the time, id, retweets, username, and link fields, then you could take the middle part and remove all $,$ from it, for example like this:

perl -ne 'chomp; @a=split(/\$,\$/); $_ = join("", @a[4..($#a-1)]); print join("\$,\$", @a[0..3], $_, $a[$#a]), "\n"' < data.txt

What this does:

splits the line using $,$ as delimiter
takes the middle part, fields[4] through the second-to-last field, and concatenates it with no separator, which removes the stray $,$
joins again by $,$ the first 4 fields, the fixed middle part, and the last field (the link)
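
For readers who would rather use a script than a one-liner, the same three steps can be sketched in Python. This is only an illustration under the same assumption as above (the first four fields and the final link never contain the separator); the function name fix_line and its lead/trail parameters are made up for this example and do not come from the Perl command.

```python
SEP = "$,$"


def fix_line(line: str, lead: int = 4, trail: int = 1) -> str:
    """Remove stray separators from the tweet field of one line.

    Assumes the first `lead` fields and the last `trail` fields are clean,
    so everything in between belongs to the tweet text.
    """
    line = line.rstrip("\n")
    parts = line.split(SEP)
    if len(parts) < lead + trail + 1:
        # Fewer fields than expected: return the line unchanged.
        return line
    head = parts[:lead]
    tail = parts[len(parts) - trail:]
    # Concatenate the middle pieces with no separator between them.
    tweet = "".join(parts[lead:len(parts) - trail])
    return SEP.join(head + [tweet] + tail)
```

Applied to a whole file, this could be as simple as calling print(fix_line(line)) for each line read from data.txt.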

This works with your example, but I don't know what other corner cases you might have.

A good way to validate the result is to check that every line splits into exactly 6 fields (that is, it contains 5 occurrences of $,$). You can do that by piping the result to this:

... | perl -ne 'print scalar split(/\$,\$/), "\n"' | sort -u

(should output a single line, with "6")
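
The same check can be sketched in Python, with the difference that it also reports how many lines have each field count, which helps when some lines are still broken. The helper name field_count_histogram is hypothetical and chosen only for this example.

```python
from collections import Counter

SEP = "$,$"


def field_count_histogram(lines):
    """Return a Counter mapping 'number of fields' to 'number of lines'."""
    return Counter(len(line.rstrip("\n").split(SEP)) for line in lines)
```

On a fully repaired file the returned Counter should contain a single key, 6; any other key points to lines that need a closer look.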
