Catalin

Reputation: 122

Parse a file under Linux

I'm trying to compute the popularity of news articles based on Twitter data. However, while retrieving the tweets I forgot to escape the delimiter characters, and ended up with an unusable file.

Here is a line from the file:

1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews

The '$,$' pattern occurs not only as a field delimiter but also inside the tweet text, where I want to remove it. A correct line would be:

1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80000 up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews

I tried to use cut and sed but I'm not getting the results I want. What would be a good strategy to solve this?

Upvotes: 2

Views: 111

Answers (1)

janos

Reputation: 124646

If we can assume that there are never extra separators in the time, id, retweets, username, and link fields, then you could take the middle part and remove all $,$ from it, for example like this:

perl -ne 'chomp; @a=split(/\$,\$/); $_ = join("", @a[4..($#a-1)]); print join("\$,\$", @a[0..3], $_, $a[$#a]), "\n"' < data.txt

What this does:

  1. splits the line using $,$ as delimiter
  2. takes the middle part (fields[4] through the second-to-last field) and concatenates it without any separator, which removes the stray $,$
  3. joins again by $,$ the first 4 fields, the fixed middle part, and the last field (the link)

This works with your example, but I don't know what other corner cases you might have.
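
For readers who prefer a script over a one-liner, the same split-and-rejoin idea can be sketched in Python. This is only an illustration under the same assumption as above (no stray separators in the first four fields or in the link); the name fix_line and its head/tail parameters are made up for this example and are not part of the Perl command.

```python
SEP = "$,$"  # delimiter used in the question's file


def fix_line(line, sep=SEP, head=4, tail=1):
    """Remove separators that leaked into the tweet text.

    Assumes the first `head` fields and the last `tail` fields never
    contain `sep`; everything in between is treated as the tweet.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) < head + tail + 1:
        raise ValueError(
            f"expected at least {head + tail + 1} fields, got {len(fields)}"
        )
    cut = len(fields) - tail
    tweet = "".join(fields[head:cut])  # glue the tweet pieces back together
    return sep.join(fields[:head] + [tweet] + fields[cut:])


# Shortened, hypothetical line with the same shape as the one in the question.
broken = "1$,$2$,$0$,$user$,$tops $80$,$000$,$ up$,$http://example.com"
print(fix_line(broken))
# prints: 1$,$2$,$0$,$user$,$tops $80000 up$,$http://example.com
```

A line that is already correct (exactly 6 fields) passes through unchanged, so the function should be safe to run over the whole file.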

A good way to validate the result is to check that every line splits into exactly 6 fields, that is, contains exactly 5 occurrences of $,$. You can do that by piping the result to this:

... | perl -ne 'print scalar split(/\$,\$/), "\n"' | sort -u

(should output a single line, with "6")
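
The same check can be sketched in Python by collecting the distinct field counts over all lines. The helper name field_counts and the sample lines are hypothetical, chosen only for this illustration.

```python
def field_counts(lines, sep="$,$"):
    """Return the set of distinct field counts seen across `lines`."""
    return {len(line.rstrip("\n").split(sep)) for line in lines}


sample = [
    "1$,$2$,$0$,$user$,$hello$,$http://example.com\n",
    "3$,$4$,$1$,$other$,$tops $80$,$000$,$http://example.com\n",  # still broken
]
print(field_counts(sample[:1]))  # prints: {6}
print(field_counts(sample))      # more than one value means some lines are off
```

One difference to keep in mind: Perl's split drops trailing empty fields by default, while Python's str.split keeps them, so a line with an empty link would be counted differently by the two versions.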

Upvotes: 4
