Reputation: 122
I'm trying to compute news article popularity based on Twitter data. However, while retrieving the tweets I forgot to escape the delimiter characters, and ended up with an unusable file.
Here is a line from the file:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
The '$,$' pattern occurs not only as a field delimiter but also inside the tweet text, where I want to remove it. A correct line would be:
1369283975$,$337427565662830592$,$0$,$username$,$Average U.S. 401(k) balance tops $80000 up 75 pct since 2009 http://t.co/etHHMUFpoo #news$,$http://www.reuters.com/article/2013/05/23/funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews
I tried to use cut and sed but I'm not getting the results I want. What would be a good strategy to solve this?
Upvotes: 2
Views: 111
Reputation: 124646
If we can assume that there are never extra separators in the time, id, retweets, username, and link fields, then you could take the middle part and remove all '$,$' from it, for example like this:
perl -ne 'chomp; @a=split(/\$,\$/); $_ = join("", @a[4..($#a-1)]); print join("\$,\$", @a[0..3], $_, $a[$#a]), "\n"' < data.txt
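What this does:
- Split the line using '$,$' as delimiter
- Join the middle fields (everything from the fifth field up to, but not including, the last one) with nothing in between, which removes the stray '$,$'
- Print, joined with '$,$', the first 4 fields, the fixed middle part, and the last field (the link)
This works with your example, but I don't know what other corner cases you might have.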
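If the file needs more handling than a one-liner comfortably allows, the same split, join, and re-join logic could be sketched in Python. This is only a sketch under the same assumption as above (the first four fields and the final link field never contain the delimiter). The names repair_line, DELIM, and HEAD_FIELDS are made up for illustration.

```python
DELIM = "$,$"
HEAD_FIELDS = 4  # time, id, retweets, username (assumed to be delimiter-free)


def repair_line(line: str) -> str:
    """Remove stray delimiters from the tweet text of one record."""
    parts = line.rstrip("\n").split(DELIM)
    if len(parts) < HEAD_FIELDS + 2:
        # Too few fields to hold both a tweet and a link: return it unchanged.
        return line.rstrip("\n")
    head = parts[:HEAD_FIELDS]
    tweet = "".join(parts[HEAD_FIELDS:-1])  # glue the split tweet back together
    link = parts[-1]
    return DELIM.join(head + [tweet, link])


# The broken line from the question, used as sample input.
broken = (
    "1369283975$,$337427565662830592$,$0$,$username$,$"
    "Average U.S. 401(k) balance tops $80$,$000$,$ up 75 pct since 2009 "
    "http://t.co/etHHMUFpoo #news$,$"
    "http://www.reuters.com/article/2013/05/23/"
    "funds-fidelity-401k-idUSL2N0E31ZC20130523?feedType=RSS&feedName=marketsNews"
)
print(repair_line(broken))
```

Unlike the Perl version, this leaves lines with fewer than six fields untouched, which is a design choice of the sketch rather than something the original command does.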
A good way to validate the result is to check that the number of fields separated by '$,$' is 6 on all lines. You can do that by piping the result to this:
... | perl -ne 'print scalar split(/\$,\$/), "\n"' | sort -u
(should output a single line, with "6")
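The same check could be sketched in Python, again only as an illustration: it collects the distinct field counts, so a fully repaired file should yield just 6. The helper name field_counts is hypothetical, and the two sample records (including the example.com links) are invented test data, not lines from the real file.

```python
DELIM = "$,$"


def field_counts(lines):
    """Return the set of distinct field counts found across non-blank lines."""
    return {
        len(line.rstrip("\n").split(DELIM))
        for line in lines
        if line.strip()  # skip blank lines (a choice made for this sketch)
    }


# Invented sample data: one clean record and one that still has a stray delimiter.
sample = [
    "t$,$id$,$0$,$user$,$tweet text$,$http://example.com/a",
    "t$,$id$,$0$,$user$,$still $,$ broken$,$http://example.com/b",
]
print(sorted(field_counts(sample)))
```

Any value other than 6 in the result points at lines that still need repair.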
Upvotes: 4