Reputation: 1787
I am trying to 'follow-tcp-stream' in a Hadoop sequence file that is structured as follows:
i. Time stamp as key
ii. Raw Ethernet frame as value
The file contains a single TCP session, and because the session is very long, the TCP sequence number overflows (wraps around). This means sequence numbers are not necessarily unique, so the data cannot simply be sorted by sequence number or it will get scrambled.
I use Apache Spark/Python/Scapy.
To reconstruct the TCP stream, I intended to:
1.) Filter out any frames that are not TCP frames carrying data
2.) Sort the RDD by TCP-sequence-ID (within each overflow cycle)
3.) Remove any duplicates of sequence-ID (within each overflow cycle)
4.) Map each element to TCP data
5.) Store the resulting RDD as a text file in HDFS
Illustration of operation on RDD:
input:
[(time:100, seq:1), (time:101, seq:21), (time:102, seq:11), (time:103, seq:21), ... , (time:1234, seq:1000), (time:1235, seq:2), (time:1236, seq:30), (time:1237, seq:18)]
output:
[(seq:1, time:100), (seq:11, time:102), (seq:21, time:101), ... , (seq:1000, time:1234), (seq:2, time:1235), (seq:18, time:1237), (seq:30, time:1236)]
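For steps 1 and 4 I have something along these lines in mind (a rough sketch only; the input path, the `sc` context and the exact way the frame bytes come out of the sequence file are placeholders/assumptions):

    from scapy.all import Ether, TCP, Raw

    def parse_frame(record):
        # record is a (timestamp, raw-frame-bytes) pair from the sequence file
        timestamp, frame_bytes = record
        pkt = Ether(bytes(frame_bytes))
        if TCP in pkt and Raw in pkt:          # step 1: keep only TCP frames that carry data
            return [(pkt[TCP].seq, (timestamp, bytes(pkt[Raw].load)))]
        return []                              # everything else is dropped

    rdd = sc.sequenceFile("hdfs:///path/to/capture.seq")   # (timestamp, frame) records; path is a placeholder
    tcp_data = rdd.flatMap(parse_frame)                    # steps 1 and 4: (seq, (timestamp, payload)) pairs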
Steps 1 and 4 are obvious (sketched above). The approaches I came up with for steps 2 and 3 require comparing adjacent elements within the RDD, with the option to return any number of new elements (not necessarily 2), and without triggering an action, of course, so that the code still runs in parallel. Is there any way to do this? I went over the RDD class methods a few times and nothing came up.
Another issue is the storage of the RDD (step 5). Is it done in parallel? Does each node store its part of the RDD to a different Hadoop block? Or is the data first forwarded to the Spark client app, which then stores it?
Upvotes: 0
Views: 275
Reputation: 7452
It seems like what you're looking for might be best done with something like reduceByKey, where you can remove the duplicates as you go for each sequence number (assuming that the resulting amount of data per sequence number isn't too large; in your example it seems pretty small). Sorting the results can be done with the standard sortBy operator.
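A minimal sketch of that idea, assuming an RDD of (seq, (timestamp, payload)) pairs as in your illustration (the names are mine):

    # Dedupe per sequence number, then order by it.
    deduped = tcp_data.reduceByKey(lambda a, b: a if a[0] <= b[0] else b)  # keep the earliest timestamp per seq
    ordered = deduped.sortBy(lambda kv: kv[0])                             # order by sequence number

Note that this sorts on the raw sequence number, so the wrap-around you describe would still need to be handled separately, e.g. per overflow cycle.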
Saving the data out to HDFS is indeed done in parallel on the workers; forwarding the data to the Spark client app would create a bottleneck and sort of defeat the purpose (although if you do want to bring the data back locally, you can use collect, provided that the data is pretty small).
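For example (continuing the sketch above; the output path is a placeholder):

    # Each worker writes its own partitions straight to HDFS as part-* files,
    # so nothing is funnelled through the driver.
    ordered.saveAsTextFile("hdfs:///path/to/output")

    # Only if the result is small enough to fit in the driver's memory:
    local_copy = ordered.collect()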
Upvotes: 1