Reputation: 1787
I am trying to 'follow-tcp-stream' in a Hadoop sequence file that is structured as follows:
i. Time stamp as key
ii. Raw Ethernet frame as value
The file contains a single TCP session, and because the session is very long, the TCP sequence number overflows (wraps around). This means sequence numbers are not necessarily unique, so the data cannot simply be sorted by sequence number or it will get scrambled.
I use Apache Spark/Python/Scapy.
To reconstruct the TCP stream, I intended to:
1.) Filter out any frames that are not TCP frames carrying data
2.) Sort the RDD by TCP-sequence-ID (within each overflow cycle)
3.) Remove any duplicates of sequence-ID (within each overflow cycle)
4.) Map each element to TCP data
5.) Store the resulting RDD as a text file in HDFS
Illustration of operation on RDD:
input:
[(time:100, seq:1), (time:101, seq:21), (time:102, seq:11), (time:103, seq:21), ... , (time:1234, seq:1000), (time:1235, seq:2), (time:1236, seq:30), (time:1237, seq:18)]
output:
[(seq:1, time:100), (seq:11, time:102), (seq:21, time:101), ... , (seq:1000, time:1234), (seq:2, time:1235), (seq:18, time:1237), (seq:30, time:1236)]
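For steps 1 and 4 I have something along these lines in mind (a rough sketch only; the input path, the `sc` context and the exact way the frame bytes come out of the sequence file are placeholders/assumptions):

    from scapy.all import Ether, TCP, Raw

    def parse_frame(record):
        # record is a (timestamp, raw-frame-bytes) pair from the sequence file
        timestamp, frame_bytes = record
        pkt = Ether(bytes(frame_bytes))
        if TCP in pkt and Raw in pkt:          # step 1: keep only TCP frames that carry data
            return [(pkt[TCP].seq, (timestamp, bytes(pkt[Raw].load)))]
        return []                              # everything else is dropped

    rdd = sc.sequenceFile("hdfs:///path/to/capture.seq")   # (timestamp, frame) records; path is a placeholder
    tcp_data = rdd.flatMap(parse_frame)                    # steps 1 and 4: (seq, (timestamp, payload)) pairs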
Steps 1 and 4 are obvious (sketched above). The approaches I came up with for steps 2 and 3 require comparing adjacent elements within the RDD, with the option to return any number of new elements (not necessarily 2), and without triggering an action, of course, so that the code still runs in parallel. Is there any way to do this? I went over the RDD class methods a few times and nothing came up.
Another issue is the storage of the RDD (step 5). Is it done in parallel? Does each node store its part of the RDD to a different Hadoop block? Or is the data first forwarded to the Spark client app, which then stores it?
Upvotes: 0
Views: 275
Reputation: 7452
It seems like what you're looking for might be best done with something like reduceByKey, where you can remove the duplicates as you go for each sequence number (assuming that the resulting amount of data per sequence number isn't too large; in your example it seems pretty small). Sorting the results can be done with the standard sortBy operator.
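A minimal sketch of that idea, assuming an RDD of (seq, (timestamp, payload)) pairs as in your illustration (the names are mine):

    # Dedupe per sequence number, then order by it.
    deduped = tcp_data.reduceByKey(lambda a, b: a if a[0] <= b[0] else b)  # keep the earliest timestamp per seq
    ordered = deduped.sortBy(lambda kv: kv[0])                             # order by sequence number

Note that this sorts on the raw sequence number, so the wrap-around you describe would still need to be handled separately, e.g. per overflow cycle.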
Saving the data out to HDFS is indeed done in parallel on the workers; forwarding the data to the Spark client app would create a bottleneck and sort of defeat the purpose (although if you do want to bring the data back locally, you can use collect, provided that the data is pretty small).
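For example (continuing the sketch above; the output path is a placeholder):

    # Each worker writes its own partitions straight to HDFS as part-* files,
    # so nothing is funnelled through the driver.
    ordered.saveAsTextFile("hdfs:///path/to/output")

    # Only if the result is small enough to fit in the driver's memory:
    local_copy = ordered.collect()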
Upvotes: 1