AnthonyBisong

Reputation: 1

Using Spark to merge two or more files and manipulate their contents

Can I use Spark to do the following? I have three files whose contents I need to merge and change:

  1. The first file, called column_header.tsv, with this content:

    first_name last_name address zip_code browser_type
    
  2. The second file, called data_file.tsv, with this content:

    John    Doe 111 New Drive, Ca   11111   34
    Mary    Doe 133 Creator Blvd, NY    44499   40
    Mike    Coder   13  Jumping Street  UT  66499   28
    
  3. The third file, called browser_type.tsv, with this content:

    34  Chrome
    40  Safari
    28  FireFox
    
  4. After Spark processes the above, the final_output.tsv file should have this content:

    first_name last_name address zip_code browser_type
    John    Doe 111 New Drive, Ca   11111   Chrome
    Mary    Doe 133 Creator Blvd, NY    44499   Safari
    Mike    Coder   13  Jumping Street  UT  66499   FireFox
    

Is this doable with Spark? I would also consider sed or awk if the task is possible with those tools. I know the above is possible with Python, but I would prefer to use Spark for the data manipulation. Any suggestions? Thanks in advance.

Upvotes: 0

Views: 475

Answers (2)

James Brown

Reputation: 37464

Here it is in awk, just in case. Notice the file order: the browser map must be read first, then the header and data files:

$ awk 'NR==FNR{ a[$1]=$2;next }{ $NF=($NF in a?a[$NF]:$NF) }1' browser_type.tsv column_header.tsv data_file.tsv

Output:

first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
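
Note that awk rebuilds each modified record with its default output field separator (a single space), so the output above is space- rather than tab-separated; adding BEGIN{FS=OFS="\t"} to the program preserves the tabs.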

Explained:

NR==FNR {                            # process browser_type file
    a[$1]=$2                         # map browser id ($1) to name ($2)
    next }                           # skip to the next record
{                                    # process the other files
    $NF=( $NF in a ? a[$NF] : $NF) } # replace last field with browser from a
1                                    # implicit print

Upvotes: 1

user7337271

Reputation: 1712

It is possible. Read the header:

with open("column_header.tsv") as fr:
    columns = fr.readline().split()

Read data_file.tsv:

users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)

Read browser_type.tsv:

browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")

Join:

users.join(browsers, "browser_type", "left").write.csv(path)
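
Putting the pieces together, here is a minimal end-to-end sketch. It assumes Spark 2.x or later, tab-delimited input files as in the question, and a hypothetical output directory named final_output; the joined frame still carries the numeric browser_type key, so it is dropped and browser_name is renamed before writing:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-tsv").getOrCreate()

# The header file supplies the column names for the data file
with open("column_header.tsv") as fr:
    columns = fr.readline().split()

users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")

# Replace the numeric id with the browser name, keeping the original column order
result = (users.join(browsers, "browser_type", "left")
          .drop("browser_type")
          .withColumnRenamed("browser_name", "browser_type")
          .select(*columns))

# Write tab-separated output with a header row
result.write.option("sep", "\t").option("header", True).csv("final_output")

Note that Spark writes a directory of part files rather than a single final_output.tsv; if the data is small enough, calling result.coalesce(1) before the write produces a single part file that can be renamed.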

Upvotes: 0
