Tinuz

Reputation: 1

Hive combine column values based upon condition

I was wondering if it is possible to combine column values based upon a condition. Let me explain...

Let's say my data looks like this

Id  name     offset
1   Jan      100
2   Janssen  104
3   Klaas    150
4   Jan      160
5   Janssen  164

And my output should be this

Id  fullname     offsets
1   Jan Janssen  [ 100, 160 ]

I would like to combine the name values from two rows where the end of the first name and the offset of the second are no more than one character apart. In the sample above, Jan at offset 100 is 3 characters long and so ends at 103, and Janssen starts at 104, so the two get merged into Jan Janssen.

My question is whether this type of data manipulation is possible with Hive, and if it is, could someone share some code and an explanation?

Please be gentle, but this little piece of code returns more or less what I want...

    // Needs: import java.io.File; import java.util.ArrayList; import java.util.Scanner;
    ArrayList<String> persons = new ArrayList<String>();

    String _previous = "";

    //Sample lines from entities.txt
    //USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
    //USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
    File file = new File("entities.txt");

    try {
        //
        // Create a new Scanner object which will read the data
        // from the file passed in. scanner.hasNextLine() tells us
        // whether there is another line to read; we read the file
        // line by line until it is exhausted.
        //
        Scanner scanner = new Scanner(file);
        while (scanner.hasNextLine()) {

            // compare strings with isEmpty()/equals(), not ==, and check for null first
            if (_previous == null || _previous.isEmpty())
                _previous = scanner.nextLine();

            String _current = scanner.nextLine();
            //x = end of the previous name (length + offset); merge when the
            //current name starts exactly one character after it
            int x = Integer.parseInt(_previous.split(",")[3]) + Integer.parseInt(_previous.split(",")[4]);
            int y = Integer.parseInt(_current.split(",")[4]);
            if(y-x == 1){
                persons.add(_previous.split(",")[1] + " " + _current.split(",")[1]);
                if(scanner.hasNextLine()){
                    _current = scanner.nextLine();
                }
            }else{
                persons.add(_previous.split(",")[1]);
            }
            _previous = _current;
        }
    } catch (Exception e) {
        e.printStackTrace();
    }

    for(String person : persons){
        System.out.println(person);
    }

Running it against this piece of sample data

USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Richard,PERSON,7,2732
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,2740
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,2756
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,3093
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,3195
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,3220
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,10858
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,11063
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Ken,PERSON,3,11186
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,11234
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,17073
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,17095
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Stephanie,PERSON,9,17330
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Putt,PERSON,4,17340

Which produces this output

Richard Marottoli
Marottoli
Marottoli
Marottoli
Berkowitz
Berkowitz
Marottoli
Lea
Lea
Ken
Marottoli
Berkowitz
Lea
Stephanie Putt

Kind regards

Upvotes: 0

Views: 1622

Answers (1)

Ram Manohar

Reputation: 1004

Load the table using the create table statement below

drop table if exists default.stack;
create external table default.stack (
  junk string,
  name string,
  cat string,
  len int,
  off int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://nameservice1/....';
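
If entities.txt is not already sitting at that HDFS location, one way to get it into the table is a plain LOAD DATA statement. This is only a sketch; the local path below is a placeholder, not a path from the question.

-- Sketch only: move the file into the external table's location.
-- '/tmp/entities.txt' is a placeholder path.
LOAD DATA LOCAL INPATH '/tmp/entities.txt' INTO TABLE default.stack;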

Use the query below to get your desired output.

select max(name), off
from (
  select case when b.name is not null
              then concat(b.name, " ", a.name)
              else a.name
         end as name,
         case when b.off1 is not null
              then b.off1
              else a.off
         end as off
  from default.stack a
  left outer join (select name,
                          len + off + 1 as off,
                          off as off1
                   from default.stack) b
  on a.off = b.off
) a
group by off
order by off;

I have tested this and it generates your desired result.
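
To also get the exact shape shown in the question (one row per full name with the list of its offsets), the same query can be wrapped once more and aggregated with collect_list. This is only a sketch on top of the query above, not something I have run; the fullname and offsets aliases are mine.

-- Sketch: wrap the query above (aliased as fullname/off) and collect
-- all offsets per combined name, mirroring "Jan Janssen [ 100, 160 ]".
select fullname, sort_array(collect_list(off)) as offsets
from (
  select max(name) as fullname, off
  from (
    select case when b.name is not null
                then concat(b.name, " ", a.name)
                else a.name
           end as name,
           case when b.off1 is not null
                then b.off1
                else a.off
           end as off
    from default.stack a
    left outer join (select name,
                            len + off + 1 as off,
                            off as off1
                     from default.stack) b
    on a.off = b.off
  ) x
  group by off
) t
group by fullname;

sort_array is there only because collect_list does not guarantee any ordering of the collected offsets.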

Upvotes: 1
