BarZ

Reputation: 21

How to split string by more than one char in mongoDB

I'm trying to split a string into words so that I can later count, with map-reduce, how many words of each length it contains.

For example, for the sentence

SUPPOSING that Truth is a woman--what then?

I will get:

[
  {length:"1", number:"1"},
  {length:"2", number:"1"},
  {length:"4", number:"3"},
  {length:"5", number:"2"},
  {length:"9", number:"1"}
]

How can I do this?

Upvotes: 1

Views: 1254

Answers (2)

dnickless

Reputation: 10918

The answer to your question depends very much on your definition of a word. If a word is a consecutive sequence of A-Z or a-z characters only, then here is a completely nuts approach which, however, gives you the exact result you're asking for.

What this code effectively does is:

  1. Walk the input string character by character and replace every non-matching character with a space. The range check runs from 'A' to 'z', so it also lets through the six ASCII characters that sit between 'Z' and 'a'.
  2. Concatenate the resulting characters into one cleansed string.
  3. Split the cleansed string by the space character.
  4. Calculate the length of every word found.
  5. Group by length and count the instances.
  6. Beautify the output.
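
To make the steps above easier to follow, here is a minimal sketch of the same logic in plain JavaScript, outside of MongoDB. The function name wordLengthCounts is hypothetical, and the sketch assumes the strict definition of a word (A-Z or a-z only) instead of the 'A' to 'z' range check.

```javascript
// Plain JavaScript sketch of the six steps; wordLengthCounts is a hypothetical helper name.
function wordLengthCounts(text) {
    // Steps 1 and 2: keep letters, turn every other character into a space, join the result.
    const cleansed = Array.from(text)
        .map(ch => (/[A-Za-z]/.test(ch) ? ch : " "))
        .join("");

    // Step 3: split by the space character (this may yield empty strings).
    const words = cleansed.split(" ");

    // Steps 4 and 5: compute each word's length and count the words per length.
    const counts = new Map();
    for (const word of words) {
        const len = Array.from(word).length; // counts code points, similar to $strLenCP
        if (len === 0) continue; // drop empty strings left over from the split
        counts.set(len, (counts.get(len) || 0) + 1);
    }

    // Step 6: shape the output and sort it by length ascending.
    return Array.from(counts, ([length, number]) => ({ length, number }))
        .sort((a, b) => a.length - b.length);
}

console.log(wordLengthCounts("SUPPOSING that Truth is a woman--what then?"));
```

Empty strings show up whenever two separators are adjacent (as in "woman--what"), which is why zero-length entries have to be filtered out at some point.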

Given the following input document

{
    "text" : "SUPPOSING that Truth is a woman--what then?"
}

the following pipeline

db.collection.aggregate({
    $project: { // lots of magic to calculate an array that will hold the lengths of all words
        "lengths": {
            $map: { // translate a given word into its length
                input: {
                    $split: [ // split cleansed string by space character
                        { $reduce: { // join the characters that are between A and z
                                input: {
                                    $map: { // to traverse the original input string character by character
                                        input: {
                                            $range: [ 0, { $strLenCP: "$text" } ] // we want to traverse the entire string from index 0 all the way until the last character
                                        },
                                        as: "index",
                                        in: {
                                            $let: {
                                                vars: {
                                                    "char": { // temp. result which will be reused several times below
                                                        $substrCP: [ "$text", "$$index", 1 ] // the single character we look at in this loop
                                                    }
                                                },
                                                in: {
                                                    $cond: [ // some value that depends on whether the character we look at is between 'A' and 'z'
                                                        { $and: [
                                                            { $eq: [ { $cmp: [ "$$char", "@" /* ASCII 64,  65  would be 'A' */] },  1 ] }, // is our character greater than or equal to 'A'
                                                            { $eq: [ { $cmp: [ "$$char", "{" /* ASCII 123, 122 would be 'z' */] }, -1 ] }  // is our character less than    or equal to 'z' 
                                                        ]},
                                                        '$$char', // in which case that character will be taken
                                                        ' ' // and otherwise a space character to add a word boundary
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                },
                                initialValue: "", // starting with an empty string
                                in: {
                                    $concat: [ // we join all array values by means of concatenating
                                        "$$value", // the current value with
                                        "$$this"
                                    ]
                                }
                            }
                        },
                        " "
                    ]
                },
                as: "word",
                in: {
                    $strLenCP: "$$word" // we map a word into its length, e.g. "the" --> 3
                }
            }
        }
    }
}, {
    $unwind: "$lengths" // flatten the array which holds all our word lengths
}, {
    $group: {
        _id : "$lengths", // group by the length of our words
        "number": { $sum: 1 }  // count number of documents per group
    } 
}, {
    $match: {
        "_id": { $ne: 0 } // $split might leave us with strings of length 0 which we do not want in the result
    }
}, {
    $project: {
        "_id": 0, // remove the "_id" field
        "length" : "$_id", // length is our group key
        "number" : "$number" // and this is the number of findings
    }
}, {
    $sort: { "length": 1 } // sort by length ascending
})

will produce the desired output

[
    { "length" : 1, "number" : 1.0 },
    { "length" : 2, "number" : 1.0 },
    { "length" : 4, "number" : 3.0 },
    { "length" : 5, "number" : 2.0 },
    { "length" : 9, "number" : 1.0 }
]
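
If your server is MongoDB 4.2 or later, the character-by-character cleansing could probably be replaced by $regexFindAll, which returns every match of a regular expression as a document with a match field. The following is only a sketch under that version assumption; the name regexPipeline is hypothetical and "text" is the field from the input document above.

```javascript
// Assumes MongoDB 4.2 or later, where $regexFindAll is available.
// regexPipeline is a hypothetical name; "text" is the field of the sample input document.
const regexPipeline = [
    { $project: {
        lengths: {
            $map: {
                // each element is a document of the form { match, idx, captures }
                input: { $regexFindAll: { input: "$text", regex: "[A-Za-z]+" } },
                as: "m",
                in: { $strLenCP: "$$m.match" } // map a matched word to its length
            }
        }
    } },
    { $unwind: "$lengths" }, // one document per word length
    { $group: { _id: "$lengths", number: { $sum: 1 } } }, // count words per length
    { $project: { _id: 0, length: "$_id", number: 1 } },
    { $sort: { length: 1 } } // sort by length ascending
];
```

In the shell this would be run as db.collection.aggregate(regexPipeline). Because the regular expression only ever matches non-empty words, the $match stage that removes zero-length entries should not be needed here.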

Upvotes: 1

Vicctor

Reputation: 833

This sample aggregation will count words of the same length. Hope it will help you:

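db.some.deleteMany({}) // start from an empty collection (remove() is deprecated in newer shells)
db.some.insertOne({str:"red brown fox jumped over the hil"}) // insertOne() replaces the deprecated save()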

var res = db.some.aggregate(
    [
    { $project : { word : { $split: ["$str", " "] }} },
    { $unwind : "$word" },
    { $project : { len : { $strLenCP: "$word" }} },
    { $group : { _id : { len : "$len"}, same: {$push:"$len"}}},
    { $project : { len : "$_id.len", count : {$size : "$same"} }} // after $group the length lives in _id.len
    ]
)

printjson(res.toArray());
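
As a possible variation, the words can be counted directly with $sum instead of pushing every length into an array and taking its $size. This is a sketch only; countPipeline is a hypothetical name and "str" is the field used in the sample document above.

```javascript
// countPipeline is a hypothetical name; "str" is the field of the sample document above.
const countPipeline = [
    { $project: { word: { $split: ["$str", " "] } } }, // split the string by single spaces
    { $unwind: "$word" }, // one document per word
    { $group: { _id: { $strLenCP: "$word" }, count: { $sum: 1 } } }, // count words per length
    { $project: { _id: 0, len: "$_id", count: 1 } }, // expose the group key as "len"
    { $sort: { len: 1 } } // sort by length ascending
];
```

It would be run as db.some.aggregate(countPipeline). Like the pipeline above, it splits on a single space only, so punctuation stays attached to the words.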

Upvotes: 0
