Reputation: 103
I am collecting tweets from Twitter using Erlang, and I am trying to save only the hashtags to a database. However, when I convert the bitstrings to list strings, all tweets with non-Latin letters are converted to strange symbols. Is there a way to check whether a string contains only alphanumeric characters in Erlang?
Upvotes: 4
Views: 3591
Reputation: 13154
There are three io_lib functions specifically for this:
io_lib:printable_list/1
io_lib:printable_latin1_list/1
io_lib:printable_unicode_list/1
Here is an example of one in use:
-spec show_message(WxParent, Message) -> ok
    when WxParent :: wx:wx_object(),
         Message  :: unicode:chardata() | term().
show_message(WxParent, Message) ->
    %% Printable Unicode strings are shown as text (~ts);
    %% any other term is pretty-printed (~tp).
    Format =
        case io_lib:printable_unicode_list(Message) of
            true  -> "~ts";
            false -> "~tp"
        end,
    Modal = wxMessageDialog:new(WxParent, io_lib:format(Format, [Message])),
    _ = wxMessageDialog:showModal(Modal),
    ok = wxMessageDialog:destroy(Modal).
Check out the io_lib docs: http://www.erlang.org/doc/man/io_lib.html#printable_list-1
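To see how the functions differ, here is a minimal sketch that classifies a term by the narrowest check it passes. The module name printable_demo and the function classify/1 are made up for illustration; only the io_lib calls come from OTP. Note that these functions test printability, not "alphanumeric": a string like "#tag!" is printable too. printable_list/1 is left out because its result depends on how the emulator was started (the +pc flag).

```erlang
-module(printable_demo).
-export([classify/1]).

%% classify/1 is a hypothetical helper, not part of io_lib.
%% It reports the narrowest printable class a term falls into.
-spec classify(term()) -> latin1 | unicode | not_printable.
classify(Term) ->
    case io_lib:printable_latin1_list(Term) of
        true ->
            latin1;
        false ->
            case io_lib:printable_unicode_list(Term) of
                true  -> unicode;
                false -> not_printable
            end
    end.
```

A list of Cyrillic code points such as [1084,1072,1084,1072] should come back as unicode, while a binary comes back as not_printable because these functions only accept lists.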
Addendum
Because this subject isn't always easy to research in Erlang, a related but slightly broader Q/A might be of interest:
How to check whether input is a string in Erlang?
Upvotes: 3
Reputation: 1380
The easiest way is to use regular expressions.
StringAlphanum = "1234abcZXYM".
StringNotAlphanum = "1ZXYMÄ#kMp&?".
re:run(StringAlphanum, "^[0-9A-Za-z]+$").
>> {match,[{0,11}]}
re:run(StringNotAlphanum, "^[0-9A-Za-z]+$").
>> nomatch
You can easily make a function out of it...
isAlphaNum(String) ->
    case re:run(String, "^[0-9A-Za-z]+$") of
        {match, _} -> true;
        nomatch -> false
    end.
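If hashtags in other scripts should count as alphanumeric too, the same idea can be sketched with Unicode character properties. This is an assumption about what "alphanumeric" should mean here, not part of the answer above; the module name alnum_re and the function is_alphanum/1 are hypothetical. The input is expected to be valid Unicode character data (a list of code points or a UTF-8 binary); with the unicode option, re may raise badarg on invalid input.

```erlang
-module(alnum_re).
-export([is_alphanum/1]).

%% is_alphanum/1 is a hypothetical name, not an OTP function.
-spec is_alphanum(unicode:chardata()) -> boolean().
is_alphanum(String) ->
    %% \p{L} matches a letter and \p{N} a number in any script.
    %% \z anchors at the very end, so a trailing newline is rejected.
    %% The unicode option makes re treat subject and pattern as Unicode;
    %% {capture, none} reduces the result to match | nomatch.
    Pattern = "^[\\p{L}\\p{N}]+\\z",
    case re:run(String, Pattern, [unicode, {capture, none}]) of
        match   -> true;
        nomatch -> false
    end.
```

Unlike the ASCII-only pattern, this accepts a string such as [228,105,116,105] (the code points of "äiti") while still rejecting "#" and other punctuation.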
But in my opinion, the better way is to solve the underlying problem: the correct interpretation of Unicode binary strings.
If you want to represent Unicode characters correctly, do not use binary_to_list. Use the unicode module instead. A Unicode binary string cannot be interpreted naively byte by byte: UTF-8, for example, is a variable-length encoding, and the high bits of a character's first byte determine whether it is a multi-byte character.
I took the following example from this site. Let's define a UTF-8 string:
Utf8String = <<195, 164, 105, 116, 105>>.
Interpreted naively as binary, it yields:
binary_to_list(Utf8String).
"Ã¤iti"
Interpreted with unicode-support:
unicode:characters_to_list(Utf8String, utf8).
"äiti"
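Tweets can contain truncated or invalid byte sequences, and unicode:characters_to_list/2 reports those with error and incomplete tuples instead of a list. A minimal sketch of a wrapper that handles all three outcomes follows; the module name hashtag_decode, the function to_string/1, and the invalid_utf8 atom are assumptions for illustration.

```erlang
-module(hashtag_decode).
-export([to_string/1]).

%% to_string/1 is a hypothetical helper: it decodes a UTF-8 binary into
%% a list of Unicode code points, or reports that the bytes are not
%% valid UTF-8.
-spec to_string(binary()) -> {ok, string()} | {error, invalid_utf8}.
to_string(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        List when is_list(List) ->
            {ok, List};
        {error, _Decoded, _Rest} ->
            %% An invalid byte sequence was found.
            {error, invalid_utf8};
        {incomplete, _Decoded, _Rest} ->
            %% The binary ends in the middle of a multi-byte character.
            {error, invalid_utf8}
    end.
```

For the binary above, hashtag_decode:to_string(Utf8String) gives {ok, [228,105,116,105]}, which is the code point list for "äiti".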
Upvotes: 1
Reputation: 131
For Latin characters you can use this function:
is_alpha([Char | Rest]) when Char >= $a, Char =< $z ->
    is_alpha(Rest);
is_alpha([Char | Rest]) when Char >= $A, Char =< $Z ->
    is_alpha(Rest);
is_alpha([Char | Rest]) when Char >= $0, Char =< $9 ->
    is_alpha(Rest);
is_alpha([]) ->
    true;
is_alpha(_) ->
    false.
For other scripts, you can add clauses that cover their code point ranges.
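As a sketch of what such an extension could look like, the version below adds the Latin-1 letters and the basic Cyrillic alphabet. The module name alnum_ranges, the function names, and the choice of ranges are assumptions for illustration. It expects a list of Unicode code points (for example from unicode:characters_to_list/2), not raw UTF-8 bytes.

```erlang
-module(alnum_ranges).
-export([is_alnum/1]).

%% is_alnum/1 is a hypothetical extension of the is_alpha/1 idea.
%% Like the original, it returns true for the empty list.
-spec is_alnum(term()) -> boolean().
is_alnum([]) ->
    true;
is_alnum([Char | Rest]) ->
    case is_alnum_char(Char) of
        true  -> is_alnum(Rest);
        false -> false
    end;
is_alnum(_) ->
    false.

%% One clause per accepted code point range.
is_alnum_char(C) when C >= $a, C =< $z -> true;
is_alnum_char(C) when C >= $A, C =< $Z -> true;
is_alnum_char(C) when C >= $0, C =< $9 -> true;
%% Latin-1 letters, skipping the multiplication sign (16#D7)
%% and the division sign (16#F7).
is_alnum_char(C) when C >= 16#C0, C =< 16#D6 -> true;
is_alnum_char(C) when C >= 16#D8, C =< 16#F6 -> true;
is_alnum_char(C) when C >= 16#F8, C =< 16#FF -> true;
%% Basic Cyrillic alphabet, upper and lower case.
is_alnum_char(C) when C >= 16#0410, C =< 16#044F -> true;
is_alnum_char(_) -> false.
```

Hand-written ranges get long quickly, so this approach fits best when only a few known scripts need to be accepted.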
Upvotes: 3