Reputation: 1053
I am trying to extract English titles from a wiki titles dump stored in a text file, using regex in Python 3. The dump also contains titles in other languages and some symbols. Below is my code:
import re

with open('/Users/some/directory/title.txt', 'rb') as f:
    text = f.read()
letters_only = re.sub(b"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
But I am getting an error:
TypeError: sequence item 1: expected a bytes-like object, str found
at the line: letters_only = re.sub(b"[^a-zA-Z]", " ", text)
But I am using b'' to make the output a bytes type. Below is a sample of the text file:
Destroy-Oh-Boy!!
!!Que_Corra_La_Voz!!
!!_(chess)
!!_(disambiguation)
!'O!Kung
!'O!Kung_language
!'O-!khung_language
!337$P34K
!=
!?
!?!
!?Revolution!?
!?_(chess)
!A_Luchar!
!Action_Pact!
!Action_pact!
!Adios_Amigos!
!Alabadle!
!Alarma!
!Alarma!_(album)
!Alarma!_(disambiguation)
!Alarma!_(magazine)
!Alarma!_Records
!Alarma!_magazine
!Alfaro_Vive,_Carajo!
!All-Time_Quarterback!
!All-Time_Quarterback!_(EP)
!All-Time_Quarterback!_(album)
!Alla_tu!
!Amigos!
!Amigos!_(Arrested_Development_episode)
!Arriba!_La_Pachanga
!Ask_a_Mexican!
!Atame!
!Ay,_Carmela!_(film)
!Ay,_caramba!
!BANG!
!Bang!
!Bang!_TV
!Basta_Ya!
!Bastardos!
!Bastardos!_(album)
!Bastardos_en_Vivo!
!Bienvenido,_Mr._Marshall!
!Ciauetistico!
!Ciautistico!
!DOCTYPE
!Dame!_!Dame!_!Dame!
!Decapitacion!
!Dos!
!Explora!_Science_Center_and_Children's_Museum
!F
!Forward,_Russia!
!Forward_Russia!
!Ga!ne_language
!Ga!nge_language
!Gã!ne
!Gã!ne_language
!Gã!nge_language
!HERO
!Happy_Birthday_Guadaloupe!
!Happy_Birthday_Guadalupe!
!Hello_Friends
I have searched online but could not find a solution. Any help will be appreciated.
Upvotes: 8
Views: 16002
Reputation: 3244
You can also use br'…', which is the bytes analog of r'…'. The replacement must also be a bytes string.
letters_only = re.sub(br'[^a-zA-Z]', b' ', text)
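To illustrate, here is a minimal sketch of the all-bytes route. The sample data is a hypothetical stand-in for what f.read() returns in 'rb' mode, and the final decode step is an assumption about what you might want next, not part of the original code.

```python
import re

# Hypothetical sample standing in for the bytes returned by f.read() in 'rb' mode.
text = b"!Action_Pact!\n!Alarma!_(album)\n"

# Pattern and replacement are both bytes, matching the type of `text`.
letters_only = re.sub(br"[^a-zA-Z]", b" ", text)
words = letters_only.lower().split()
print(words)  # a list of bytes objects

# Only ASCII letters survive the substitution, so decoding as ASCII is safe here.
decoded = [w.decode("ascii") for w in words]
print(decoded)  # a list of str objects
```

Every element of words is a bytes object, so a decode like the last step is needed if the rest of the program expects str.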
Upvotes: 0
Reputation: 160427
The problem is with the repl argument you supply: it isn't a bytes object:
letters_only = re.sub(b"[^a-zA-Z]", " ", b'Hello2World')
# TypeError: sequence item 1: expected a bytes-like object, str found
Instead, supply repl as a bytes instance, b" ":
letters_only = re.sub(b"[^a-zA-Z]", b" ", b'Hello2World')
print(letters_only)
b'Hello World'
Note: Don't prefix your literals with b and don't open the file with rb if you aren't looking for byte sequences.
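If the data has already arrived as bytes, one way to get back to plain strings is to decode it first and then use str literals throughout. This is only a sketch: the sample bytes are hypothetical, and UTF-8 is an assumption about how the dump is encoded.

```python
import re

# Hypothetical bytes, as they might come from a file opened with 'rb'.
raw = "!Gã!ne_language\n!Hello_Friends\n".encode("utf-8")

# Assumption: the dump is UTF-8 encoded. Decoding gives a str.
text = raw.decode("utf-8")

# With a str subject, the pattern and the replacement are str as well.
letters_only = re.sub(r"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
```

Note that the non-ASCII letter ã is replaced like any other symbol, because the character class only admits a-z and A-Z.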
Upvotes: 9
Reputation: 1089
You can't use a bytes string for your regex pattern when the replacement string isn't one too.
Essentially, you can't mix different object types (bytes and str) in most operations. In your code above, you are using a bytes search pattern and bytes text, but your replacement is a regular str. All arguments need to be of the same type, so there are 2 possible solutions: make all of them bytes, or make all of them str.
Taking the above into account, your code could look like this (this will return regular strings, not bytes objects):
import re

with open('/Users/some/directory/title.txt', 'r') as f:
    text = f.read()
letters_only = re.sub(r"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
Note that the code uses a special type of string literal for the regex: a raw string, prefixed with r. This means that Python won't treat the backslash (\) as the start of an escape sequence, which is very useful for regexes. See the docs for more details about raw strings.
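A quick sketch of what the r prefix changes. The title used below is taken from the sample in the question; the \d+ pattern is just an illustration and is not part of the code above.

```python
import re

# In a normal literal "\n" is one newline character;
# in a raw literal it is a backslash followed by the letter n.
normal_len = len("\n")
raw_len = len(r"\n")
print(normal_len, raw_len)

# A raw string lets you write regex escapes without doubling the backslash.
same_pattern = r"\d+" == "\\d+"
print(same_pattern)

# Illustrative pattern: pull the digit runs out of one of the sample titles.
digits = re.findall(r"\d+", "!337$P34K")
print(digits)
```

The pattern in the answer, [^a-zA-Z], contains no backslashes, so the r prefix makes no difference there; it is simply a good habit for regex literals.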
Upvotes: 2
Reputation: 140178
You have to choose between binary and text mode.
Either you open your file as rb and then you can use re.sub(b"[^a-zA-Z]", b" ", text) (text is a bytes object)
Or you open your file as r and then you can use re.sub("[^a-zA-Z]", " ", text) (text is a str object)
The second solution is more "classical".
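Here is a small side-by-side sketch of the two options. It writes a hypothetical two-line sample to a temporary file instead of using your real path, and it assumes the file is UTF-8 encoded.

```python
import os
import re
import tempfile

# Hypothetical sample written to a temporary file (stands in for title.txt).
sample = "!Bang!_TV\n!Gã!ne\n"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Option 1: binary mode, so text, pattern and replacement are all bytes.
    with open(path, "rb") as f:
        bytes_words = re.sub(b"[^a-zA-Z]", b" ", f.read()).lower().split()

    # Option 2: text mode, so text, pattern and replacement are all str.
    # Assumption: the file is UTF-8 encoded.
    with open(path, "r", encoding="utf-8") as f:
        str_words = re.sub("[^a-zA-Z]", " ", f.read()).lower().split()
finally:
    os.remove(path)

print(bytes_words)
print(str_words)
```

Both options find the same words; the only difference is whether you get bytes or str objects back.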
Upvotes: 4