Reputation: 29
I am using a Jupyter notebook (Google Colab) to try to extract data from a .7z file into a pandas DataFrame, using Linux commands. The data is from http://untroubled.org/spam/. I want to extract only the data from the 2020-01.7z file. So far I have:
!wget http://untroubled.org/spam/2020-01.7z
!7z x 2020-01.7z
import pandas as pd
import py7zr
archive = py7zr.SevenZipFile('2020-01.7z', mode='r')
archive.extractall(path="/tmp")
with open('2020-01.7z', 'r') as myfile:
    myfile.read()
mydf = pd.DataFrame(myfile)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 2: invalid start byte
I'm not really sure what "/tmp" means. I know there is a way to do this; I just don't yet understand these commands well enough to use them. Any help is appreciated.
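For context, a rough sketch of one possible direction, not a confirmed fix. The `path` argument of `extractall` is the directory the archive's contents are written to, so "/tmp" is just the destination folder. The error most likely comes from opening the archive itself as text: a .7z file is binary and starts with the bytes `7z\xbc\xaf`, which matches "byte 0xbc in position 2". The idea below is to read the extracted files rather than the archive. The helper name `read_extracted_files`, the column names, and the directory `/tmp/spam` are made up for illustration, and it assumes each extracted file is one message.

```python
from pathlib import Path


def read_extracted_files(root):
    """Return one dict per regular file found under `root` (hypothetical helper)."""
    rows = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Spam messages are not guaranteed to be valid UTF-8, so replace
            # undecodable bytes instead of raising UnicodeDecodeError.
            text = path.read_text(encoding="utf-8", errors="replace")
            rows.append({"filename": path.name, "text": text})
    return rows


# In the notebook, assuming the archive was extracted with
# archive.extractall(path="/tmp/spam") (the directory name is an assumption):
# import pandas as pd
# mydf = pd.DataFrame(read_extracted_files("/tmp/spam"))
```

Extracting into a dedicated directory instead of "/tmp" directly keeps unrelated temporary files out of the DataFrame.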
Upvotes: 0
Views: 1333