Reputation: 9
I am trying to find a particular line in a very large log file, and I am able to find it.
Now I want to split that line on spaces and build a DataFrame from the pieces, but I have not been able to. I tried the code below without success.
from pyspark import SparkConf,SparkContext
from pyspark import SQLContext
from pyspark.sql.types import *
from pyspark.sql import *
conf=SparkConf().setMaster("local").setAppName("invparsing")
sc=SparkContext(conf=conf)
sql=SQLContext(sc)
def f(x) :print(x)
data_frame_schema=StructType([
StructField("Typeof",StringType()),
#StructField("Produt_mod",StringType()),
#StructField("Col2",StringType()),
#StructField("Col3",StringType()),
#StructField("Col4",StringType()),
#StructField("Col5",StringType()),
])
path="C:/rk/IBMS/inv.log"
lines=sc.textFile(path)
NodeStr=lines.filter(lambda x:'Node :RBS6301' in x).map(lambda x:x.split(" +"))
NodeStr.foreach(f)
Nodedf=sql.createDataFrame(NodeStr,data_frame_schema)
Nodedf.show(truncate=False)
This is the output I get: the whole line as a single string in one column. I want to split the value on spaces.
[u'Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)']
+-------------------------------------------------------------+
|Typeof                                                       |
+-------------------------------------------------------------+
|Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)       |
+-------------------------------------------------------------+
Expected output:
Typeof  Produt_mod  Col2        Col3  Col4      Col5
Node    RBS6301     XP10521/26  R30F  L17A.4-6  C17.0_LSV_PS4
Upvotes: 1
Views: 5646
Reputation: 35229
The first mistake you made is here:
lambda x:x.split(" +")
str.split takes a literal separator string, not a regular expression, so " +" matches only a space followed by a plus sign. To split on whitespace, just omit the separator:
lines = sc.parallelize(["Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)"])
lines.map(lambda s: s.split()).first()
# ['Node:', 'RBS6301', 'XP10521/26', 'R30F', 'L17A.4-6', '(C17.0_LSV_PS4)']
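To see why the original lambda returned the whole line as one element, here is a small plain-Python sketch (no Spark needed) comparing the literal separator with the no-argument form. The `re.split` line is only an illustrative alternative for cases where a pattern is really wanted; it is not part of the original code.

```python
import re

line = "Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)"

# " +" is taken literally (a space followed by a plus sign). That sequence
# never occurs in the line, so the whole line comes back as one element.
literal = line.split(" +")

# With no argument, str.split splits on runs of whitespace.
whitespace = line.split()

# re.split is the regex-based alternative if a pattern is needed.
regex = re.split(r" +", line)

print(len(literal), len(whitespace), len(regex))
```

The first call yields a one-element list, which is why the DataFrame in the question ended up with the entire line in a single column.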
Once you've done that, you can filter and convert to a DataFrame:
df = lines.map(lambda s: s.split()).filter(lambda x: len(x) == 6).toDF(
["col1", "col2", "col3", "col4", "col5", "col6"]
)
df.show()
# +-----+-------+----------+----+--------+---------------+
# | col1| col2| col3|col4| col5| col6|
# +-----+-------+----------+----+--------+---------------+
# |Node:|RBS6301|XP10521/26|R30F|L17A.4-6|(C17.0_LSV_PS4)|
# +-----+-------+----------+----+--------+---------------+
and then filter on a column:
df[df["col2"] == "RBS6301"].show()
# +-----+-------+----------+----+--------+---------------+
# | col1| col2| col3|col4| col5| col6|
# +-----+-------+----------+----+--------+---------------+
# |Node:|RBS6301|XP10521/26|R30F|L17A.4-6|(C17.0_LSV_PS4)|
# +-----+-------+----------+----+--------+---------------+
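The expected output in the question also drops the colon after "Node" and the parentheses around the last field. A minimal sketch of that cleanup follows, assuming every matching line has exactly six whitespace-separated fields; `parse_node_line` is a hypothetical helper name, and the column names are taken from the schema in the question.

```python
def parse_node_line(line):
    """Split a 'Node: ...' line into six cleaned fields, or return None."""
    fields = line.split()
    if len(fields) != 6:
        # Assumption: lines with a different field count are skipped.
        return None
    fields[0] = fields[0].rstrip(":")   # "Node:" -> "Node"
    fields[5] = fields[5].strip("()")   # "(C17.0_LSV_PS4)" -> "C17.0_LSV_PS4"
    return tuple(fields)


# Column names as spelled in the question's schema.
columns = ["Typeof", "Produt_mod", "Col2", "Col3", "Col4", "Col5"]

row = parse_node_line("Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)")
print(dict(zip(columns, row)))
```

In Spark, the same function could be plugged into the pipeline above, for example `lines.map(parse_node_line).filter(lambda r: r is not None).toDF(columns)`.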
Upvotes: 2