PySpark || Fresco Play || 58339

PySpark - Creating a DataFrame

In this hands-on, you will start working on PySpark basics. Follow the instructions below to complete the hands-on:

STEPS:

Step 1: Import the SparkSession package.

Step 2: Create a SparkSession object.

Step 3: Create a DataFrame using the SparkSession object, and display the DataFrame.

Step 4: Create a simple passenger Row type with four fields: Name, age, source, destination.

Step 5: Create two passenger records with the following values:

  • David, 22, London, Paris
  • Steve, 22, New York, Sydney

Step 6: Display the DataFrame to check the data.

NOTE:

Don't edit the line at the bottom of the answer.py file; it saves the created DataFrame to a file.

Steps to complete the Handson:

1. Install all the necessary dependencies using the INSTALL option from the Project menu.

2. Run your solution using the RUN -> Run option.

3. Run the tests to check using the TEST option from the Project menu.

#IMP: To execute the code in answer.py, open a new terminal and run the command "spark-submit answer.py".

Git Instructions

Use the following commands to work with this project

spark-submit answer.py

spark-submit score.py

echo "Installation not needed"

Solution

from pyspark.sql import SparkSession, Row

# Steps 1-2: create the SparkSession (the entry point for DataFrame work)
spark = SparkSession.builder.appName("Example App").getOrCreate()

# Step 4: a Row "template" with the four passenger fields
Passenger = Row("Name", "age", "source", "destination")

# Step 5: two passenger records
p1 = Passenger("David", 22, "London", "Paris")
p2 = Passenger("Steve", 22, "New York", "Sydney")

# Steps 3 and 6: build the DataFrame and display it
passData = [p1, p2]
df = spark.createDataFrame(passData)
df.show()

# Don't remove this line
df.coalesce(1).write.parquet("PassengerData")
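If you don't have a Spark shell handy, the positional-record pattern that `Row` uses can be previewed with Python's `collections.namedtuple`, which behaves much the same way. This is an analogy only, not part of the graded solution:

```python
from collections import namedtuple

# Analogous to Row("Name", "age", "source", "destination"):
# a lightweight record type with named, ordered fields
Passenger = namedtuple("Passenger", ["Name", "age", "source", "destination"])

p1 = Passenger("David", 22, "London", "Paris")
p2 = Passenger("Steve", 22, "New York", "Sydney")

# Fields are accessible by name or by position, just like a pyspark Row
print(p1.Name)   # David
print(p2[2])     # New York
```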

Reading files in PySpark

In this hands-on, you will continue working on PySpark basics. Follow the instructions below to complete the hands-on:

STEPS:

Step 1: Import the SparkSession package.

Step 2: Create a SparkSession object.

Step 3: Read the JSON file and create a DataFrame from the JSON data. Display the DataFrame, then save it to a parquet file named Employees.

Step 4: From the DataFrame, display the associates who are mapped to the `JAVA` stream. Save the resultant DataFrame to a parquet file named JavaEmployees.

NOTE:

1. Use coalesce to store the data frame as a single partition.

2. Ensure to use the exact naming convention for the result folders.

Steps to complete the Handson:

1. Run your solution using the RUN option from Project menu.

2. Run the tests to check using the TEST option from the Project menu.

Git Instructions

Use the following commands to work with this project

spark-submit answer.py

spark-submit score.py

echo "No need to install"

Solution

from pyspark.sql import SparkSession

# Steps 1-2: create the SparkSession
spark = SparkSession.builder.appName("Example App").getOrCreate()

# Step 3: read the JSON file into a DataFrame, display it, and save it
df = spark.read.load("emp.json", format="json")
df.show()
df.coalesce(1).write.parquet("Employees")

# Step 4: associates mapped to the JAVA stream
df_java = df.filter(df.stream == 'JAVA')
df_java.show()
df_java.coalesce(1).write.parquet("JavaEmployees")
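Spark's JSON reader expects the input in JSON Lines form by default: one JSON object per line, not a top-level array. The real emp.json ships with the hands-on and isn't shown here; a hypothetical stand-in (assuming only that each record has a `stream` field) would look like this:

```python
import json

# Hypothetical emp.json contents -- the names and fields here are made up.
# Each line must parse as a standalone JSON object (JSON Lines format).
sample = '{"name": "Asha", "stream": "JAVA"}\n{"name": "Ravi", "stream": "PYTHON"}'

records = [json.loads(line) for line in sample.splitlines()]

# The equivalent of df.filter(df.stream == 'JAVA') in plain Python
java_recs = [r for r in records if r["stream"] == "JAVA"]
print(java_recs)   # [{'name': 'Asha', 'stream': 'JAVA'}]
```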

Statistical and Mathematical Functions with DataFrames in Apache Spark

In this hands-on, you will continue working on PySpark basics. Follow the instructions below to complete the hands-on:

STEPS:

Step 1: Import the required pyspark packages (SparkSession, Row, and the rand function).

Step 2: Create a SparkSession object.

Step 3: Create a column of 10 random values and name it rand1.

Step 4: Create another column of 10 random values and name it rand2.

Step 5: Calculate the covariance and correlation between these two columns.

Step 6: Create a new DataFrame with the header names "Stats" and "Value".

Step 7: Fill the new DataFrame with the obtained values, labelled "Co-variance" and "Correlation".

Step 8: Save the resultant DataFrame to a CSV file with the name Result.

NOTE:

1. Use coalesce to store the data frame as a single partition.

2. Ensure to use the exact naming convention for the result files.

Steps to complete the Handson:

1. Run your solution using the RUN option from Project menu.

2. Run the tests to check using the TEST option from the Project menu.

#IMP: To execute the code in answer.py, open a new terminal and run the command "spark-submit answer.py".

Git Instructions

Use the following commands to work with this project

spark-submit answer.py

spark-submit score.py

bash install.sh

Solution

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import rand

# Steps 1-2: create the SparkSession
spark = SparkSession.builder.appName("Example App").getOrCreate()

# Steps 3-4: ten rows with two columns of random values
df = spark.range(0, 10).withColumn('rand1', rand(seed=11)).withColumn('rand2', rand(seed=31))

# Steps 5-7: compute covariance and correlation, collect them into a DataFrame
Stat = Row("Stats", "Value")
stat1 = Stat("Co-variance", df.stat.cov('rand1', 'rand2'))
stat2 = Stat("Correlation", df.stat.corr('rand1', 'rand2'))
stat_df = spark.createDataFrame([stat1, stat2])

# Step 8: save as a single-partition CSV (per the NOTE, use coalesce)
stat_df.coalesce(1).write.csv("Result", header=True)
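For intuition, df.stat.cov returns the sample covariance (n - 1 denominator) and df.stat.corr the Pearson correlation. The same quantities can be computed in plain Python; the numbers below are made up and only stand in for the rand1/rand2 columns:

```python
from math import sqrt

# Made-up sample data standing in for the rand1/rand2 columns
xs = [0.1, 0.4, 0.35, 0.8]
ys = [0.2, 0.5, 0.30, 0.9]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Sample covariance (n - 1 denominator), matching df.stat.cov
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Pearson correlation, matching df.stat.corr
sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
corr = cov / (sx * sy)

print(round(cov, 4), round(corr, 4))
```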

More operations in PySpark

In this hands-on, you will continue working on PySpark basics. Follow the instructions below to complete the hands-on:

STEPS:

Step 1: Import the SparkSession package.

Step 2: Create a SparkSession object.

Step 3: Create a DataFrame with the following details under the headers "ID", "Name", "Age", and "Area of Interest".

Step 4: Fill the Dataframe with the following data:

  • "1","Jack", 22,"Data Science"
  • "2","Luke", 21,"Data Analytics"
  • "3","Leo", 24,"Micro Services"
  • "4","Mark", 21,"Data Analytics"

Step 5: Use the describe method on the Age column, observe the statistical parameters, and save the result into a parquet file under a folder named "Age" inside /projects/challenge/.

Step 6: Select the columns ID, Name, and Age, with Name in descending order. Save the result into a parquet file under a folder named "NameSorted" inside /projects/challenge/.

NOTE:

1. Use coalesce to store the data frame as a single partition.

2. Ensure to use the exact naming convention for the result folders.

Steps to complete the Handson:

1. Run your solution using the RUN option from Project menu.

2. Run the tests to check using the TEST option from the Project menu.

Git Instructions

Use the following commands to work with this project

spark-submit answer.py

spark-submit score.py

bash install.sh

Solution

from pyspark.sql import SparkSession, Row

# Steps 1-2: create the SparkSession
spark = SparkSession.builder.appName("Example App").getOrCreate()

# Steps 3-4: build the DataFrame
Employee = Row("ID", "Name", "Age", "Area of Interest")
emp1 = Employee("1", "Jack", 22, "Data Science")
emp2 = Employee("2", "Luke", 21, "Data Analytics")
emp3 = Employee("3", "Leo", 24, "Micro Services")
emp4 = Employee("4", "Mark", 21, "Data Analytics")
empData = [emp1, emp2, emp3, emp4]
df = spark.createDataFrame(empData)

# Step 5: summary statistics for the Age column
df.describe("Age").coalesce(1).write.parquet("Age")

# Step 6: ID, Name, and Age only, with Name in descending order
df.select("ID", "Name", "Age").sort("Name", ascending=False).coalesce(1).write.parquet("NameSorted")
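The select-then-sort step can be previewed without Spark: take only the wanted fields and order by Name descending. This plain-Python sketch mirrors the DataFrame calls, not Spark itself:

```python
# Rows as plain tuples: (ID, Name, Age, Area of Interest)
rows = [
    ("1", "Jack", 22, "Data Science"),
    ("2", "Luke", 21, "Data Analytics"),
    ("3", "Leo", 24, "Micro Services"),
    ("4", "Mark", 21, "Data Analytics"),
]

# "select" ID, Name, Age, then "sort" by Name descending
selected = [(r[0], r[1], r[2]) for r in rows]
name_sorted = sorted(selected, key=lambda r: r[1], reverse=True)
print(name_sorted)   # Mark, Luke, Leo, Jack
```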