PySpark - Creating a DataFrame
In this hands-on, you will start working on PySpark basics. Follow the instructions below to complete the hands-on:
STEPS:
Step 1: Import the SparkSession package.
Step 2: Create a SparkSession object.
Step 3: Create a DataFrame using the SparkSession object, and display the DataFrame.
Step 4: Create a simple Passenger instance with 4 fields: Name, age, source, destination.
Step 5: Create 2 Passenger records with the following values:
- David, 22, London, Paris
- Steve, 22, New York, Sydney
Step 6: Display the DataFrame to check the data.
NOTE:
Don't edit the line at the bottom of the answer.py file, which saves the created DataFrame to a file.
Steps to complete the hands-on:
1. Install all the necessary dependencies using the INSTALL option from the Project menu.
2. Run your solution using the RUN -> Run option.
3. Run the tests using the TEST option from the Project menu.
#IMP: To execute your code in the answer.py file, open a new terminal and run the command "spark-submit answer.py".
Git Instructions
Use the following commands to work with this project
spark-submit answer.py
spark-submit score.py
echo "Installation not needed"
Solution
from pyspark.sql import SparkSession, Row

# Create a SparkSession, the entry point for DataFrame operations
spark = SparkSession.builder.appName("Example App").getOrCreate()

# Define a Passenger row type with the 4 required fields
Passenger = Row("Name", "age", "source", "destination")
p1 = Passenger("David", 22, "London", "Paris")
p2 = Passenger("Steve", 22, "New York", "Sydney")

# Build the DataFrame from the two passenger records and display it
passData = [p1, p2]
df = spark.createDataFrame(passData)
df.show()

# Don't remove this line
df.coalesce(1).write.parquet("PassengerData")
Reading files in PySpark
In this hands-on, you will start working on PySpark basics. Follow the instructions below to complete the hands-on:
STEPS:
Step 1: Import the SparkSession package.
Step 2: Create a SparkSession object.
Step 3: Read the JSON file and create a DataFrame with the JSON data. Display the DataFrame. Save the DataFrame to a parquet folder named Employees.
Step 4: From the DataFrame, display the associates who are mapped to the `JAVA` stream. Save the resulting DataFrame to a parquet folder named JavaEmployees.
NOTE:
1. Use coalesce to store the DataFrame as a single partition (see the sketch after this note).
2. Be sure to use the exact naming convention for the result folders.
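Why coalesce(1)? Spark writes one part file per partition, so an unreduced write can scatter the data across several part files inside the output folder; coalesce(1) merges everything into a single partition first (without a full shuffle), so the folder holds one part file. A minimal sketch with a throwaway DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Coalesce Sketch").getOrCreate()

df = spark.range(0, 100)  # small example DataFrame
print(df.rdd.getNumPartitions())              # often > 1; each partition becomes a part file on write
print(df.coalesce(1).rdd.getNumPartitions())  # 1, so the write emits a single part file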
Steps to complete the hands-on:
1. Run your solution using the RUN option from the Project menu.
2. Run the tests using the TEST option from the Project menu.
Git Instructions
Use the following commands to work with this project
spark-submit answer.py
spark-submit score.py
echo "No need to install"
Solution
# Put your code here
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example App").getOrCreate()

# Read the JSON file into a DataFrame and display it
df = spark.read.load("emp.json", format="json")
df.show()

# Save the full DataFrame as a single-partition parquet folder
df.coalesce(1).write.parquet("Employees")

# Keep only the associates mapped to the JAVA stream and save the result
df_java = df.filter(df.stream == 'JAVA')
df_java.coalesce(1).write.parquet("JavaEmployees")
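A quick optional check (not part of the graded solution): read JavaEmployees back and confirm every saved row belongs to the JAVA stream. This assumes the folders were written by the run above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Verify JavaEmployees").getOrCreate()

java_df = spark.read.parquet("JavaEmployees")
java_df.show()
# No row outside the JAVA stream should have been saved
print(java_df.filter(java_df.stream != 'JAVA').count())  # expected: 0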
Statistical and Mathematical Functions with DataFrames in Apache Spark
In this hands-on, you will start working on PySpark basics. Follow the instructions below to complete the hands-on:
STEPS:
Step 1: Import the pyspark and SparkSession packages.
Step 2: Create a SparkSession object.
Step 3: Create 10 random values as a column and name the column rand1.
Step 4: Create another 10 random values as a column and name the column rand2.
Step 5: Calculate the covariance and correlation between these two columns (see the sketch after these steps).
Step 6: Create a new DataFrame with the header names "Stats" and "Value".
Step 7: Fill the new DataFrame with the obtained values, labeled "Co-variance" and "Correlation".
Step 8: Save the resulting DataFrame to a CSV file named Result.
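For intuition before coding the solution: df.stat.corr computes the Pearson correlation and df.stat.cov the sample covariance, so two perfectly linearly related columns give a correlation of exactly 1.0. A small sketch with made-up values (the column names x and y are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Stats Sketch").getOrCreate()

# y = 2 * x, a perfect linear relationship
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], ["x", "y"])

print(df.stat.corr("x", "y"))  # 1.0 (Pearson correlation)
print(df.stat.cov("x", "y"))   # 2.0 (sample covariance)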
NOTE:
1. Use coalesce to store the DataFrame as a single partition.
2. Be sure to use the exact naming convention for the result files.
Steps to complete the hands-on:
1. Run your solution using the RUN option from the Project menu.
2. Run the tests using the TEST option from the Project menu.
#IMP: To execute your code in the answer.py file, open a new terminal and run the command "spark-submit answer.py".
Git Instructions
Use the following commands to work with this project
spark-submit answer.py
spark-submit score.py
bash install.sh
Solution
# Put your code here
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName("Example App").getOrCreate()

# Two columns of 10 random values each, seeded for reproducibility
df = spark.range(0, 10).withColumn('rand1', rand(seed=11)).withColumn('rand2', rand(seed=31))

# Covariance and correlation between the two random columns
Stat = Row("Stats", "Value")
stat1 = Stat("Co-variance", df.stat.cov('rand1', 'rand2'))
stat2 = Stat("Correlation", df.stat.corr('rand1', 'rand2'))

# Save as a single-partition CSV with a header (per the NOTE above)
stat_df = spark.createDataFrame([stat1, stat2])
stat_df.coalesce(1).write.csv("Result", header=True)
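An optional sanity check (assuming the Result folder was written by the run above): read the CSV back with its header and confirm both statistics are present.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Verify Result").getOrCreate()

# header=True picks up the Stats/Value column names written above
result_df = spark.read.csv("Result", header=True)
result_df.show()  # expected: one Co-variance row and one Correlation row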
More operations in PySpark
In this hands-on, you will start working on PySpark basics. Follow the instructions below to complete the hands-on:
STEPS:
Step 1: Import the SparkSession package.
Step 2: Create a SparkSession object.
Step 3: Create a DataFrame with the following details under the headers "ID", "Name", "Age", and "Area of Interest".
Step 4: Fill the DataFrame with the following data:
- "1", "Jack", 22, "Data Science"
- "2", "Luke", 21, "Data Analytics"
- "3", "Leo", 24, "Micro Services"
- "4", "Mark", 21, "Data Analytics"
Step 5: Use the describe method on the Age column, observe the statistical parameters, and save the data into a parquet folder named "Age" inside /projects/challenge/.
Step 6: Select the columns ID, Name, and Age, with Name in descending order. Save the result into a parquet folder named "NameSorted" inside /projects/challenge/.
NOTE:
1. Use coalesce to store the DataFrame as a single partition.
2. Be sure to use the exact naming convention for the result folders.
Steps to complete the hands-on:
1. Run your solution using the RUN option from the Project menu.
2. Run the tests using the TEST option from the Project menu.
Git Instructions
Use the following commands to work with this project
spark-submit answer.py
spark-submit score.py
bash install.sh
Solution
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("Example App").getOrCreate()

# Define the Employee row type and the four records
Employee = Row("ID", "Name", "Age", "Area of Interest")
emp1 = Employee("1", "Jack", 22, "Data Science")
emp2 = Employee("2", "Luke", 21, "Data Analytics")
emp3 = Employee("3", "Leo", 24, "Micro Services")
emp4 = Employee("4", "Mark", 21, "Data Analytics")
empData = [emp1, emp2, emp3, emp4]
df = spark.createDataFrame(empData)

# Summary statistics (count, mean, stddev, min, max) for the Age column
df.describe("Age").coalesce(1).write.parquet("Age")

# ID, Name, and Age only, with Name in descending order (per Step 6)
df.select("ID", "Name", "Age").sort("Name", ascending=False).coalesce(1).write.parquet("NameSorted")