In this post I will explain how to read a CSV file using Spark and Scala. Spark provides spark.read.csv("path") and spark.read.format("csv").load("path") for this.
In this tutorial, you will learn how to read a single file, multiple files, and all the files in a directory into a DataFrame using Scala.
In this example I have a mock_data_1.csv file with the column headers id, first_name, last_name, email, gender:
id,first_name,last_name,email,gender
1,Rhea,Harling,rharling0@telegraph.co.uk,Female
2,Hewett,Clee,hclee1@blogger.com,Male
3,Orrin,Wickenden,owickenden2@uol.com.br,Male
4,Winifield,Le Surf,wlesurf3@pcworld.com,Male
5,Gerald,Buche,gbuche4@pbs.org,Male
1. Create a spark session
To create a SparkSession in Scala, use the builder pattern: call builder() and then getOrCreate().

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CSV file read")
  .master("local[*]")
  .getOrCreate()
2. Read CSV file into a DataFrame using the csv function
Using the spark instance created in the section above:

val df_csv = spark.read.csv("D:\\SampleData\\IPL\\matches.csv")
df_csv.printSchema()
3. Read CSV file into a DataFrame using the format function
Spark provides a format(String) function which can be used to read a CSV file by passing "csv" as the argument:
val df = spark.read.format("csv")
.load("D:\\SampleData\\MOCK_DATA.csv")
df.show()
Output:
+---+----------+---------+--------------------+------+
|_c0|       _c1|      _c2|                 _c3|   _c4|
+---+----------+---------+--------------------+------+
| id|first_name|last_name|               email|gender|
|  1|      Rhea|  Harling|rharling0@telegra...|Female|
|  2|    Hewett|     Clee|  hclee1@blogger.com|  Male|
|  3|     Orrin|Wickenden|owickenden2@uol.c...|  Male|
|  4| Winifield|  Le Surf|wlesurf3@pcworld.com|  Male|
|  5|    Gerald|    Buche|     gbuche4@pbs.org|  Male|
+---+----------+---------+--------------------+------+
As we can see, when we read a CSV file with a header directly, Spark treats the header row as ordinary data and assigns its own column names: _c0, _c1, and so on. To avoid this, and actually read the header from the CSV file, we need to pass an option while reading it.
4. Read csv with header
To read a CSV file with a header, set the header option to true while reading the file.
//Using format function
val df = spark.read.format("csv")
.option("header", "true")
.load("D:\\SampleData\\mock_data\\mock_data_1.csv")
df.show()
// Using csv function to read the csv file
val df_csv = spark.read
.option("header","true")
.csv("D:\\SampleData\\mock_data\\mock_data_2.csv")
df_csv.show()
5. Read multiple CSV files
Using spark.read.csv("path") or spark.read.format("csv").load("path") you can also read multiple CSV files: just pass all the file paths, separated by commas, for example:
// Read multiple csv file
val df_m = spark.read.format("csv")
.option("header", "true")
.load("D:\\SampleData\\mock_data\\mock_data_1.csv","D:\\SampleData\\mock_data\\mock_data_2.csv")
df_m.show()
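The same multi-file read as a self-contained sketch: it first writes two small temp CSV files (hypothetical stand-ins for the mock_data files above), then passes both paths to load().

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-file read")
  .master("local[*]")
  .getOrCreate()

// Write two tiny CSV files sharing the same header (hypothetical temp paths)
val f1 = Files.createTempFile("mock_data_1", ".csv")
val f2 = Files.createTempFile("mock_data_2", ".csv")
Files.write(f1, "id,first_name\n1,Rhea\n2,Hewett\n".getBytes("UTF-8"))
Files.write(f2, "id,first_name\n3,Orrin\n4,Winifield\n".getBytes("UTF-8"))

// Pass all paths to load(); the rows from every file are combined
val df_m = spark.read.format("csv")
  .option("header", "true")
  .load(f1.toString, f2.toString)

df_m.show()
```

Note that load() accepts a variable number of path arguments, so any number of files can be listed this way.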
6. Read multiple files in a directory
Using spark.read.csv("path") or spark.read.format("csv").load("path") you can also read every CSV file in a directory: just pass the directory location instead of a file path.
// Read multiple csv file in a directory
val df_d = spark.read.format("csv")
.option("header", "true")
.load("D:\\SampleData\\mock_data\\")
df_d.show()
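A self-contained sketch of the directory read, using a temp directory containing two hypothetical CSV files in place of the Windows path above:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("directory read")
  .master("local[*]")
  .getOrCreate()

// Create a temp directory holding two CSV files with a common schema
val dir = Files.createTempDirectory("mock_data")
Files.write(dir.resolve("part1.csv"), "id,first_name\n1,Rhea\n".getBytes("UTF-8"))
Files.write(dir.resolve("part2.csv"), "id,first_name\n2,Hewett\n".getBytes("UTF-8"))

// Point load() at the directory; Spark reads every file inside it
val df_d = spark.read.format("csv")
  .option("header", "true")
  .load(dir.toString)

df_d.show()
```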
Options while reading a CSV file
Spark provides the option/options functions while reading a file. Use option to pass a single key-value property, or options to pass a Map of properties.
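To make the difference concrete, here is a small sketch (using a hypothetical temp file) that reads the same file once with chained option calls and once with a single options map; the two forms are equivalent:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("option vs options")
  .master("local[*]")
  .getOrCreate()

val path = Files.createTempFile("mock_data", ".csv")
Files.write(path, "id,first_name\n1,Rhea\n".getBytes("UTF-8"))

// option(key, value): one property per call, chained
val viaOption = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path.toString)

// options(Map(...)): all properties in a single map
val viaOptions = spark.read
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .csv(path.toString)
```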
header:
Set it to true if the file contains a header row. Its default value is false.
val df2 = spark.read.options(Map("header"->"true"))
.format("csv")
.load("src/main/resources/mock_data_1.csv")
val df3 = spark.read.options(Map("header"->"true"))
.csv("src/main/resources/mock_data_1.csv")
inferSchema:
inferSchema tells Spark to infer each column's type from the data. Its default value is false.
val df2 = spark.read.options(Map(
"inferSchema"->"true","header"->"true"))
.format("csv")
.load("src/main/resources/mock_data_1.csv")
val df3 = spark.read.options(Map(
"inferSchema" -> "true","header"->"true"))
.csv("src/main/resources/mock_data_1.csv")
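A quick way to see what inferSchema does is to read the same file with and without it and compare the type of the id column (a sketch using a hypothetical temp file):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType}

val spark = SparkSession.builder()
  .appName("inferSchema demo")
  .master("local[*]")
  .getOrCreate()

val path = Files.createTempFile("mock_data", ".csv")
Files.write(path, "id,first_name\n1,Rhea\n2,Hewett\n".getBytes("UTF-8"))

// Without inferSchema, every column is read as a string
val asStrings = spark.read
  .option("header", "true")
  .csv(path.toString)

// With inferSchema, Spark scans the data and picks a concrete type
val typed = spark.read
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .csv(path.toString)

asStrings.printSchema()  // id comes back as string
typed.printSchema()      // id comes back as integer
```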
delimiter:
The delimiter option is used to specify the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to pipe (|), tab, space, hash (#), or any other character using this option.
val df2 = spark.read.options(Map("inferSchema"->"true",
"delimiter"->",",
"header"->"true"))
.format("csv")
.load("src/main/resources/mock_data_1.csv")
val df3 = spark.read.options(Map("inferSchema" -> "true",
"delimiter" -> ",",
"header"->"true"))
.csv("src/main/resources/mock_data_1.csv")
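As a sketch, here is the same idea with a pipe-delimited file (written to a hypothetical temp path):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delimiter demo")
  .master("local[*]")
  .getOrCreate()

// A pipe-delimited file instead of the default comma
val path = Files.createTempFile("mock_pipe", ".csv")
Files.write(path, "id|first_name\n1|Rhea\n2|Hewett\n".getBytes("UTF-8"))

// Tell Spark to split columns on | instead of ,
val dfPipe = spark.read
  .options(Map("header" -> "true", "delimiter" -> "|"))
  .csv(path.toString)

dfPipe.show()
```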
nullValue:
The nullValue option can be used to specify a string in the CSV file that should be treated as null.
For example, in our mock data, if I want to treat the string Male as null, it would look something like this:
val df4 = spark.read.options(Map("inferSchema" -> "true","delimiter" -> ",",
"header" -> "true","nullValue" -> "Male"))
.csv("src/main/resources/mock_data_1.csv")
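A self-contained sketch of the same idea: every cell equal to "Male" comes back as null, which we can check with a filter (the temp file path is hypothetical):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nullValue demo")
  .master("local[*]")
  .getOrCreate()

val path = Files.createTempFile("mock_data", ".csv")
Files.write(path,
  "id,first_name,gender\n1,Rhea,Female\n2,Hewett,Male\n3,Orrin,Male\n"
    .getBytes("UTF-8"))

// Any cell matching the nullValue string is loaded as null
val df4 = spark.read
  .options(Map("header" -> "true", "nullValue" -> "Male"))
  .csv(path.toString)

df4.show()
```

Note that nullValue applies to every column, not just gender, so use it with care if the same string can legitimately appear elsewhere.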