How to Read a CSV File Using Spark DataFrame

In this post I will explain how to read a CSV file using Spark and Scala. Spark provides spark.read.csv("path") and spark.read.format("csv").load("path") for this.

In this tutorial, you will learn how to read a single file, multiple files, and all files in a directory into a DataFrame using Scala.

In this example I have a mock_data_1.csv file with the column headers id, first_name, last_name, email, gender:

id,first_name,last_name,email,gender
1,Rhea,Harling,rharling0@telegraph.co.uk,Female
2,Hewett,Clee,hclee1@blogger.com,Male
3,Orrin,Wickenden,owickenden2@uol.com.br,Male
4,Winifield,Le Surf,wlesurf3@pcworld.com,Male
5,Gerald,Buche,gbuche4@pbs.org,Male

1. Create a SparkSession

To create a SparkSession in Scala, use the builder() method and call getOrCreate() on it.
import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession running locally on all available cores
val spark = SparkSession.builder()
.appName("CSV file read")
.master("local[*]")
.getOrCreate()

2. Read CSV file into a DataFrame using the csv function

Using the spark instance created in the section above:
val df_csv = spark.read.csv("D:\\SampleData\\mock_data\\mock_data_1.csv")
df_csv.printSchema()
Output:
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)

3. Read CSV file into a DataFrame using the format function

Spark provides a format(String) function which can be used to read a CSV file by passing "csv" as the argument.

val df = spark.read.format("csv")
.load("D:\\SampleData\\MOCK_DATA.csv")
df.show()
Output:
+---+----------+---------+--------------------+------+
|_c0|       _c1|      _c2|                 _c3|   _c4|
+---+----------+---------+--------------------+------+
| id|first_name|last_name|               email|gender|
|  1|      Rhea|  Harling|rharling0@telegra...|Female|
|  2|    Hewett|     Clee|  hclee1@blogger.com|  Male|
|  3|     Orrin|Wickenden|owickenden2@uol.c...|  Male|
|  4| Winifield|  Le Surf|wlesurf3@pcworld.com|  Male|
|  5|    Gerald|    Buche|     gbuche4@pbs.org|  Male|
+---+----------+---------+--------------------+------+

As we can see, if we read a CSV file that has a header without passing any options, Spark treats the header row as ordinary data and assigns its own column names: _c0, _c1, and so on. To pick up the column names from the header row, we need to pass an option while reading the CSV, which the next section shows.
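
As an aside, if a file genuinely has no header row, you could also assign column names yourself. A minimal sketch using toDF, with a hypothetical no_header_data.csv (this renaming step is my addition, not from the original post):

// For a file without a header row, assign column names manually via toDF
// (no_header_data.csv is a hypothetical, header-less file)
val named = spark.read
.csv("D:\\SampleData\\mock_data\\no_header_data.csv")
.toDF("id", "first_name", "last_name", "email", "gender")
named.printSchema()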

4. Read CSV file with header

To read a CSV file with a header, we need to set the header option to true while reading the file.

//Using format function
val df = spark.read.format("csv")
.option("header", "true")
.load("D:\\SampleData\\mock_data\\mock_data_1.csv")
df.show()

// Using csv function to read the csv file
val df_csv = spark.read
.option("header","true")
.csv("D:\\SampleData\\mock_data\\mock_data_2.csv")
df_csv.show()
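
With the header option enabled, Spark picks up the column names from the file's header row; for mock_data_1.csv, df.show() should produce something like:

+--+----------+---------+--------------------+------+
|id|first_name|last_name|               email|gender|
+--+----------+---------+--------------------+------+
| 1|      Rhea|  Harling|rharling0@telegra...|Female|
| 2|    Hewett|     Clee|  hclee1@blogger.com|  Male|
| 3|     Orrin|Wickenden|owickenden2@uol.c...|  Male|
| 4| Winifield|  Le Surf|wlesurf3@pcworld.com|  Male|
| 5|    Gerald|    Buche|     gbuche4@pbs.org|  Male|
+--+----------+---------+--------------------+------+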

5. Read multiple CSV files

Using spark.read.csv("path") or spark.read.format("csv").load("path") you can also read multiple CSV files at once; just pass all the file names, separated by commas, as the path argument. For example:

// Read multiple csv file
val df_m = spark.read.format("csv")
.option("header", "true")
.load("D:\\SampleData\\mock_data\\mock_data_1.csv","D:\\SampleData\\mock_data\\mock_data_2.csv")
df_m.show()
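
Both load() and csv() accept varargs, so if the paths are already in a Scala collection you can expand it with : _*. A small sketch (the paths Seq is illustrative):

// Expand a Seq of paths into the varargs csv(...) call
val paths = Seq(
"D:\\SampleData\\mock_data\\mock_data_1.csv",
"D:\\SampleData\\mock_data\\mock_data_2.csv")
val df_seq = spark.read
.option("header", "true")
.csv(paths: _*)
df_seq.show()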

6. Read all files in a directory

Using spark.read.csv("path") or spark.read.format("csv").load("path") you can also read all the CSV files in a directory; just pass the directory location instead of a file path.

// Read multiple csv file in a directory
val df_d = spark.read.format("csv")
.option("header", "true")
.load("D:\\SampleData\\mock_data\\")

Options while reading a CSV file

Spark provides the option() and options() methods for configuring the read. Use option() to pass a single key-value pair, or options() to pass a Map of properties.

header:

Set this to true if the file contains a header row. Its default value is false.

val df2 = spark.read.options(Map("header"->"true"))
.format("csv")
.load("src/main/resources/mock_data_1.csv")

val df3 = spark.read.options(Map("header"->"true"))
.csv("src/main/resources/mock_data_1.csv")

inferSchema:

inferSchema tells Spark to infer the type of each column from the data instead of reading everything as strings. Its default value is false.

val df2 = spark.read.options(Map(
"inferSchema"->"true","header"->"true"))
.format("csv")
.load("src/main/resources/mock_data_1.csv")

val df3 = spark.read.options(Map(
"inferSchema" -> "true","header"->"true"))
.csv("src/main/resources/mock_data_1.csv")

delimiter:

The delimiter option specifies the column separator of the CSV file. By default it is the comma (,) character, but it can be set to pipe (|), tab, space, hash (#), or any other character.

val df2 = spark.read.options(Map("inferSchema"->"true",
"delimiter"->",",
"header"->"true"))
.format("csv")
.load("src/main/resources/mock_data_1.csv")

val df3 = spark.read.options(Map("inferSchema" -> "true",
"delimiter" -> ",",
"header"->"true"))
.csv("src/main/resources/mock_data_1.csv")

nullValue:

The nullValue option specifies a string in the CSV file that should be treated as null.

For example, in our mock data, if I want the string "Male" to be read as null, it would look something like this:

val df4 = spark.read.options(Map("inferSchema" -> "true","delimiter" -> ",",
"header" -> "true","nullValue" -> "Male"))
.csv("src/main/resources/mock_data_1.csv")
