In this article, we will explore techniques for determining table sizes with the Spark Catalog API, without scanning the entire dataset.
Understanding table sizes is critical for optimizing query performance and resource allocation. While row counts provide an initial reference point, estimating sizes in bytes, kilobytes, megabytes, gigabytes, or terabytes enables informed decision-making.
Starting from version 2.0, Spark supports the Catalog API, which includes useful methods such as listing the tables and databases in a session. Additionally, you can calculate a table's size by summing up the individual file sizes within its underlying directory, or by reading the estimate exposed through queryExecution.analyzed.stats.
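If you prefer the file-size approach, Hadoop's FileSystem API can total the files under the table's storage directory without reading their contents. The snippet below is a minimal sketch: the warehouse path is an assumption for illustration and should be replaced with your table's actual location.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object GetDirectorySize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Directory Size").getOrCreate()

    // Assumed location for illustration; point this at your table's directory.
    val tablePath = new Path("/user/hive/warehouse/sampledb.db/country")

    // getContentSummary walks the directory and returns the combined length
    // of all files in bytes, without reading the data itself.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val sizeInBytes = fs.getContentSummary(tablePath).getLength
    println("country: " + (sizeInBytes / 1000) + " Kb")
  }
}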
Scala Code Example
I already have a database named sampledb, so I will use it for this example. You can change the database name to suit your requirements.
import org.apache.spark.sql.SparkSession

object GetTableSize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().appName("Table Size")
      .getOrCreate()

    val db_name = "sampledb"

    // List every table registered in the database through the Catalog API.
    val tb_list = spark.catalog.listTables(db_name).select("name")
    tb_list.show()

    // For each table, read the size estimate from the analyzed plan's statistics
    // (no full scan is triggered) and convert bytes to kilobytes.
    for (row <- tb_list.collect()) {
      val tableName = row(0).toString
      val sizeInBytes = spark.read.table(db_name + "." + tableName)
        .queryExecution.analyzed.stats.sizeInBytes
      println(tableName + ": " + (sizeInBytes / 1000) + " Kb")
    }
  }
}
Output
country: 2204 Kb
mockdata: 51 Kb
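Because stats.sizeInBytes is reported in bytes, it is often convenient to convert the figure into kilobytes, megabytes, gigabytes, or terabytes before printing it. The helper below is a small sketch of one way to do that; the formatSize name and the 1024 base are my own choices, not part of the Spark API.

import org.apache.spark.sql.SparkSession

object ReadableTableSize {
  // Hypothetical helper: converts a byte count to a human-readable unit (1024 base).
  def formatSize(bytes: BigInt): String = {
    val units = Seq("bytes", "Kb", "Mb", "Gb", "Tb")
    var size = bytes.toDouble
    var unit = 0
    while (size >= 1024 && unit < units.length - 1) {
      size /= 1024
      unit += 1
    }
    f"$size%.2f ${units(unit)}"
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Readable Table Size").getOrCreate()
    val sizeInBytes = spark.read.table("sampledb.country")
      .queryExecution.analyzed.stats.sizeInBytes
    println("country: " + formatSize(sizeInBytes))
  }
}

If the reported estimate looks stale, running ANALYZE TABLE sampledb.country COMPUTE STATISTICS can refresh the catalog statistics that the estimate draws on.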