Parquet vs. CSV: A Comparison of File Formats for Data Storage with Experiment

Ahmet Emre Usta
5 min read · Jan 26, 2023


In today’s world, we are constantly generating and collecting large amounts of data. From social media to e-commerce, data are generated at an unprecedented rate. As a result, storing, managing, and analyzing this data has become increasingly important. One of the main challenges in this process is finding the right file format for storing data. In this blog post, we discuss the advantages of using the Parquet file format over the commonly used CSV file format.

First, let’s take a look at what a CSV file is. CSV stands for comma-separated values and is a simple file format used to store data in tabular form, where each row represents a record and each column represents a field within that record. The values within a CSV file are separated by commas, which makes the format easy to read and write. Owing to their simplicity, CSV files are widely used to transfer data between different systems and can be opened and edited using a wide variety of software, including spreadsheet programs such as Microsoft Excel and Google Sheets, as well as text editors such as Notepad.
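The row-and-field structure described above can be seen with Python’s standard `csv` module (the names and values here are made up for illustration):

```python
import csv
import io

# A tiny CSV document: each line is a record, each
# comma-separated value within a line is a field.
raw = "name,city\nAda,London\nEmre,Istanbul\n"

# DictReader maps each row to {column_name: value}.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["city"])  # London
```

Because the format is just delimited text, any tool that can read plain text can inspect it.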

Photo by Krzysztof Kowalik on Unsplash

Now, let us consider what a parquet file is. Parquet is a columnar storage file format that is optimized for use with big data processing frameworks, such as Apache Hadoop and Apache Spark. It is designed to be highly efficient for both reading and writing large datasets, and supports a wide range of data types, including primitive types such as integers and strings, as well as more complex types such as nested structures and lists.

Apache Parquet

Advantages

One of the main advantages of using Parquet over CSV is its columnar storage format. Columnar storage is more efficient for read-heavy workloads and analytical queries. Because the data are stored in columns, they can be read and processed much faster than row-based formats, such as CSV. This is particularly beneficial for large datasets and use cases that involve running analytical queries on the data.

Another advantage of Parquet is its built-in support for compression and encoding. Parquet supports various compression and encoding algorithms, which can greatly reduce the amount of disk space required to store data and improve I/O performance. This is particularly beneficial when working with large datasets, as it can significantly reduce the required storage space and improve query performance.

Parquet also supports predicate pushdown, which allows filtering of large datasets before they are read into memory. This can improve query performance and reduce the amount of memory required to process the data. In addition, Parquet supports schema evolution, which means that the schema of the data can change over time without requiring a full rewrite of the data. This makes it more flexible than other file formats, and allows for easy integration with systems that require frequent updates to the data schema.

Disadvantages

However, using Parquet instead of CSV also has some disadvantages. One of the main ones is complexity: Parquet is a more complex file format than CSV and may be harder to use for some users, especially those without experience working with big data or columnar storage formats. Additionally, while Parquet is optimized for read-heavy workloads, its write performance may not match that of row-based formats such as CSV. This makes it less suitable for use cases that involve frequently updating or appending data.

Another disadvantage of Parquet is that it is not designed for real-time data processing. If low latency is a requirement for your use case, Parquet may not be the best choice. Additionally, Parquet is not well suited for streaming data and may not work well with streaming data processing systems such as Apache Kafka. Furthermore, Parquet files are not human-readable the way CSV files are and require specialized software to open and edit.

Photo by Mika Baumeister on Unsplash

Experiment

Now for the experiment that gave me the idea for this blog post. I started a project based on the BRFSS data published annually by the CDC. BRFSS data are shared in ASCII format. The zipped file is approximately 40 MB; after unzipping, the raw ASCII file is 870 MB. The SAS code required to parse the data is available on the data site. After downloading the code, I parsed the raw data into columns using SAS Studio and saved the result in the sas7bdat format, which is not used frequently. This conversion process is explained in another blog post.

To process and store this data easily, it is necessary to save it in more familiar file formats after reading it in Python using the sas7bdat library. I am normally used to working with CSV files, but owing to the size of the data, I looked for an alternative to CSV. I wanted to give Parquet a chance, since I had heard it was good at file compression. When I saved the 1 GB sas7bdat file as CSV using pandas, it took up 376 MB; when I saved it in Parquet format, it took up only 34 MB.

The size difference is more than ten times.

That is a huge advantage. The benefits during reading and writing held up after storage as well. Because we will be working on the data as a team, the small size is a big plus for us. I am attaching the notebook file I completed so you can examine it in more detail.

parquet Notebook

Conclusion

Although Parquet has many advantages over CSV, it is not the best choice for every use case. CSV is a simple and widely used file format that is well suited for small- to medium-sized datasets and for use cases that involve frequent updates or appends to the data. On the other hand, Parquet is a more complex file format that is optimized for big data processing and for use cases that involve running analytical queries on large datasets. It is important to consider the specific requirements of your use case before choosing between the two file formats.
