
Feedback wanted: fetching large data sets

by Marius Bancila

We know that some of you need to fetch large sets of data (tens or hundreds of thousands of records, or possibly more). An example would be fetching, in a single request, orders together with their order lines, customer and product information, and other related data. There are several problems with this approach.

We are evaluating options for making it possible to fetch large sets of data without these issues. We would like to hear your feedback on the matter.


Due to time constraints, the operation cannot be executed synchronously, so such an API must be asynchronous. With webhooks becoming available in the near future, you will be able to be notified when the operation completes, without having to poll periodically.
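
To make the flow more concrete, here is a rough client-side sketch in Python of starting such an export and waiting for it to finish. The endpoint names, request payloads, and response fields are placeholders for illustration only, not a final API:

```python
import time
import requests

BASE_URL = "https://api.example.com"   # placeholder base URL
HEADERS = {"Authorization": "Bearer <your access token>"}

# 1. Start the export job (hypothetical endpoint and payload).
start = requests.post(
    f"{BASE_URL}/exports",
    json={"resource": "orders", "include": ["lines", "customer", "product"]},
    headers=HEADERS,
)
start.raise_for_status()
job_id = start.json()["jobId"]

# 2. Poll until the job finishes. Once webhooks are available, this loop
#    would be replaced by a notification sent to your own endpoint.
while True:
    status = requests.get(f"{BASE_URL}/exports/{job_id}", headers=HEADERS)
    status.raise_for_status()
    body = status.json()
    if body["status"] in ("completed", "failed"):
        break
    time.sleep(30)  # back off between polls

print(body)  # expected to contain the download URI when completed
```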


The result could be made available as a file stored on a file server for a limited period of time. You would receive the URI back and could then download the file directly. The data format is open for discussion: it could be the JSON returned from GraphQL or something else, such as CSV.
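
As an illustration only, downloading such a file from the returned URI could look roughly like this; the URI, file name, and JSON structure below are assumptions made for the sketch:

```python
import json
import requests

# Placeholder URI; in practice this would be the URI returned by the export job.
download_uri = "https://files.example.com/exports/12345/orders.json"

# Stream the file to disk so a very large result never has to fit in memory at once.
with requests.get(download_uri, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("orders.json", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)

# If the format ends up being JSON (e.g. the GraphQL result), it can then be parsed;
# the "orders" key is only an assumption about the structure.
with open("orders.json", encoding="utf-8") as f:
    data = json.load(f)
print(len(data.get("orders", [])), "orders downloaded")
```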


Let us know what your requirements and expectations are for large data set scenarios.

1 REPLY

by Florian Haase

From ChatGPT:

  1. CSV (Comma-Separated Values): CSV is a simple and widely supported file format that stores tabular data. It is human-readable and can be easily imported into various applications like spreadsheets and databases. However, CSV files can be large and may not be the most efficient option for very large datasets.

  2. JSON (JavaScript Object Notation): JSON is a lightweight and flexible format for representing structured data. It is commonly used for web APIs and data interchange. JSON files are human-readable, widely supported, and can handle complex nested data structures. However, they can also be large and may not be the most space-efficient format.

  3. Parquet: Parquet is a columnar storage file format designed for big data processing frameworks like Apache Hadoop and Apache Spark. It offers efficient compression and encoding, which can significantly reduce the file size and improve query performance, especially for analytics workloads. Parquet files are optimized for column-wise data access and are suitable for large datasets.

  4. Avro: Avro is a compact, efficient, and self-describing binary file format developed by Apache. It supports schema evolution, allowing you to add, remove, or modify fields without breaking compatibility with existing data. Avro files are suitable for large datasets and are commonly used in big data processing pipelines.

  5. HDF5 (Hierarchical Data Format): HDF5 is a versatile file format designed for storing and organizing large and complex datasets. It supports hierarchical data structures, compression, and efficient chunking for accessing subsets of data. HDF5 files are commonly used in scientific and numerical applications.

  6. Apache Arrow: Apache Arrow is an in-memory data format and columnar storage system that aims to provide a common data representation for different programming languages and systems. It is designed for high-performance analytics and can facilitate efficient data exchange between different frameworks and tools.
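
For example, if the export were delivered as JSON, converting it to Parquet for local analysis could look roughly like this (assuming pandas and pyarrow are installed; the file and column names are just placeholders):

```python
import pandas as pd  # assumes pandas and pyarrow are installed

# Flatten the downloaded JSON export (placeholder file name) into a table
# and write it as compressed Parquet to compare size and read speed.
df = pd.read_json("orders.json")  # or pd.json_normalize(...) for nested records
df.to_parquet("orders.parquet", compression="snappy")

# Reading back only the columns you need is where the columnar format pays off.
# The column names are just placeholders.
subset = pd.read_parquet("orders.parquet", columns=["orderNumber", "amount"])
print(subset.head())
```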


I'm not sure what the best option would be here, but isn't JSON both more stable (e.g. regarding CRLF, CR, special characters and so on) and easier to handle (via JsonConverters) than CSV?
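
A small sketch of what I mean (the values are made up): a field containing CRLF, a comma and special characters is escaped automatically in JSON, while CSV only survives it with strict quoting on both the writer and the reader side:

```python
import csv
import io
import json

# A value containing CRLF, a comma and Scandinavian characters - the kind of
# content that tends to break naive CSV handling but is unproblematic in JSON.
record = {"orderNumber": 1001, "description": "First line\r\nSecond, line with æøå"}

# JSON escapes the control characters automatically.
print(json.dumps(record, ensure_ascii=False))

# CSV only survives the same value if both the writer and the reader apply
# strict quoting.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys(), quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```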


But I'm not sure this is actually a big data problem. Have there been many threads here about timeouts when handling many pages? Or is this a question about quota limitations?


Florian