Class ParquetFileSplitter

  • All Implemented Interfaces:
    FileSplitter

    public class ParquetFileSplitter
    extends Object
    implements FileSplitter
    A class that splits a Parquet file into chunks of a given size. Each chunk covers one or more row groups and records their start and end positions in the file.
    • Constructor Detail

      • ParquetFileSplitter

        public ParquetFileSplitter(org.apache.parquet.io.InputFile file)
        Creates a new instance of ParquetFileSplitter that splits the file using the default chunk size of 64 MB.
        Parameters:
        file - a Parquet file
      • ParquetFileSplitter

        public ParquetFileSplitter(org.apache.parquet.io.InputFile file,
                                   long chunkSize)
        Creates a new instance of ParquetFileSplitter that splits the file into chunks of the given size.
        Parameters:
        file - a Parquet file
        chunkSize - a chunk size in bytes
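
Because chunkSize is given in bytes, it can help to compute it from a more readable unit. The sketch below is illustrative only: it assumes that the 64 MB default means 64 × 1024 × 1024 bytes, and the names ChunkSizeSketch, megabytes and estimatedChunks are hypothetical helpers, not part of this API. The estimate is an upper-level guide; the real number of chunks depends on where row group boundaries fall.

```java
public class ChunkSizeSketch {

    // Assumption: "64MB" is read as 64 * 1024 * 1024 bytes.
    static final long DEFAULT_CHUNK_SIZE = 64L * 1024 * 1024;

    // Hypothetical helper: converts megabytes to bytes for the chunkSize argument.
    static long megabytes(long mb) {
        return mb * 1024L * 1024L;
    }

    // Hypothetical helper: rough chunk count for a file, rounding up.
    static long estimatedChunks(long fileSizeBytes, long chunkSizeBytes) {
        return (fileSizeBytes + chunkSizeBytes - 1) / chunkSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(DEFAULT_CHUNK_SIZE);                       // 67108864
        System.out.println(megabytes(128));                           // 134217728
        System.out.println(estimatedChunks(1L << 30, megabytes(64))); // 16
    }
}
```

A value such as megabytes(128) could then be passed as the chunkSize argument of the two-argument constructor.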
    • Method Detail

      • getRowGroupSplits

        protected List<ChunkInterval> getRowGroupSplits(List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups)
        Computes the row group splits for the given row groups of the file.
        Parameters:
        rowGroups - a list of file row groups
        Returns:
        a list of ChunkIntervals
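
The grouping of row groups into chunks can be pictured with a minimal sketch. This is not the library's implementation: it assumes a greedy strategy that keeps adding consecutive row groups to a chunk until the chunk spans at least chunkSize bytes. The RowGroup and Interval classes are hypothetical stand-ins for BlockMetaData and ChunkInterval, whose actual fields are not described here.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupSplitSketch {

    /** Hypothetical stand-in for a row group: byte offset and length in the file. */
    static final class RowGroup {
        final long start;
        final long length;
        RowGroup(long start, long length) { this.start = start; this.length = length; }
    }

    /** Hypothetical stand-in for ChunkInterval: byte positions [start, end). */
    static final class Interval {
        final long start;
        final long end;
        Interval(long start, long end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    /** Assumed greedy grouping: close a chunk once it spans at least chunkSize bytes. */
    static List<Interval> split(List<RowGroup> rowGroups, long chunkSize) {
        List<Interval> intervals = new ArrayList<>();
        long chunkStart = -1; // -1 means no chunk is currently open
        long chunkEnd = -1;
        for (RowGroup rg : rowGroups) {
            if (chunkStart < 0) {
                chunkStart = rg.start; // open a new chunk at this row group
            }
            chunkEnd = rg.start + rg.length;
            if (chunkEnd - chunkStart >= chunkSize) {
                intervals.add(new Interval(chunkStart, chunkEnd));
                chunkStart = -1;
            }
        }
        if (chunkStart >= 0) {
            // Trailing row groups that did not fill a whole chunk
            intervals.add(new Interval(chunkStart, chunkEnd));
        }
        return intervals;
    }

    public static void main(String[] args) {
        List<RowGroup> rowGroups = new ArrayList<>();
        rowGroups.add(new RowGroup(4, 40));
        rowGroups.add(new RowGroup(44, 40));
        rowGroups.add(new RowGroup(84, 40));
        System.out.println(split(rowGroups, 64)); // [[4, 84), [84, 124)]
    }
}
```

Under this assumption every chunk holds at least one whole row group, so a row group is never cut in the middle, and a single row group larger than chunkSize forms a chunk of its own.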