Parquet.Net 4.1.3 vs ParquetSharp 8.0.0 Performance
The official ParquetSharp docs claim it is about 8 times faster than Parquet.Net. That is plausible on its face, as ParquetSharp is a thin wrapper around the Apache Parquet C++ implementation. But I wanted to check for myself.
Benchmarking was done properly (code below): the data was prepared before the benchmarks ran, and both libraries had an identical job to do - read and write files with different data types.
Tested on a Windows 11 rig (Surface Book 3, i7). Here are the raw numbers.
Results
Method | DataType | DataSize | Mode | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|
ParquetNet | float | 10 | read | 73.88 μs | 15.766 μs | 0.864 μs | 2.8076 | - | - | 11.49 KB |
ParquetSharp | float | 10 | read | 100.49 μs | 20.178 μs | 1.106 μs | 9.0332 | 1.7090 | - | 37.23 KB |
ParquetNet | float | 10 | write | 396.30 μs | 126.413 μs | 6.929 μs | 2.4414 | - | - | 11.63 KB |
ParquetSharp | float | 10 | write | 491.33 μs | 259.184 μs | 14.207 μs | 4.8828 | - | - | 20.63 KB |
ParquetNet | float | 100 | read | 75.12 μs | 17.052 μs | 0.935 μs | 2.9297 | - | - | 12.19 KB |
ParquetSharp | float | 100 | read | 100.42 μs | 4.586 μs | 0.251 μs | 9.1553 | 1.8311 | - | 37.58 KB |
ParquetNet | float | 100 | write | 534.55 μs | 566.419 μs | 31.047 μs | 3.4180 | - | - | 14 KB |
ParquetSharp | float | 100 | write | 624.27 μs | 542.110 μs | 29.715 μs | 4.8828 | - | - | 20.63 KB |
ParquetNet | float | 1000 | read | 173.66 μs | 14.614 μs | 0.801 μs | 5.6152 | - | - | 22.15 KB |
ParquetSharp | float | 1000 | read | 101.96 μs | 9.078 μs | 0.498 μs | 10.0098 | 1.9531 | - | 41.09 KB |
ParquetNet | float | 1000 | write | 576.80 μs | 70.744 μs | 3.878 μs | 9.7656 | - | - | 40.15 KB |
ParquetSharp | float | 1000 | write | 592.69 μs | 562.047 μs | 30.808 μs | 4.8828 | - | - | 20.63 KB |
ParquetNet | float | 1000000 | read | 12,020.33 μs | 23,014.402 μs | 1,261.497 μs | 156.2500 | 156.2500 | 156.2500 | 7827.37 KB |
ParquetSharp | float | 1000000 | read | 2,263.07 μs | 935.326 μs | 51.268 μs | 125.0000 | 117.1875 | 117.1875 | 3944.02 KB |
ParquetNet | float | 1000000 | write | 40,940.85 μs | 26,413.129 μs | 1,447.793 μs | 692.3077 | 692.3077 | 692.3077 | 30277.95 KB |
ParquetSharp | float | 1000000 | write | 45,095.70 μs | 135,818.351 μs | 7,444.662 μs | - | - | - | 20.75 KB |
ParquetNet | int | 10 | read | 90.33 μs | 163.148 μs | 8.943 μs | 2.6855 | - | - | 11.49 KB |
ParquetSharp | int | 10 | read | 101.88 μs | 22.113 μs | 1.212 μs | 9.1553 | 1.8311 | - | 37.6 KB |
ParquetNet | int | 10 | write | 525.79 μs | 65.553 μs | 3.593 μs | 2.4414 | - | - | 11.63 KB |
ParquetSharp | int | 10 | write | 553.13 μs | 86.676 μs | 4.751 μs | 4.8828 | - | - | 20.88 KB |
ParquetNet | int | 100 | read | 75.46 μs | 9.007 μs | 0.494 μs | 2.9297 | - | - | 12.19 KB |
ParquetSharp | int | 100 | read | 112.53 μs | 7.467 μs | 0.409 μs | 9.1553 | 1.8311 | - | 37.95 KB |
ParquetNet | int | 100 | write | 539.13 μs | 894.439 μs | 49.027 μs | 3.4180 | - | - | 14 KB |
ParquetSharp | int | 100 | write | 640.44 μs | 203.358 μs | 11.147 μs | 4.8828 | - | - | 20.88 KB |
ParquetNet | int | 1000 | read | 175.95 μs | 30.184 μs | 1.655 μs | 5.6152 | - | - | 22.15 KB |
ParquetSharp | int | 1000 | read | 106.26 μs | 16.655 μs | 0.913 μs | 10.1318 | 1.9531 | - | 41.47 KB |
ParquetNet | int | 1000 | write | 600.46 μs | 203.480 μs | 11.153 μs | 9.7656 | - | - | 40.15 KB |
ParquetSharp | int | 1000 | write | 747.07 μs | 300.856 μs | 16.491 μs | 4.8828 | - | - | 20.88 KB |
ParquetNet | int | 1000000 | read | 11,774.15 μs | 7,818.750 μs | 428.572 μs | 156.2500 | 156.2500 | 156.2500 | 7827.9 KB |
ParquetSharp | int | 1000000 | read | 2,863.95 μs | 559.571 μs | 30.672 μs | 117.1875 | 109.3750 | 109.3750 | 3944.37 KB |
ParquetNet | int | 1000000 | write | 37,166.73 μs | 17,563.943 μs | 962.739 μs | 714.2857 | 714.2857 | 714.2857 | 30277.97 KB |
ParquetSharp | int | 1000000 | write | 24,173.06 μs | 13,369.745 μs | 732.841 μs | - | - | - | 20.94 KB |
ParquetNet | str | 10 | read | 82.11 μs | 40.593 μs | 2.225 μs | 3.9063 | - | - | 16.28 KB |
ParquetSharp | str | 10 | read | 122.56 μs | 22.451 μs | 1.231 μs | 26.8555 | 6.5918 | - | 112.05 KB |
ParquetNet | str | 10 | write | 585.58 μs | 153.079 μs | 8.391 μs | 3.9063 | - | - | 17.11 KB |
ParquetSharp | str | 10 | write | 619.85 μs | 421.429 μs | 23.100 μs | 19.5313 | 3.9063 | - | 81.14 KB |
ParquetNet | str | 100 | read | 197.51 μs | 102.508 μs | 5.619 μs | 13.9160 | - | - | 50.01 KB |
ParquetSharp | str | 100 | read | 130.14 μs | 20.777 μs | 1.139 μs | 32.2266 | 10.4980 | - | 132.45 KB |
ParquetNet | str | 100 | write | 2,023.54 μs | 2,064.824 μs | 113.180 μs | 13.6719 | - | - | 56.64 KB |
ParquetSharp | str | 100 | write | 795.58 μs | 2,056.523 μs | 112.725 μs | 20.5078 | 3.9063 | - | 87.21 KB |
ParquetNet | str | 1000 | read | 514.14 μs | 537.780 μs | 29.478 μs | 64.4531 | 32.2266 | 32.2266 | 355.95 KB |
ParquetSharp | str | 1000 | read | 236.97 μs | 298.893 μs | 16.383 μs | 71.2891 | 23.4375 | - | 336.35 KB |
ParquetNet | str | 1000 | write | 1,378.53 μs | 1,838.067 μs | 100.751 μs | 72.2656 | 72.2656 | 72.2656 | 395.21 KB |
ParquetSharp | str | 1000 | write | 1,005.94 μs | 874.443 μs | 47.931 μs | 48.8281 | 9.7656 | - | 206.29 KB |
ParquetNet | str | 1000000 | read | 373,941.10 μs | 176,271.747 μs | 9,662.049 μs | 36000.0000 | 18000.0000 | 1000.0000 | 339864.13 KB |
ParquetSharp | str | 1000000 | read | 349,504.53 μs | 192,279.577 μs | 10,539.492 μs | 36000.0000 | 18000.0000 | 1000.0000 | 226691.53 KB |
ParquetNet | str | 1000000 | write | 623,253.27 μs | 369,838.885 μs | 20,272.117 μs | - | - | - | 390525.34 KB |
ParquetSharp | str | 1000000 | write | 262,019.10 μs | 247,550.530 μs | 13,569.080 μs | 11500.0000 | 1000.0000 | - | 110924.43 KB |
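A few notes on reading the table: Error is half of the 99.9% confidence interval, Gen0/Gen1/Gen2 are GC collections per 1000 operations, and Allocated is managed memory allocated per single operation. That last point matters here: BenchmarkDotNet's MemoryDiagnoser only sees managed allocations, which is why ParquetSharp's large writes report around 20 KB allocated even at a million rows - its real buffers live in native memory.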
Summary
At 10 and 100 rows Parquet.Net is actually 10-30% faster, presumably because ParquetSharp pays a fixed interop cost for crossing into native code that dominates tiny workloads. On the large dataset, i.e. 1 million rows, Parquet.Net's results are:
- integer type: reading is 4 times slower than ParquetSharp (11,774 μs vs 2,864 μs); writing is 1.5 times slower.
- float type: reading is 5 times slower; writing is almost identical.
- string type: reading is almost identical; writing is 2.4 times slower.
So yes, Parquet.Net 4.1.3 is slower, but nowhere near 8 times. And unlike a native wrapper, you get safe code implemented purely in .NET that runs on every platform .NET supports.
But I'd watch this space, because Parquet.Net 4.2 will start bringing performance improvements that may put it on par or even make it faster. The main reason Parquet.Net is slow today is its reliance on BinaryWriter and BinaryReader for binary encoding, and those classes are not optimised for heavy workloads. Early tests show that eliminating them makes Parquet.Net faster than ParquetSharp, so these improvements may land soon.
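To make the difference concrete, here is a minimal sketch (not Parquet.Net's actual code; the `PlainEncoding` class and its method names are invented for illustration). BinaryWriter pushes one value at a time through its small internal buffer, while a Span-based approach can hand the whole column to the stream in a single write:

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper, for illustration only - not Parquet.Net internals.
static class PlainEncoding
{
    // One virtual Write call per element; each call moves 4 bytes
    // through BinaryWriter's internal buffer.
    public static void WriteWithBinaryWriter(Stream destination, int[] values)
    {
        using var writer = new BinaryWriter(destination, Encoding.UTF8, leaveOpen: true);
        foreach (int v in values)
            writer.Write(v);
    }

    // Reinterprets the int[] as raw bytes and writes it with a single call -
    // valid for Parquet's PLAIN encoding of fixed-width types on little-endian CPUs.
    public static void WriteBulk(Stream destination, int[] values)
    {
        destination.Write(MemoryMarshal.AsBytes(values.AsSpan()));
    }
}
```

The second version replaces a million virtual `Write` calls with one bulk copy, which is the kind of overhead that dropping BinaryWriter avoids.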
Benchmarking Code
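The script below runs in LINQPad (note the optimize+ directive). It assumes the Parquet.Net 4.1.3, ParquetSharp 8.0.0 and BenchmarkDotNet NuGet packages are referenced, with the relevant namespaces (Parquet, Parquet.Data, ParquetSharp, BenchmarkDotNet.Attributes, BenchmarkDotNet.Running) imported via the query properties, since LINQPad keeps usings outside the script body.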
```csharp
#LINQPad optimize+
void Main()
{
Util.AutoScrollResults = true;
// data samples are generated in GlobalSetup, before anything is measured
BenchmarkRunner.Run<ParquetBenches>();
}
[ShortRunJob]
[MarkdownExporter]
[MemoryDiagnoser]
public class ParquetBenches {
public string ParquetNetFilename;
public string ParquetSharpFilename;
[Params("int", "str", "float")]
public string DataType;
[Params(10, 100, 1000, 1000000)]
//[Params(10)]
public int DataSize;
[Params("write", "read")]
public string Mode;
private static Random random = new Random();
public static string RandomString(int length) {
const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
return new string(Enumerable.Repeat(chars, length)
.Select(s => s[random.Next(s.Length)]).ToArray());
}
private DataColumn _pqnc; // Parquet.Net: column with test data
private Schema _pqns; // Parquet.Net: schema
private Column[] _pqss; // ParquetSharp: column descriptors
private object _pqsd; // ParquetSharp: raw data array to write
private Action<RowGroupWriter> _pqsWriteAction; // ParquetSharp: typed write for the current DataType
private Action<RowGroupReader, int> _pqsReadAction; // ParquetSharp: typed read for the current DataType
[GlobalSetup]
public async Task Setup() {
switch (DataType) {
case "int":
_pqnc = new DataColumn(new DataField<int>("c"), Enumerable.Range(0, DataSize).ToArray());
_pqss = new Column[] { new Column<int>("c") };
_pqsd = (int[])_pqnc.Data;
_pqsWriteAction = w => {
using (var colWriter = w.NextColumn().LogicalWriter<int>()) {
colWriter.WriteBatch((int[])_pqsd);
}
};
_pqsReadAction = (r, n) => {
var data = r.Column(0).LogicalReader<int>().ReadAll(n);
};
break;
case "str":
_pqnc = new DataColumn(new DataField<string>("c"), Enumerable.Range(0, DataSize).Select(i => RandomString(100)).ToArray());
_pqss = new Column[] { new Column<string>("c") };
_pqsd = (string[])_pqnc.Data;
_pqsWriteAction = w => {
using (var colWriter = w.NextColumn().LogicalWriter<string>()) {
colWriter.WriteBatch((string[])_pqsd);
}
};
_pqsReadAction = (r, n) => {
var data = r.Column(0).LogicalReader<string>().ReadAll(n);
};
break;
case "float":
_pqnc = new DataColumn(
new DataField<float>("f"), Enumerable.Range(0, DataSize).Select(i => (float)i).ToArray());
_pqss = new Column[] { new Column<float>("f") };
_pqsd = (float[])_pqnc.Data;
_pqsWriteAction = w => {
using (var colWriter = w.NextColumn().LogicalWriter<float>()) {
colWriter.WriteBatch((float[])_pqsd);
}
};
_pqsReadAction = (r, n) => {
var data = r.Column(0).LogicalReader<float>().ReadAll(n);
};
break;
case "date":
_pqnc = new DataColumn(
new DataField<DateTimeOffset>("dto"),
Enumerable.Range(0, DataSize).Select(i => (DateTimeOffset)DateTime.UtcNow.AddSeconds(i)).ToArray());
_pqss = new Column[] { new Column<DateTimeOffset>("dto") };
_pqsd = (DateTimeOffset[])_pqnc.Data;
_pqsWriteAction = w => {
using (var colWriter = w.NextColumn().LogicalWriter<DateTimeOffset>()) {
colWriter.WriteBatch((DateTimeOffset[])_pqsd);
}
};
_pqsReadAction = (r, n) => {
var data = r.Column(0).LogicalReader<DateTimeOffset>().ReadAll(n);
};
break;
default:
throw new NotImplementedException();
}
_pqns = new Schema(_pqnc.Field);
ParquetNetFilename = $"c:\\tmp\\parq_net_benchmark_{Mode}_{DataSize}_{DataType}.parquet";
ParquetSharpFilename = $"c:\\tmp\\parq_sharp_benchmark_{Mode}_{DataSize}_{DataType}.parquet";
if (Mode == "read") {
using (Stream fileStream = File.Create(ParquetNetFilename)) {
using (var writer = await ParquetWriter.CreateAsync(_pqns, fileStream)) {
writer.CompressionMethod = CompressionMethod.None;
// create a new row group in the file
using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup()) {
await groupWriter.WriteColumnAsync(_pqnc);
}
}
}
}
}
[Benchmark]
public async Task ParquetNet() {
if (Mode == "write") {
using (Stream fileStream = File.Create(ParquetNetFilename)) {
using (var writer = await ParquetWriter.CreateAsync(_pqns, fileStream)) {
writer.CompressionMethod = CompressionMethod.None;
// create a new row group in the file
using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup()) {
await groupWriter.WriteColumnAsync(_pqnc);
}
}
}
} else if(Mode == "read") {
using(var reader = await ParquetReader.CreateAsync(ParquetNetFilename)) {
await reader.ReadEntireRowGroupAsync();
}
}
}
[Benchmark]
public async Task ParquetSharp() {
//https://github.com/G-Research/ParquetSharp#low-level-api
if (Mode == "write") {
using (var writer = new ParquetFileWriter(ParquetSharpFilename, _pqss, Compression.Uncompressed)) {
using (RowGroupWriter rowGroup = writer.AppendRowGroup()) {
_pqsWriteAction(rowGroup);
}
}
} else if(Mode == "read") {
using(var reader = new ParquetFileReader(ParquetNetFilename)) {
using(var g = reader.RowGroup(0)) {
int n = checked((int) g.MetaData.NumRows);
_pqsReadAction(g, n);
}
}
}
}
}
```
To contact me, send an email anytime or leave a comment below.