Commit 8ef4662

add blogpost
1 parent e631e92 commit 8ef4662

3 files changed: +271 -2 lines changed

.github/workflows/pages.yml

+1 -1
@@ -32,7 +32,7 @@ jobs:
         with:
           go-version: "1.22.5"
       - name: Install sitetools
-        run: go install github.com/vinceanalytics/sitetools@7e112de4f637d6badac7011424e73637e50bd2ac
+        run: go install github.com/vinceanalytics/sitetools@ca340be84746e9a08e4d0dcb55cecda0013bd646
       - name: Setup Pages
         id: pages
         uses: actions/configure-pages@v5

blog.json

+12 -1
@@ -1 +1,12 @@
-[]
+[
+  {
+    "title": "Cost of storing 1 Million events",
+    "source": "blog/cost-of-storing-1-million-events.md",
+    "link": "/blog/cost-of-storing-1-million-events",
+    "author": {
+      "name": "Geofrey Ernest",
+      "social": "https://github.com/gernest"
+    },
+    "date": "2024-11-07"
+  }
+]
+258
@@ -0,0 +1,258 @@
We rely heavily on [Serialized Roaring Bitmaps](https://github.com/dgraph-io/sroar)
in all aspects of our storage layer. This post takes a deep dive into the worst-case scenario
of storing 1 million web analytics events.


## Event model

These are the properties of the fundamental data model that `vince` stores.

```
int64 timestamp = 1;
int64 id = 2;
int32 bounce = 3;
bool session = 4;
bool view = 5;
int64 duration = 6;
string browser = 19;
string browser_version = 20;
string city = 26;
string country = 23;
string device = 18;
string domain = 25;
string entry_page = 9;
string event = 7;
string exit_page = 10;
string host = 27;
string os = 21;
string os_version = 22;
string page = 8;
string referrer = 12;
string region = 24;
string source = 11;
string utm_campaign = 15;
string utm_content = 16;
string utm_medium = 14;
string utm_source = 13;
string utm_term = 17;
```

**notes**:

- `timestamp` is never indexed. Instead, it is used as part of the key in the key/value
store to encode view information. By default we truncate it to minute resolution. We also generate
views at `minute`, `hour`, `day` and `week` resolutions; those are not discussed in this post,
which focuses on the default `minute` resolution.

- `bounce` has three possible values: `-1`, `0` and `1`.

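To make the truncation in the `timestamp` note concrete, here is a minimal sketch. It assumes `timestamp` holds non-negative Unix milliseconds (the model only says `int64`), and `bucket` is a hypothetical helper, not a function from `vince`. Events that fall in the same window end up with the same truncated value, which is what lets the timestamp act as part of a key.

```go
package main

import "fmt"

// Window widths in milliseconds. Assumption: timestamps are Unix
// milliseconds; the post does not state the unit.
const (
	minute int64 = 60_000
	hour   int64 = 60 * minute
	day    int64 = 24 * hour
)

// bucket rounds a non-negative timestamp down to the start of its window.
// Hypothetical helper for illustration only.
func bucket(ts, width int64) int64 {
	return ts - ts%width
}

func main() {
	ts := int64(1_730_987_159_000) // an example event time in Unix ms
	fmt.Println("minute:", bucket(ts, minute))
	fmt.Println("hour:  ", bucket(ts, hour))
	fmt.Println("day:   ", bucket(ts, day))
}
```
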

By cost, we mean storage capacity. Updating the bitmaps is very efficient and fast, but I will leave the analysis
of updates to a future blog post. For now we only care about storage.


From the model above we have four distinct data types: `bool`, `string`, `int32` and `int64`. All
of these data types are stored as serialized roaring bitmaps, using different encoding schemes.

## Field data encodings

Each new event is assigned a unique, auto-incrementing `uint64` id. We store this id with each field.
Fields are stored separately, so processing works like any other columnar store: columns can be
processed independently, which reduces data scans because only the data needed for analysis is read.

**Data is partitioned into groups of 1 million ids.** We call these groups shards. Basically,
ids `1 ... 1M` belong to shard `0`, ids `1M ... 2M` belong to shard `1`, and so on. By partitioning,
we are able to skip a whole 1M chunk the moment we know there is no interesting data in it. This
ensures `vince` performs extremely well when breaking down massive datasets of historical data
on commodity hardware.

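The id-to-shard mapping can be sketched with plain integer arithmetic. This is an illustration under two assumptions: ids are zero-based, and shard boundaries fall on multiples of `1 << 20` (the value of `Million` in the script at the end of this post). `shardOf` and `columnOf` are hypothetical names.

```go
package main

import "fmt"

// shardWidth is the number of ids per shard. Assumption: it matches the
// `Million` constant (1 << 20) used by the script in this post.
const shardWidth uint64 = 1 << 20

// shardOf returns the shard an event id falls into.
func shardOf(id uint64) uint64 { return id / shardWidth }

// columnOf returns the position of the id inside its shard.
func columnOf(id uint64) uint64 { return id % shardWidth }

func main() {
	for _, id := range []uint64{0, shardWidth - 1, shardWidth, 3*shardWidth + 42} {
		fmt.Printf("id=%d shard=%d column=%d\n", id, shardOf(id), columnOf(id))
	}
}
```

Because the shard number depends only on the id, a query can decide per shard whether to read anything at all before touching the bitmaps inside it.
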
In this blog post we will be using the first shard, `0`, but the results should be the same regardless of the
shard number.


## Breakdown

**bool**

The worst-case value for `bool` is `true`, because we don't store `false` values.
We encode this field as an [equality encoded bitmap](https://docs.featurebase.com/docs/cloud/cloud-faq/cloud-faq-bitmaps-equality-encoded/).

```
func Boolean() {
	ra := roaring.NewBitmap()
	for i := range Million {
		ra.Bool(i, true) // worst case: every id in the shard is true
	}
	summary("bool", ra)
}
```

```
> bool
 serialized compressed
     132 kB     6.5 kB
```

**string**

We keep a separate mapping of `string => uint64`. The worst case is storing 1 million
unique strings (impossible in practice, but let's find out).
We encode this field as an [equality encoded bitmap](https://docs.featurebase.com/docs/cloud/cloud-faq/cloud-faq-bitmaps-equality-encoded/).

```
func String() {
	ra := roaring.NewBitmap()
	for i := range Million {
		ra.Mutex(i, i) // worst case: every id gets its own unique value
	}
	summary("string", ra)
}
```

```
> string
 serialized compressed
     151 MB      17 MB
```

116+
**int32**
117+
118+
We have `bounce` as `int32` but we only store `-1`, `0`, and `1`. The worst case
119+
is `-1`.
120+
We encode this field using [bit slice index](https://docs.featurebase.com/docs/cloud/cloud-faq/cloud-faq-bitmaps-bit-slice/).
121+
122+
```
123+
func Bounce() {
124+
ra := roaring.NewBitmap()
125+
for i := range Million {
126+
ra.BSI(i, -1)
127+
}
128+
summary("bounce", ra)
129+
}
130+
```
131+
132+
```
133+
> bounce
134+
serialized compressed
135+
395 kB 19 kB
136+
```
137+
138+
**int64**
139+
140+
Worst case 1 Million unique values.
141+
We encode this field using [bit slice index](https://docs.featurebase.com/docs/cloud/cloud-faq/cloud-faq-bitmaps-bit-slice/).
142+
143+
```
144+
func Int64() {
145+
ra := roaring.NewBitmap()
146+
for i := range Million {
147+
ra.BSI(i, int64(i))
148+
}
149+
summary("int64", ra)
150+
}
151+
```
152+
153+
```
154+
> int64
155+
serialized compressed
156+
2.5 MB 175 kB
157+
```
158+
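A bit slice index can be sketched in a few lines. The row layout below (one row marking that a value exists, one for the sign, then one row per magnitude bit) follows the FeatureBase convention and is an assumption about how `vince` lays out its rows. Maps stand in for bitmaps, and the `bsi` type is hypothetical.

```go
package main

import "fmt"

// Assumed row layout: exists, sign, then magnitude bits from least significant.
const (
	existsRow = 0
	signRow   = 1
	offsetRow = 2
)

// bsi maps a row number to the set of event ids that have a bit in that row.
type bsi struct {
	rows map[int]map[uint64]bool
}

func newBSI() *bsi { return &bsi{rows: map[int]map[uint64]bool{}} }

func (b *bsi) mark(row int, id uint64) {
	if b.rows[row] == nil {
		b.rows[row] = map[uint64]bool{}
	}
	b.rows[row][id] = true
}

// set stores v for id by marking one row per set bit of |v|.
func (b *bsi) set(id uint64, v int64) {
	b.mark(existsRow, id)
	mag := uint64(v)
	if v < 0 {
		b.mark(signRow, id)
		mag = uint64(-v)
	}
	for bit := 0; mag != 0; bit, mag = bit+1, mag>>1 {
		if mag&1 == 1 {
			b.mark(offsetRow+bit, id)
		}
	}
}

// get rebuilds the value for id from the rows it appears in.
func (b *bsi) get(id uint64) (int64, bool) {
	if !b.rows[existsRow][id] {
		return 0, false
	}
	var mag uint64
	for row, ids := range b.rows {
		if row >= offsetRow && ids[id] {
			mag |= 1 << uint(row-offsetRow)
		}
	}
	v := int64(mag)
	if b.rows[signRow][id] {
		v = -v
	}
	return v, true
}

func main() {
	b := newBSI()
	b.set(0, -1)
	fmt.Println(len(b.rows)) // prints 3: exists, sign and bit 0
}
```

Under this layout `-1` touches three rows, while `0` touches one and `1` touches two, which would explain why `-1` is the worst case for `bounce`. It is also consistent with the `bounce` result (395 kB) being roughly three times the `bool` result (132 kB).
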

## Conclusion

```
> bool
 serialized compressed
     132 kB     6.5 kB
> string
 serialized compressed
     151 MB      17 MB
> bounce
 serialized compressed
     395 kB      19 kB
> int64
 serialized compressed
     2.5 MB     175 kB
```

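The worst-case figures above are easier to compare as bytes per event. The sketch below divides the rounded compressed sizes printed by the script by the number of events in a shard. It assumes the sizes are in SI units (`humanize.Bytes` reports base-1000 units), and `bytesPerEvent` is a hypothetical helper.

```go
package main

import "fmt"

// events is the number of ids in one shard, matching the script's
// `Million` constant.
const events = 1 << 20

// bytesPerEvent spreads a reported size evenly over all events in the shard.
func bytesPerEvent(totalBytes float64) float64 {
	return totalBytes / events
}

func main() {
	// Compressed sizes as printed above, converted to bytes (1 kB = 1000 B).
	reported := []struct {
		name  string
		bytes float64
	}{
		{"bool", 6.5e3},
		{"string", 17e6},
		{"bounce", 19e3},
		{"int64", 175e3},
	}
	for _, r := range reported {
		fmt.Printf("%-7s %.4f bytes/event\n", r.name, bytesPerEvent(r.bytes))
	}
}
```
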
`vince` is truly cost effective: we don't waste CPU or memory, thanks to serialized roaring bitmaps.
This post focused on storage; we will talk about efficient CPU utilization in another post.

The numbers above are worst-case, impossible scenarios. In real production workloads
all fields except `id` have very low cardinality, and compression is applied at the block
level rather than to individual bitmaps, which shrinks storage costs even more.

In short, we advise you to start with a cheap `$5` VPS to test vince before production deployment.
Depending on the number of active websites and site traffic, you can vertically scale resources
(mainly CPU and bandwidth); storage is unlikely to ever be an issue. You can happily store data forever
and go back in time whenever you feel like it.

**Full script**

```
package main

import (
	"bytes"
	"fmt"
	"os"
	"text/tabwriter"

	"github.com/dustin/go-humanize"
	"github.com/golang/snappy"
	"github.com/vinceanalytics/vince/internal/roaring"
)

// Million is the number of ids in one shard (1,048,576).
const Million uint64 = 1 << 20

func main() {
	Boolean()
	String()
	Bounce()
	Int64()
}

// Boolean sets every id in the shard to true.
func Boolean() {
	ra := roaring.NewBitmap()
	for i := range Million {
		ra.Bool(i, true)
	}
	summary("bool", ra)
}

// String gives every id its own unique value.
func String() {
	ra := roaring.NewBitmap()
	for i := range Million {
		ra.Mutex(i, i)
	}
	summary("string", ra)
}

// Bounce stores -1 for every id.
func Bounce() {
	ra := roaring.NewBitmap()
	for i := range Million {
		ra.BSI(i, -1)
	}
	summary("bounce", ra)
}

// Int64 stores a different value for every id.
func Int64() {
	ra := roaring.NewBitmap()
	for i := range Million {
		ra.BSI(i, int64(i))
	}
	summary("int64", ra)
}

// summary prints the serialized size of ra and its size after snappy compression.
func summary(dataType string, ra *roaring.Bitmap) {
	fmt.Println(">", dataType)
	o := ra.ToBuffer()
	var b bytes.Buffer
	c := snappy.NewBufferedWriter(&b)
	c.Write(o)
	c.Close() // flush buffered data so b.Len() reflects the full compressed size
	w := tabwriter.NewWriter(os.Stdout, 0, 0, 1, ' ', tabwriter.AlignRight)
	fmt.Fprintln(w, "serialized\tcompressed\t")
	fmt.Fprintf(w, "%s\t%s\t\n", humanize.Bytes(uint64(len(o))), humanize.Bytes(uint64(b.Len())))
	w.Flush()
}
```
