ArcticDB Resample Demo
This demo notebook showcases the high-performance resample capability of ArcticDB.
Here is what you need to know about it:
- It runs on-the-fly as part of the read
- This makes it much more efficient than Pandas on large datasets
- The usage is similar to the Pandas resample function (see the sketch below)
- You can apply multiple aggregators to each column
- It can be used for downsampling high-frequency data and generating "bar" data (see example 4)
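The general pattern looks like this (a minimal sketch, not a runnable cell: `lib`, `sym`, `freq` and `aggs` stand for the library, symbol, frequency string and aggregator dict defined in the examples below):

# ArcticDB: the resample runs inside the read, so only the
# downsampled result is materialised in memory
resampled = lib.read(sym, lazy=True).resample(freq).agg(aggs).collect().data

# Pandas equivalent: the full dataset is read first, then resampled
resampled_pd = lib.read(sym).data.resample(freq).agg(aggs)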
Setup
In [ ]:
!pip install arcticdb
In [2]:
import numpy as np
import pandas as pd
import arcticdb as adb
In [3]:
# connect to storage: a local LMDB database for this demo
arctic = adb.Arctic("lmdb://arcticdb_resample")
In [4]:
# get or create the library
lib = arctic.get_library('resample', create_if_missing=True)
Create Some Data
- timeseries with 12,000,000 rows and a 1-second index
- int, float, string columns
- write the data into ArcticDB
In [5]:
# data for resampling
index = pd.date_range("1990-01-01", periods=12_000_000, freq="s")
int_data = np.arange(len(index), dtype=np.uint64)
float_data = np.round(np.random.uniform(95., 105., len(index)), 3)
letters = ['a','b','c','d','e','f','g']
mkt_data = pd.DataFrame(
    index=index,
    data={
        "id": int_data,
        "price": float_data,
        "category": (letters*(len(index)//len(letters) + 1))[:len(index)]
    }
)
In [6]:
# view the first 10 rows of the data
mkt_data.head(10)
Out[6]:
| | id | price | category |
|---|---|---|---|
| 1990-01-01 00:00:00 | 0 | 95.176 | a |
| 1990-01-01 00:00:01 | 1 | 97.872 | b |
| 1990-01-01 00:00:02 | 2 | 104.930 | c |
| 1990-01-01 00:00:03 | 3 | 103.573 | d |
| 1990-01-01 00:00:04 | 4 | 97.052 | e |
| 1990-01-01 00:00:05 | 5 | 103.435 | f |
| 1990-01-01 00:00:06 | 6 | 99.339 | g |
| 1990-01-01 00:00:07 | 7 | 103.358 | a |
| 1990-01-01 00:00:08 | 8 | 104.301 | b |
| 1990-01-01 00:00:09 | 9 | 104.651 | c |
In [7]:
# write the data into ArcticDB
sym = 'market_data'
lib.write(sym, mkt_data)
Out[7]:
VersionedItem(symbol='market_data', library='resample', data=n/a, version=0, metadata=None, host='LMDB(path=~/arcticdb_resample)', timestamp=1718958796318913629)
1. Simple Resample
- Resample to 1-minute
- Use different aggregators
- Resample can be thought of as a time-based groupby (sketched below)
- The groups are all the rows that fall within each time interval
- Also run the same resample in Pandas to compare performance and results
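The groupby intuition can be checked directly in Pandas: flooring each timestamp to its bucket and grouping gives the same result as resampling. A self-contained sketch on toy data (not the market data below):

import numpy as np
import pandas as pd

# a small frame with a 1-second DatetimeIndex
df = pd.DataFrame(
    {'price': np.random.uniform(95.0, 105.0, 300)},
    index=pd.date_range('1990-01-01', periods=300, freq='s'),
)

# resampling to 1-minute buckets...
via_resample = df.resample('1min').agg({'price': 'last'})
# ...is the same as grouping by the timestamp floored to the minute
via_groupby = df.groupby(df.index.floor('1min')).agg({'price': 'last'})

assert via_resample.equals(via_groupby)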
In [8]:
# frequency and aggregator params
freq1 = '1min'
aggs1 = {'id': 'max', 'price': 'last', 'category': 'count'}
In [9]:
%%time
# create the resample query and apply it on the read
market_data_1min_df = lib.read(sym, lazy=True).resample(freq1).agg(aggs1).collect().data
print(len(market_data_1min_df))
market_data_1min_df.tail()
200000
CPU times: user 684 ms, sys: 251 ms, total: 935 ms
Wall time: 171 ms
Out[9]:
| | id | price | category |
|---|---|---|---|
| 1990-05-19 21:15:00 | 11999759 | 104.106 | 60 |
| 1990-05-19 21:16:00 | 11999819 | 104.456 | 60 |
| 1990-05-19 21:17:00 | 11999879 | 95.570 | 60 |
| 1990-05-19 21:18:00 | 11999939 | 103.967 | 60 |
| 1990-05-19 21:19:00 | 11999999 | 97.899 | 60 |
In [10]:
%%time
# read the full dataset and resample in Pandas
full_df = lib.read(sym).data
market_data_1min_pd_df = full_df.resample(freq1).agg(aggs1)
print(len(market_data_1min_pd_df))
market_data_1min_pd_df.tail()
200000
CPU times: user 1.6 s, sys: 401 ms, total: 2 s
Wall time: 1.15 s
Out[10]:
| | id | price | category |
|---|---|---|---|
| 1990-05-19 21:15:00 | 11999759 | 104.106 | 60 |
| 1990-05-19 21:16:00 | 11999819 | 104.456 | 60 |
| 1990-05-19 21:17:00 | 11999879 | 95.570 | 60 |
| 1990-05-19 21:18:00 | 11999939 | 103.967 | 60 |
| 1990-05-19 21:19:00 | 11999999 | 97.899 | 60 |
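The two results should agree. A quick hedged check (the dtype of the count column and the index freq/name attributes can legitimately differ between the two engines, so those checks are relaxed):

# compare values; relax dtype, index-freq and name checks, which may
# differ between the ArcticDB and Pandas results
pd.testing.assert_frame_equal(
    market_data_1min_df[market_data_1min_pd_df.columns],
    market_data_1min_pd_df,
    check_dtype=False,
    check_freq=False,
    check_names=False,
)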
2. Multiple Aggregators per Column
- Similar to NamedAgg in Pandas
- Downsample to 5-minute frequency
- Apply both last and count aggregators to the price column
- For multiple aggregators on one input column, the syntax is
output_column_name: (input_column_name, aggregator)
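For reference, the closest Pandas spelling is named aggregation over a time-based groupby. A sketch reusing full_df from example 1 (pd.Grouper buckets the DatetimeIndex into 5-minute intervals):

# named aggregation: output_column = (input_column, aggregator)
pd_5min = full_df.groupby(pd.Grouper(freq='5min')).agg(
    id=('id', 'max'),
    price_last=('price', 'last'),
    price_count=('price', 'count'),
    category=('category', 'first'),
)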
In [11]:
freq2 = '5min'
aggs2 = {'id': 'max', 'price_last': ('price', 'last'), 'price_count': ('price', 'count'), 'category': 'first'}
In [12]:
%%time
lib.read(sym, lazy=True).resample(freq2).agg(aggs2).collect().data
CPU times: user 1.07 s, sys: 415 ms, total: 1.49 s
Wall time: 151 ms
Out[12]:
| | id | category | price_count | price_last |
|---|---|---|---|---|
| 1990-01-01 00:00:00 | 299 | a | 300 | 102.172 |
| 1990-01-01 00:05:00 | 599 | g | 300 | 101.450 |
| 1990-01-01 00:10:00 | 899 | f | 300 | 96.718 |
| 1990-01-01 00:15:00 | 1199 | e | 300 | 96.345 |
| 1990-01-01 00:20:00 | 1499 | d | 300 | 98.955 |
| ... | ... | ... | ... | ... |
| 1990-05-19 20:55:00 | 11998799 | d | 300 | 100.277 |
| 1990-05-19 21:00:00 | 11999099 | c | 300 | 103.596 |
| 1990-05-19 21:05:00 | 11999399 | b | 300 | 96.182 |
| 1990-05-19 21:10:00 | 11999699 | a | 300 | 99.911 |
| 1990-05-19 21:15:00 | 11999999 | g | 300 | 97.899 |

40000 rows × 4 columns
3. Processing Pipeline: Chaining Operations
- Downsample to 2.5-minute frequency
- Group the resampled data by the category column
- Aggregate the category groups using mean
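For comparison, the same pipeline in Pandas takes two explicit steps after reading the full dataset into memory (a sketch assuming full_df from example 1), whereas ArcticDB pushes the whole chain into the read:

# step 1: downsample to 2.5-minute bars
bars = full_df.resample('2min30s').agg({'id': 'min', 'category': 'first'})

# step 2: group the bars by category and average the id column
by_category = bars.groupby('category').agg({'id': 'mean'})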
In [14]:
%%time
# resample, then group the bars by category, all within the read
(
    lib.read(sym, lazy=True)
    .resample('2min30s')
    .agg({'id': 'min', 'category': 'first'})
    .groupby('category')
    .agg({'id': 'mean'})
    .collect()
    .data
)
CPU times: user 1.12 s, sys: 309 ms, total: 1.43 s
Wall time: 183 ms
Out[14]:
| category | id |
|---|---|
| a | 5999700.0 |
| b | 5999925.0 |
| d | 5999850.0 |
| e | 6000075.0 |
| f | 5999775.0 |
| g | 6000000.0 |
| c | 6000150.0 |
4. Example: OHLC (Open High Low Close) Bars
- Downsample to 5-minute frequency
- Use multiple aggregators on the price column
- This is a simple example of how to convert tick data to OHLC bar data
In [15]:
freq_ohlc = '5min'
agg_ohlc = {
    'open': ('price', 'first'),
    'high': ('price', 'max'),
    'low': ('price', 'min'),
    'close': ('price', 'last')
}
In [16]:
%%time
ohlc_5min_bars = lib.read(sym, lazy=True).resample(freq_ohlc).agg(agg_ohlc).collect().data
CPU times: user 1.26 s, sys: 492 ms, total: 1.75 s
Wall time: 118 ms
In [17]:
ohlc_5min_bars.head()
Out[17]:
| | close | low | high | open |
|---|---|---|---|---|
| 1990-01-01 00:00:00 | 102.172 | 95.076 | 104.992 | 95.176 |
| 1990-01-01 00:05:00 | 101.450 | 95.008 | 104.999 | 98.520 |
| 1990-01-01 00:10:00 | 96.718 | 95.053 | 104.990 | 103.959 |
| 1990-01-01 00:15:00 | 96.345 | 95.070 | 104.969 | 95.878 |
| 1990-01-01 00:20:00 | 98.955 | 95.011 | 104.983 | 103.538 |
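Pandas ships a built-in ohlc() aggregator that should produce the same bars, which makes a convenient sanity check. A sketch assuming full_df and ohlc_5min_bars from the cells above (the column orders differ, hence the reordering):

# Pandas' built-in OHLC aggregator on the raw price series
pd_ohlc = full_df['price'].resample('5min').ohlc()

# align column order before comparing the values
assert np.allclose(
    ohlc_5min_bars[['open', 'high', 'low', 'close']].to_numpy(),
    pd_ohlc.to_numpy(),
)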
Conclusion
We have demonstrated the following about the ArcticDB resample feature:
- Easy to use, especially if you already resample in Pandas
- Very high performance: in the examples above it was several times faster than reading the full dataset and resampling in Pandas
- Can be combined with other query functions to build processing pipelines
- Can be used to generate timeseries bars, as in the OHLC example