Commit 8198be8f authored by Daniel Friesel

Add faster mode with huffman -> code look-up table

2x to 5x speed-up at the cost of ~600B of RAM. Compile with -DDEFLATE_WITH_LUT
parent e522e5cd
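
The look-up table idea can be sketched outside the library as follows. This is a minimal illustration, not the code from this commit: `struct huff_lut`, `lut_build`, `lut_decode`, and `MAX_SYMS` are hypothetical names, and the input is assumed to be an array holding one bit per element, whereas the patch reads a 16-bit word from the input stream and reverses its leading bits. Symbols are stored sorted by (code length, symbol value), so a matching code of length `len` maps to a table index of "number of shorter codes" plus "offset from the first code of that length".

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BITS 15   /* longest code length allowed by deflate */
#define MAX_SYMS 16   /* enough for this demo; the patch uses 288 and 30 */

/* Hypothetical container; the patch keeps these as separate global arrays. */
struct huff_lut {
	uint16_t count[MAX_BITS + 1];   /* number of codes per bit length */
	uint16_t first[MAX_BITS + 1];   /* first canonical code per bit length */
	uint16_t symbols[MAX_SYMS];     /* symbols sorted by (length, value) */
};

/* Build counts, first codes, and the sorted symbol table (size <= MAX_SYMS). */
void lut_build(struct huff_lut *t, const uint8_t *lengths, uint16_t size)
{
	uint16_t code = 0;
	uint16_t n = 0;

	for (uint8_t len = 0; len <= MAX_BITS; len++) {
		t->count[len] = 0;
		t->first[len] = 0;
	}
	/* A length of 0 means "symbol unused" and is not counted. */
	for (uint16_t i = 0; i < size; i++) {
		if (lengths[i]) {
			t->count[lengths[i]]++;
		}
	}
	for (uint8_t len = 1; len <= MAX_BITS; len++) {
		code = (uint16_t)((code + t->count[len - 1]) << 1);
		t->first[len] = code;
	}
	/* One pass per code length, so symbols end up ordered by length. */
	for (uint8_t len = 1; len <= MAX_BITS; len++) {
		for (uint16_t i = 0; i < size; i++) {
			if (lengths[i] == len) {
				t->symbols[n++] = i;
			}
		}
	}
}

/*
 * Decode one symbol from bits[] (one bit per element, in stream order),
 * starting at *pos. Returns the symbol, or -1 if no code matches.
 */
int lut_decode(const struct huff_lut *t, const uint8_t *bits,
	       uint16_t nbits, uint16_t *pos)
{
	uint16_t code = 0;    /* bits read so far, first bit most significant */
	uint16_t index = 0;   /* number of symbols with shorter codes */

	for (uint8_t len = 1; len <= MAX_BITS && *pos < nbits; len++) {
		code = (uint16_t)((code << 1) | bits[(*pos)++]);
		if (code >= t->first[len]
		    && code - t->first[len] < t->count[len]) {
			return t->symbols[index + code - t->first[len]];
		}
		index += t->count[len];
	}
	return -1;
}
```

With the example alphabet from RFC 1951 (code lengths 3, 3, 3, 3, 3, 2, 4, 4 for symbols 0 to 7), `lut_build` yields first codes 2 for length 3 and 14 for length 4, and the bit sequence `00 010 1111` decodes to symbols 5, 0, and 7. Each symbol is found with a single table access once the code length is known, instead of a scan over all code lengths.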
+42 −27
**zlib-deflate-nostdlib** provides a zlib decompressor (RFC 1950) and deflate
**zlib-deflate-nostdlib** provides a zlib decompressor (RFC 1950) and deflate
reader (RFC 1951) suitable for 8- and 16-bit microcontrollers. It works
reader (RFC 1951) suitable for 8- and 16-bit microcontrollers. It works fine on
fine on MCUs as small as ATMega328P (used, for example, in the Arduino Nano)
MCUs as small as ATMega328P (used, for example, in the Arduino Nano) and
and MSP430FR5994. It is compatible with both C (from c99 on) and C++.  Apart
MSP430FR5994. It is compatible with both C (from c99 on) and C++.  Apart from
from type definitions for (u)int8\_t, (u)int16\_t, and (u)int32\_t, which are
type definitions for (u)int8\_t, (u)int16\_t, and (u)int32\_t, which you can
typically provided by stdint.h, it has no external dependencies.
provide yourself if stdint.h is not available, it has no external dependencies.


zlib-deflate-nostdlib is focused on a low memory footprint. It is not optimized
zlib-deflate-nostdlib is focused on a low memory footprint and not on speed.
for speed and uses a pretty naive implementation right now.
Depending on architecture and compilation settings, it requires **1.6 to 2.6 kB
of ROM** and **0.5 to 1.2 kB of RAM**. Decompression speed ranges from **1 to 5
kB/s per MHz**.  See below for details and tunables.


Note: This library *inflates* (i.e., decompresses) data. The source files and
Note: This library *inflates* (i.e., decompresses) data. The source files and
API are named as such, as is the corresponding function in the original zlib
API are named as such, as is the corresponding function in the original zlib
@@ -105,42 +107,55 @@ is designed for. In that case, you are probably better off with


## Memory Requirements
## Memory Requirements


Excluding the decompressed data buffer, zlib-deflate-nostdlib needs about
Compilation with `-Os`. ROM/RAM values are rounded up to the next multiple of
2.5 kB of ROM and 500 Bytes of RAM. Actual values depend on the architecture,
16B and do not include the buffer for decompressed data.
see the tables below. ROM/RAM values are rounded up to the next multiple of
16B.


### default (no checksum verification)
### baseline (no checksum verification)


| Architecture | ROM | RAM |
| Architecture | ROM | RAM |
| :--- | ---: | ---: |
| :--- | ---: | ---: |
| 8-bit ATMega328P | 1824 B | 640 B |
| 8-bit ATMega328P | 1808 B | 640 B |
| 16-bit MSP430FR5994 | 2272 B | 448 B |
| 16-bit MSP430FR5994 | 2256 B | 448 B |
| 20-bit MSP430FR5994 | 2576 B | 464 B |
| 20-bit MSP430FR5994 | 2560 B | 464 B |
| 32-bit ESP8266 | 1888 B | 656 B |
| 32-bit ESP8266 | 1888 B | 656 B |
| 32-bit STM32F446RE (ARM Cortex M3) | 1600 B | 464 B |
| 32-bit STM32F446RE (ARM Cortex M3) | 1616 B | 464 B |


### compliant mode (-DDEFLATE\_CHECKSUM)
### compliant mode (-DDEFLATE\_CHECKSUM)


ROM = baseline + 150 to 300 B, RAM = baseline.

### faster mode (-DDEFLATE\_WITH\_LUT)

| Architecture | ROM | RAM |
| Architecture | ROM | RAM |
| :--- | ---: | ---: |
| :--- | ---: | ---: |
| 8-bit ATMega328P | 2032 B | 640 B |
| 8-bit ATMega328P | — | — |
| 16-bit MSP430FR5994 | 2560 B | 448 B |
| 16-bit MSP430FR5994 | 2896 B | 1088 B |
| 20-bit MSP430FR5994 | 2896 B | 464 B |
| 20-bit MSP430FR5994 | 3248 B | 1088 B |
| 32-bit ESP8266 | 2048 B | 656 B |
| 32-bit ESP8266 | 1856 B | 1296 B |
| 32-bit STM32F446RE (ARM Cortex M3) | 1782 B | 464 B |
| 32-bit STM32F446RE (ARM Cortex M3) | 1664 B | 1104 B |
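
The RAM increase in faster mode is consistent with the two tables this patch adds, `deflate_ll_codes[288]` and `deflate_d_codes[30]`, both of type `uint16_t`. As a rough cross-check (a sketch assuming a 2-byte `uint16_t` without padding; `round_up_16`, `lut_bytes`, and `lut_bytes_selfcheck` are hypothetical helpers, not part of the library), the following computes their combined size and rounds it up to the next multiple of 16 B, as the tables do:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Table dimensions taken from the patch (deflate_ll_codes, deflate_d_codes). */
#define LL_CODES 288
#define D_CODES  30

/* Hypothetical helper: round up to the next multiple of 16. */
size_t round_up_16(size_t n)
{
	return (n + 15u) & ~(size_t)15u;
}

/* Combined size of both look-up tables in bytes. */
size_t lut_bytes(void)
{
	return (size_t)(LL_CODES + D_CODES) * sizeof(uint16_t);
}

/* Compare against the 16-bit MSP430FR5994 rows: 448 B baseline, 1088 B with LUT. */
void lut_bytes_selfcheck(void)
{
	assert(round_up_16(lut_bytes()) == 1088 - 448);
}
```

The tables occupy 636 B, which rounds up to 640 B. That matches the difference between the baseline and faster-mode RAM figures for the MSP430FR5994, ESP8266, and STM32F446RE rows.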



## Performance
## Performance


Due to its focus on low RAM usage, zlib-deflate-nostdlib is very slow. Expect
Tested with text files of various sizes, minimum file size 500 bytes, maximum
about 1kB/s per MHz on 16-bit and 2kB/s per MHz on 32-bit architectures. Tested
file size determined by the amount of available RAM.
with text files of various sizes, minimum file size 500 bytes, maximum file

size determined by the amount of available RAM.
### baseline (no checksum verification)


| Architecture | Speed @ 1 MHz | Speed | CPU Clock |
| Architecture | Speed @ 1 MHz | Speed | CPU Clock |
| :--- | ---: | ---: | ---: |
| :--- | ---: | ---: | ---: |
| 8-bit ATMega328P | 1 kB/s | 10 .. 22 kB/s | 16 MHz |
| 8-bit ATMega328P | 1 kB/s | 10 .. 22 kB/s | 16 MHz |
| 16-bit MSP430FR5994 | 1 kB/s | 8..15 kB/s | 16 MHz |
| 16-bit MSP430FR5994 | 1 kB/s | 8..16 kB/s | 16 MHz |
| 20-bit MSP430FR5994 | 1 kB/s | 8..17 kB/s | 16 MHz |
| 20-bit MSP430FR5994 | 1 kB/s | 8..16 kB/s | 16 MHz |
| 32-bit ESP8266 | 1 .. 3 kB/s | 79..246 kB/s | 80 MHz |
| 32-bit ESP8266 | 1 .. 3 kB/s | 79..246 kB/s | 80 MHz |
| 32-bit STM32F446RE (ARM Cortex M3) | 1 .. 5 kB/s | 282..875 kB/s | 168 MHz |
| 32-bit STM32F446RE (ARM Cortex M3) | 1 .. 5 kB/s | 282..875 kB/s | 168 MHz |

### faster mode (-DDEFLATE\_WITH\_LUT)

| Architecture | Speed @ 1 MHz | Speed | CPU Clock |
| :--- | ---: | ---: | ---: |
| 8-bit ATMega328P | — | — | 16 MHz |
| 16-bit MSP430FR5994 | 2 kB/s | 22..37 kB/s | 16 MHz |
| 20-bit MSP430FR5994 | 2 kB/s | 20..34 kB/s | 16 MHz |
| 32-bit ESP8266 | 3 .. 8 kB/s | 234..671 kB/s | 80 MHz |
| 32-bit STM32F446RE (ARM Cortex M3) | 6 .. 17 kB/s | 986..2815 kB/s | 168 MHz |
+89 −3
@@ -92,6 +92,11 @@ uint8_t deflate_hc_lengths[19];
 */
 */
uint8_t deflate_lld_lengths[318];
uint8_t deflate_lld_lengths[318];


#ifdef DEFLATE_WITH_LUT
uint16_t deflate_ll_codes[288];
uint16_t deflate_d_codes[30];
#endif

/*
/*
 * Bit length counts and next code entries for Literal/Length alphabet.
 * Bit length counts and next code entries for Literal/Length alphabet.
 * Combined with the code lengths in deflate_lld_lengths, these make up the
 * Combined with the code lengths in deflate_lld_lengths, these make up the
@@ -159,8 +164,14 @@ static uint16_t deflate_get_bits(uint8_t num_bits)
	return ret & deflate_bitmask(num_bits);
	return ret & deflate_bitmask(num_bits);
}
}


#ifdef DEFLATE_WITH_LUT
static void deflate_build_alphabet(uint8_t * lengths, uint16_t size,
				   uint8_t * bl_count, uint16_t * next_code,
				   uint16_t * codes)
#else
static void deflate_build_alphabet(uint8_t * lengths, uint16_t size,
static void deflate_build_alphabet(uint8_t * lengths, uint16_t size,
				   uint8_t * bl_count, uint16_t * next_code)
				   uint8_t * bl_count, uint16_t * next_code)
#endif
{
{
	uint16_t i;
	uint16_t i;
	uint16_t code = 0;
	uint16_t code = 0;
@@ -178,12 +189,28 @@ static void deflate_build_alphabet(uint8_t * lengths, uint16_t size,
		}
		}
	}
	}


	for (i = 1; i < max_len + 1; i++) {
	for (i = 1; i <= max_len; i++) {
		code = (code + bl_count[i - 1]) << 1;
		code = (code + bl_count[i - 1]) << 1;
		next_code[i] = code;
		next_code[i] = code;
	}
	}

#ifdef DEFLATE_WITH_LUT
	uint8_t j = 0;
	code = 0;
	for (j = 1; j <= max_len; j++) {
		for (i = 0; i < size; i++) {
			if (lengths[i] == j) {
				codes[code++] = i;
			}
		}
	}
#endif
}
}


#ifdef DEFLATE_WITH_LUT
static uint16_t deflate_huff(uint16_t * codes,
			     uint8_t * bl_count, uint16_t * next_code)
#else
/*
/*
 * This function trades speed for low memory requirements. Instead of building
 * This function trades speed for low memory requirements. Instead of building
 * an actual huffman tree (at a cost of about 650 Bytes of RAM), we iterate
 * an actual huffman tree (at a cost of about 650 Bytes of RAM), we iterate
@@ -192,8 +219,12 @@ static void deflate_build_alphabet(uint8_t * lengths, uint16_t size,
 */
 */
static uint16_t deflate_huff(uint8_t * lengths, uint16_t size,
static uint16_t deflate_huff(uint8_t * lengths, uint16_t size,
			     uint8_t * bl_count, uint16_t * next_code)
			     uint8_t * bl_count, uint16_t * next_code)
#endif
{
{
	uint16_t next_word = deflate_get_word();
	uint16_t next_word = deflate_get_word();
#ifdef DEFLATE_WITH_LUT
	uint16_t code = 0;
#endif
	for (uint8_t num_bits = 1; num_bits < 16; num_bits++) {
	for (uint8_t num_bits = 1; num_bits < 16; num_bits++) {
		uint16_t next_bits = deflate_rev_word(next_word, num_bits);
		uint16_t next_bits = deflate_rev_word(next_word, num_bits);
		if (bl_count[num_bits] && next_bits >= next_code[num_bits]
		if (bl_count[num_bits] && next_bits >= next_code[num_bits]
@@ -203,9 +234,11 @@ static uint16_t deflate_huff(uint8_t * lengths, uint16_t size,
				deflate_input_now++;
				deflate_input_now++;
				deflate_bit_offset -= 8;
				deflate_bit_offset -= 8;
			}
			}
#ifdef DEFLATE_WITH_LUT
			return codes[code + (next_bits - next_code[num_bits])];
#else
			uint8_t len_pos = next_bits;
			uint8_t len_pos = next_bits;
			uint8_t cur_pos = next_code[num_bits];
			uint8_t cur_pos = next_code[num_bits];
			// This is slow, but memory-efficient
			for (uint16_t i = 0; i < size; i++) {
			for (uint16_t i = 0; i < size; i++) {
				if (lengths[i] == num_bits) {
				if (lengths[i] == num_bits) {
					if (cur_pos == len_pos) {
					if (cur_pos == len_pos) {
@@ -214,20 +247,35 @@ static uint16_t deflate_huff(uint8_t * lengths, uint16_t size,
					cur_pos++;
					cur_pos++;
				}
				}
			}
			}
#endif
		} else {
#ifdef DEFLATE_WITH_LUT
			code += bl_count[num_bits];
#endif
		}
		}
	}
	}
	return 65535;
	return 65535;
}
}


#ifdef DEFLATE_WITH_LUT
static int8_t deflate_huffman(uint16_t * ll_codes, uint16_t * d_codes)
#else
static int8_t deflate_huffman(uint8_t * ll_lengths, uint16_t ll_size,
static int8_t deflate_huffman(uint8_t * ll_lengths, uint16_t ll_size,
			      uint8_t * d_lengths, uint8_t d_size)
			      uint8_t * d_lengths, uint8_t d_size)
#endif
{
{
	uint16_t code;
	uint16_t code;
	uint16_t dcode;
	uint16_t dcode;
	while (1) {
	while (1) {
#ifdef DEFLATE_WITH_LUT
		code =
		    deflate_huff(ll_codes, deflate_bl_count_ll,
				 deflate_next_code_ll);
#else
		code =
		code =
		    deflate_huff(ll_lengths, ll_size, deflate_bl_count_ll,
		    deflate_huff(ll_lengths, ll_size, deflate_bl_count_ll,
				 deflate_next_code_ll);
				 deflate_next_code_ll);
#endif
		if (code < 256) {
		if (code < 256) {
			if (deflate_output_now == deflate_output_end) {
			if (deflate_output_now == deflate_output_end) {
				return DEFLATE_ERR_OUTPUT_LENGTH;
				return DEFLATE_ERR_OUTPUT_LENGTH;
@@ -244,10 +292,17 @@ static int8_t deflate_huffman(uint8_t * ll_lengths, uint16_t ll_size,
			if (extra_bits) {
			if (extra_bits) {
				len_val += deflate_get_bits(extra_bits);
				len_val += deflate_get_bits(extra_bits);
			}
			}
#ifdef DEFLATE_WITH_LUT
			dcode =
			    deflate_huff(d_codes,
					 deflate_bl_count_d,
					 deflate_next_code_d);
#else
			dcode =
			dcode =
			    deflate_huff(d_lengths, d_size,
			    deflate_huff(d_lengths, d_size,
					 deflate_bl_count_d,
					 deflate_bl_count_d,
					 deflate_next_code_d);
					 deflate_next_code_d);
#endif
			uint16_t dist_val = deflate_distance_offsets[dcode];
			uint16_t dist_val = deflate_distance_offsets[dcode];
			extra_bits = deflate_distance_bits[dcode];
			extra_bits = deflate_distance_bits[dcode];
			if (extra_bits) {
			if (extra_bits) {
@@ -313,12 +368,21 @@ static int8_t deflate_static_huffman()
		deflate_lld_lengths[i] = 5;
		deflate_lld_lengths[i] = 5;
	}
	}


#ifdef DEFLATE_WITH_LUT
	deflate_build_alphabet(deflate_lld_lengths, 288, deflate_bl_count_ll,
			       deflate_next_code_ll, deflate_ll_codes);
	deflate_build_alphabet(deflate_lld_lengths + 288, 29,
			       deflate_bl_count_d, deflate_next_code_d,
			       deflate_d_codes);
	return deflate_huffman(deflate_ll_codes, deflate_d_codes);
#else
	deflate_build_alphabet(deflate_lld_lengths, 288, deflate_bl_count_ll,
	deflate_build_alphabet(deflate_lld_lengths, 288, deflate_bl_count_ll,
			       deflate_next_code_ll);
			       deflate_next_code_ll);
	deflate_build_alphabet(deflate_lld_lengths + 288, 29,
	deflate_build_alphabet(deflate_lld_lengths + 288, 29,
			       deflate_bl_count_d, deflate_next_code_d);
			       deflate_bl_count_d, deflate_next_code_d);
	return deflate_huffman(deflate_lld_lengths, 288,
	return deflate_huffman(deflate_lld_lengths, 288,
			       deflate_lld_lengths + 288, 29);
			       deflate_lld_lengths + 288, 29);
#endif
}
}


static int8_t deflate_dynamic_huffman()
static int8_t deflate_dynamic_huffman()
@@ -336,16 +400,29 @@ static int8_t deflate_dynamic_huffman()
		deflate_hc_lengths[deflate_hclen_index[i]] = 0;
		deflate_hc_lengths[deflate_hclen_index[i]] = 0;
	}
	}


#ifdef DEFLATE_WITH_LUT
	deflate_build_alphabet(deflate_hc_lengths,
			       sizeof(deflate_hc_lengths),
			       deflate_bl_count_ll, deflate_next_code_ll,
			       deflate_ll_codes);
#else
	deflate_build_alphabet(deflate_hc_lengths,
	deflate_build_alphabet(deflate_hc_lengths,
			       sizeof(deflate_hc_lengths),
			       sizeof(deflate_hc_lengths),
			       deflate_bl_count_ll, deflate_next_code_ll);
			       deflate_bl_count_ll, deflate_next_code_ll);
#endif


	uint16_t items_processed = 0;
	uint16_t items_processed = 0;
	while (items_processed < hlit + hdist) {
	while (items_processed < hlit + hdist) {
#ifdef DEFLATE_WITH_LUT
		uint8_t code = deflate_huff(deflate_ll_codes,
					    deflate_bl_count_ll,
					    deflate_next_code_ll);
#else
		uint8_t code =
		uint8_t code =
		    deflate_huff(deflate_hc_lengths, sizeof(deflate_hc_lengths),
		    deflate_huff(deflate_hc_lengths, sizeof(deflate_hc_lengths),
				 deflate_bl_count_ll,
				 deflate_bl_count_ll,
				 deflate_next_code_ll);
				 deflate_next_code_ll);
#endif
		if (code == 16) {
		if (code == 16) {
			uint8_t copy_count = 3 + deflate_get_bits(2);
			uint8_t copy_count = 3 + deflate_get_bits(2);
			for (uint8_t i = 0; i < copy_count; i++) {
			for (uint8_t i = 0; i < copy_count; i++) {
@@ -371,13 +448,22 @@ static int8_t deflate_dynamic_huffman()
		}
		}
	}
	}


#ifdef DEFLATE_WITH_LUT
	deflate_build_alphabet(deflate_lld_lengths, hlit,
			       deflate_bl_count_ll, deflate_next_code_ll,
			       deflate_ll_codes);
	deflate_build_alphabet(deflate_lld_lengths + hlit, hdist,
			       deflate_bl_count_d, deflate_next_code_d,
			       deflate_d_codes);
	return deflate_huffman(deflate_ll_codes, deflate_d_codes);
#else
	deflate_build_alphabet(deflate_lld_lengths, hlit,
	deflate_build_alphabet(deflate_lld_lengths, hlit,
			       deflate_bl_count_ll, deflate_next_code_ll);
			       deflate_bl_count_ll, deflate_next_code_ll);
	deflate_build_alphabet(deflate_lld_lengths + hlit, hdist,
	deflate_build_alphabet(deflate_lld_lengths + hlit, hdist,
			       deflate_bl_count_d, deflate_next_code_d);
			       deflate_bl_count_d, deflate_next_code_d);

	return deflate_huffman(deflate_lld_lengths, hlit,
	return deflate_huffman(deflate_lld_lengths, hlit,
			       deflate_lld_lengths + hlit, hdist);
			       deflate_lld_lengths + hlit, hdist);
#endif
}
}


int16_t inflate(unsigned char *input_buf, uint16_t input_len,
int16_t inflate(unsigned char *input_buf, uint16_t input_len,
+1 −1
#!/bin/sh
#!/bin/sh


exec g++ -std=c++11 -Wall -Wextra -pedantic -I../src -o inflate inflate-app.c ../src/inflate.c
exec g++ -std=c++11 -O2 -Wall -Wextra -pedantic -I../src "$@" -o inflate inflate-app.c ../src/inflate.c
+1 −1
#!/bin/sh
#!/bin/sh


# g++ as provided by Debian Buster (used for CI tests) does not support c++20
# g++ as provided by Debian Buster (used for CI tests) does not support c++20
exec g++ -std=c++2a -Wall -Wextra -pedantic -I../src -o inflate inflate-app.c ../src/inflate.c
exec g++ -std=c++2a -O2 -Wall -Wextra -pedantic -I../src "$@" -o inflate inflate-app.c ../src/inflate.c
+1 −1
#!/bin/sh
#!/bin/sh


exec gcc -std=c11 -Wall -Wextra -pedantic -I../src -o inflate inflate-app.c ../src/inflate.c
exec gcc -std=c11 -O2 -Wall -Wextra -pedantic -I../src "$@" -o inflate inflate-app.c ../src/inflate.c