MIME entity parser


Combining x::mime::newline_iter, x::mime::bodystart_iter, and x::mime::header_iter with the following iterators forms a complete parser for a non-compound MIME section (with some assistance from an x::mime::header_collector):

#include <x/mime/sectiondecoder.H>

std::string content_transfer_encoding;

x::mime::section_decoder decoder=
    x::mime::section_decoder::create(content_transfer_encoding,
                                     std::ostreambuf_iterator<char>(std::cout));

The first parameter to x::mime::section_decoder's create() is the value of the MIME “Content-Transfer-Encoding” header, like “quoted-printable” or “base64” (all other values result in a non-transformative decoder). The second parameter is an output iterator over chars. x::mime::section_decoder is itself an output iterator over ints. It expects the output sequence produced by, at a minimum, an x::mime::newline_iter: a sequence of ints, demarcated by newlines, with a trailing eof().

The output iterator instance received by create() gets iterated over the chars that were decoded using the specified transfer encoding.
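As an illustration of what decoding a “quoted-printable” transfer encoding involves, here is a minimal sketch in standard C++. It is not the library's implementation: decode_quoted_printable() and hex_value() are hypothetical helper names, the function works on a whole string rather than as an output iterator, and it assumes that malformed escapes should simply pass through unchanged.

```cpp
#include <cstddef>
#include <string>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decodes a quoted-printable string in one pass.
std::string decode_quoted_printable(const std::string &in)
{
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] != '=')
        {
            out.push_back(in[i]); // Literal character, copied as is.
            continue;
        }

        // "=\n" or "=\r\n" is a soft line break that produces no output.
        if (i + 1 < in.size() && in[i + 1] == '\n')
        {
            ++i;
            continue;
        }
        if (i + 2 < in.size() && in[i + 1] == '\r' && in[i + 2] == '\n')
        {
            i += 2;
            continue;
        }

        // "=XX" encodes a single byte as two hexadecimal digits.
        if (i + 2 < in.size())
        {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);

            if (hi >= 0 && lo >= 0)
            {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                continue;
            }
        }

        // Assumption: a malformed escape is passed through unchanged.
        out.push_back('=');
    }
    return out;
}
```

With this sketch, the input “Hello=A0world!” decodes to the 12 bytes of “Hello”, the single byte 0xA0, and “world!”.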

std::string content_transfer_encoding;
std::string charset;

x::mime::section_decoder decoder=
    x::mime::section_decoder::create(content_transfer_encoding,
                                     std::ostreambuf_iterator<char>(std::cout),
                                     charset, "UTF-8");

create() takes two optional parameters. For text MIME entities, the first optional parameter is the MIME entity's character set, from the “Content-Type” header. The second optional parameter is the application's character set. In addition to decoding the output sequence, x::mime::section_decoder transcodes the chars from the MIME entity's character set to the application's character set.
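To make the transcoding step concrete, the following sketch converts ISO-8859-1 bytes to UTF-8 in standard C++. It only illustrates the idea for this one pair of character sets. latin1_to_utf8() is a hypothetical name, and the library's decoder handles arbitrary character sets through its own machinery.

```cpp
#include <string>

// Hypothetical helper: transcodes ISO-8859-1 to UTF-8.
// Every ISO-8859-1 byte maps to the Unicode code point of the same value.
std::string latin1_to_utf8(const std::string &in)
{
    std::string out;

    for (char ch : in)
    {
        unsigned char c = static_cast<unsigned char>(ch);

        if (c < 0x80)
        {
            out.push_back(ch); // ASCII is unchanged in UTF-8.
        }
        else
        {
            // Code points 0x80-0xFF become a two-byte UTF-8 sequence.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For example, the ISO-8859-1 non-breaking space, 0xA0, becomes the two UTF-8 bytes 0xC2 0xA0.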

#include <x/mime/entityparser.H>

x::outputrefiterator<int> processor=x::mime::make_entity_parser(
    x::mime::header_collector::create(
        []
        (const std::string &name,
         const std::string &name_lc,
         const std::string &value)
        {
            // ...
        }),
        []
        {
            return body_iterator();
        }, x::mime::sectioninfo::create());

x::mime::make_entity_parser() combines x::mime::bodystart_iter, x::mime::header_iter, and a few other odds and ends. It instantiates an output iterator that expects to be iterated over a single MIME entity. x::mime::make_entity_parser() returns an instance of a template class for an output iterator that's convertible to an x::outputrefiterator<int>, and which iterates over an output sequence of int values produced by x::mime::newline_iter.

x::mime::make_entity_parser() takes three parameters, and returns an output iterator over ints. The first parameter becomes an output iterator that gets iterated over the header portion of the MIME entity. The iterator constructed by x::mime::header_collector is a popular choice for the header iterator, since x::mime::make_entity_parser() already constructs an intermediate x::mime::header_iter anyway.

When x::mime::make_entity_parser()'s iterator iterates over an x::mime::body_start, the header iterator iterates over an x::mime::eof value and the second parameter to x::mime::make_entity_parser() gets invoked. The second parameter is a functor or a lambda that returns another output iterator over ints, which ends up iterating over the rest of the output sequence, after the x::mime::body_start.

x::mime::make_entity_parser() encapsulates the typical control flow of collecting the headers of a MIME entity, then figuring out how to parse the entity's body. The standard approach is to have the header iterator collect the MIME entity's headers, then have the functor/lambda figure out what to do with this entity, and return an output iterator that implements what's to be done.

The third and the final parameter is a x::mime::sectioninfo. The output iterator returned by x::mime::make_entity_parser() updates this object as it iterates over its output sequence.
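The header-then-body control flow described above can be modeled in a few lines of standard C++. This is a simplified stand-in, not the library's API: two_phase_parser, sink, and sink_factory are hypothetical names, and the sketch consumes whole lines instead of a sequence of ints. Its point is that the body factory runs only once the blank line ending the header has been seen, by which time the header sink has collected everything the factory needs.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical model of a two-phase entity parser.
class two_phase_parser {
public:
    // A sink receives one line at a time, without its trailing newline.
    using sink = std::function<void (const std::string &)>;

    // The factory builds the body sink after the headers are known.
    using sink_factory = std::function<sink ()>;

    two_phase_parser(sink header_sink, sink_factory make_body_sink)
        : header_sink_(std::move(header_sink)),
          make_body_sink_(std::move(make_body_sink))
    {
    }

    // Feed the next line of the entity.
    void line(const std::string &l)
    {
        if (in_body_)
        {
            body_sink_(l);
            return;
        }

        if (l.empty())
        {
            // End of the header: only now is the factory invoked.
            body_sink_ = make_body_sink_();
            in_body_ = true;
            return;
        }

        header_sink_(l);
    }

private:
    sink header_sink_;
    sink_factory make_body_sink_;
    sink body_sink_;
    bool in_body_ = false;
};
```

Because the factory is deferred, it can inspect whatever the header sink recorded, such as a transfer encoding, and return a body sink chosen accordingly.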

Note

x::mime::make_entity_parser() returns an output iterator, and that's pretty much it. The show starts only after it actually iterates over something that resembles a MIME entity. This has a couple of implications.

What the functors or lambdas capture, and whether by reference or by value, needs careful thought. The functors/lambdas do not get invoked by x::mime::make_entity_parser(). They get invoked, as appropriate, when the resulting output iterator actually iterates over something. This usually means capturing by value, preferably a reference to a reference-counted object.
The values in the x::mime::sectioninfo also get updated only when the show gets on the road. x::mime::make_entity_parser() takes the x::mime::sectioninfo object, and saves it as part of the returned output iterator, which updates the MIME entity metadata in the x::mime::sectioninfo as the iterator iterates over the output sequence. The iterator's reference on the x::mime::sectioninfo object gets released only after the output iterator goes out of scope and gets destroyed.
The values in the x::mime::sectioninfo may be used only after the output iterator iterates over the eof value.
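The capture advice in this note can be sketched with standard C++ alone. The example below uses std::shared_ptr as a stand-in for the library's reference-counted objects; collected_headers and make_header_callback() are hypothetical names. The callback holds its own reference to the shared state, so it remains safe to invoke after the scope that created it has ended.

```cpp
#include <functional>
#include <memory>
#include <string>

// Hypothetical state shared between the header and body callbacks.
struct collected_headers {
    std::string content_type;
};

// Returns a callback that may be invoked long after this function returns.
std::function<void (const std::string &)>
make_header_callback(const std::shared_ptr<collected_headers> &state)
{
    // Capturing state by value copies the shared_ptr, so the callback
    // keeps the object alive for as long as the callback itself exists.
    return [state](const std::string &value)
    {
        state->content_type = value;
    };
}
```

Capturing a stack variable by reference in the same position would leave the callback with a dangling reference once the enclosing function returned.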

The following example shows how to decode a non-compound MIME entity.

examples/mime/bodydecoder.C

#include <x/mime/newlineiter.H>
#include <x/mime/headeriter.H>
#include <x/mime/bodystartiter.H>
#include <x/mime/headercollector.H>
#include <x/mime/sectiondecoder.H>
#include <x/mime/entityparser.H>
#include <x/mime/structured_content_header.H>
#include <x/chrcasecmp.H>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
	std::string content_transfer_encoding;
	std::string content_type="text";
	std::string charset;

	auto info=x::mime::sectioninfo::create();

	auto processor=
		x::mime::make_entity_parser
		(x::mime::header_collector::create
		 ([&]
		  (const std::string &name,
		   const std::string &name_lc,
		   const std::string &value)
		{
			x::chrcasecmp::str_equal_to cmp;

			if (cmp(name, x::mime::structured_content_header
				::content_transfer_encoding))
			{
				content_transfer_encoding=
					x::mime
					::structured_content_header(value)
					.value;
			}

			if (cmp(name, x::mime::structured_content_header
				::content_type))
			{
				x::mime::structured_content_header hdr(value);

				content_type=hdr.mime_content_type();
				charset=hdr.charset("iso-8859-1");
			}
		}), [&]
		 {
			 typedef std::ostreambuf_iterator<char> dump_iter_t;

			 dump_iter_t dump_to_stdout(std::cout);

			 return content_type == "text"
				 ? x::mime::section_decoder
				 ::create(content_transfer_encoding,
					  dump_to_stdout,
					  charset,
					  "UTF-8")
				 : x::mime::section_decoder
				 ::create(content_transfer_encoding,
					  dump_to_stdout);
		 }, info);

	typedef x::mime::newline_iter<decltype(processor)>
		newline_iter_t;

	std::copy(std::istreambuf_iterator<char>(std::cin),
		  std::istreambuf_iterator<char>(),
		  newline_iter_t::create(processor))
		.get()->eof();

	std::cout << info->header_char_cnt << " bytes in the header, "
		  << info->body_char_cnt << " bytes in the body." << std::endl
		  << info->header_line_cnt << " lines in the header, "
		  << info->body_line_cnt << " lines in the body." << std::endl;
	if (info->no_trailing_newline)
		std::cout << "No trailing newline" << std::endl;
	return 0;
}

bodydecoder.C constructs a processor using a header collector that only cares about the “Content-Transfer-Encoding” and “Content-Type” headers, using a case-insensitive string comparison. Proper parsing of these structured MIME headers requires an x::mime::structured_content_header, even for the “Content-Transfer-Encoding”. This makes sure that any whitespace in the headers gets properly ignored. Once the MIME entity's body begins, the body iterator construction lambda instantiates an x::mime::section_decoder that outputs to a std::ostreambuf_iterator to std::cout. If “Content-Type” indicates that this is a text MIME entity, it also gets transcoded to UTF-8.


      

$ cat bodydecoder.txt
Subject: test
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

Hello=A0world!
$ ./bodydecoder <bodydecoder.txt
Hello world!
104 bytes in the header, 15 bytes in the body.
4 lines in the header, 1 lines in the body.

bodydecoder.C reads a non-compound MIME entity on standard input, and writes its decoded body to standard output.

In this example, it's ok for the lambdas to capture the stack-scoped objects, content_transfer_encoding, content_type, and charset, by reference. Everything gets iterated, and everything goes out of scope and gets destroyed, before main() returns. In most situations, it will be somewhat difficult to capture much by reference, and the lambdas will capture everything by value. Reference-counted objects come in very handy under these circumstances.

Note

The blank line that separates the header from the body is considered to be a part of the header portion of the MIME entity.
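The counting rule in this note can be checked against the sample run with a short standard C++ sketch. This is an assumption-laden model rather than the library's code: count_sections() and section_counts are hypothetical names (the field names merely mirror the ones printed by bodydecoder.C), and the sketch assumes lines terminated by a single newline character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical tally; field names mirror those used by bodydecoder.C.
struct section_counts {
    std::size_t header_char_cnt = 0;
    std::size_t header_line_cnt = 0;
    std::size_t body_char_cnt = 0;
    std::size_t body_line_cnt = 0;
};

// Counts characters and lines, attributing the blank separator line
// to the header portion of the entity.
section_counts count_sections(const std::string &entity)
{
    section_counts c;
    bool in_body = false;
    std::size_t pos = 0;

    while (pos < entity.size())
    {
        // A line runs up to and including its newline, if there is one.
        std::size_t nl = entity.find('\n', pos);
        std::size_t end = nl == std::string::npos ? entity.size() : nl + 1;
        std::size_t len = end - pos;
        bool blank = len == 1 && entity[pos] == '\n';

        if (in_body)
        {
            c.body_char_cnt += len;
            ++c.body_line_cnt;
        }
        else
        {
            c.header_char_cnt += len;
            ++c.header_line_cnt;

            if (blank)
                in_body = true; // The separator was counted as header.
        }
        pos = end;
    }
    return c;
}
```

Applied to the contents of bodydecoder.txt, this yields 104 characters and 4 lines for the header, and 15 characters and 1 line for the body, matching the output shown earlier.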
