Atozed Forums

Dealing with binaries and text in TIdHTTP
For context, I'm trying to create a simple abstraction over TIdHTTP. So far I have something along these lines:

Code:
type
 THTTPRequest = record
   Method: (
     hmGET,
     hmPOST,
     hmPUT
   );
   URL: string;
   ContentType: string;
   Body: string;
 end;

 THTTPResponse = record
   StatusCode: SmallInt;
   Body: string;
 end;

 THTTPClient = class
 public
   function Perform(const ARequest: THTTPRequest): THTTPResponse; virtual; abstract;
 end;

 TIndyHTTPClient = class(THTTPClient)
 public
   function Perform(const ARequest: THTTPRequest): THTTPResponse; override;
 end;

function TIndyHTTPClient.Perform(const ARequest: THTTPRequest): THTTPResponse;
var
 LClient: TIdHTTP;
 LRequestBodyStream: TStringStream;
begin
 Result := Default(THTTPResponse);
 LRequestBodyStream := TStringStream.Create(ARequest.Body);
 try
   LClient := TIdHTTP.Create(nil);
   try
     LClient.HandleRedirects := True;
     LClient.HTTPOptions := LClient.HTTPOptions + [hoNoProtocolErrorException];
     LClient.Request.ContentType := ARequest.ContentType;
     case ARequest.Method of
       hmGET:
         Result.Body := LClient.Get(ARequest.URL);
       hmPUT:
         Result.Body := LClient.Put(ARequest.URL, LRequestBodyStream);
       hmPOST:
         Result.Body := LClient.Post(ARequest.URL, LRequestBodyStream);
     end;
     Result.StatusCode := LClient.ResponseCode;
   finally
     LClient.Free;
   end;
 finally
   LRequestBodyStream.Free;
 end;
end;

The issue is that not all requests and responses will be text-based, and I'd like to keep the interface as uniform as possible. Filling the Body of a THTTPRequest with TBytes converted by StringOf did not cause any immediately noticeable problem, but it worries me that unwanted conversions may happen and corrupt the data. Conversely, I'm not sure whether BytesOf on the Body of a THTTPResponse will give me a binary response (in the case of files) exactly as it was sent. I could switch to always using TBytes for both, but then I lose all the encoding/decoding intelligence that Indy has.

What is the cleanest solution (i.e. smallest number of fields/types) that will work?
It depends entirely on which features you need. Why not simply use the request that TIdHTTP already provides?
It should be possible to do both binary and text based requests, as transparently as possible. I'm aiming for a small interface, decoupled from Indy, to keep my code isolated from third party components and have easier mocking/stubbing.

Text may be sent in an appropriate encoding decided by Indy (assuming it will indicate that encoding in the headers). The main thing here is that Indy must know if the Body contains actual text or just bytes, preferably without having to set that as a separate flag. Maybe setting the right ContentType will do, or making Body a RawByteString will make it so that assigning a regular String will mark it as text whereas BytesOf won't...?

When receiving the response, I'd like Indy to decode text but leave binary as is, based on the content type and encoding specified by the server. So, if possible, I would prefer that the caller not have to indicate whether they are expecting text or binary. It's acceptable if the caller still needs to inspect the response to decide whether to interpret the Body as text or binary, but in the case of text they should not be required to decode it themselves. Again, preferably without a flag; and even if there were a flag, I don't know what I could base it on in the Perform method, i.e. whether Indy can tell me if it's likely text or binary that was received.

I hope this makes some sense.
It is clear that you don't understand how HTTP operates, or how TIdHTTP implements HTTP.

An HTTP message's body is just raw bytes. Various headers, including Content-Type, specify what the type, formatting, and transmission method of the body actually is.

TIdHTTP handles those details for you. However, you are responsible for choosing the appropriate overload of its Get()/Post()/Put() methods for not only the type of data you want to send, but also for the type of data you want to receive. TIdHTTP can send a request body from a TStrings or TStream, and can return a response body as a String or a TStream. There are many overloads to choose from:

Code:
procedure Get(AURL: string; AResponseContent: TStream); overload;
procedure Get(AURL: string; AResponseContent: TStream; AIgnoreReplies: array of Int16); overload;
function Get(AURL: string
  {$IFDEF STRING_IS_ANSI}; ADestEncoding: IIdTextEncoding = nil{$ENDIF}
): string; overload;
function Get(AURL: string; AIgnoreReplies: array of Int16
  {$IFDEF STRING_IS_ANSI}; ADestEncoding: IIdTextEncoding = nil{$ENDIF}
): string; overload;

function Post(AURL: string; const ASourceFile: String
  {$IFDEF STRING_IS_ANSI}; ADestEncoding: IIdTextEncoding = nil{$ENDIF}
): string; overload;
function Post(AURL: string; ASource: TStrings; AByteEncoding: IIdTextEncoding = nil
  {$IFDEF STRING_IS_ANSI}; ASrcEncoding: IIdTextEncoding = nil; ADestEncoding: IIdTextEncoding = nil{$ENDIF}): string; overload;
function Post(AURL: string; ASource: TStream
  {$IFDEF STRING_IS_ANSI}; ADestEncoding: IIdTextEncoding = nil{$ENDIF}
): string; overload;
function Post(AURL: string; ASource: TIdMultiPartFormDataStream
  {$IFDEF STRING_IS_ANSI}; ADestEncoding: IIdTextEncoding = nil{$ENDIF}
): string; overload;

procedure Post(AURL: string; const ASourceFile: String; AResponseContent: TStream); overload;
procedure Post(AURL: string; ASource: TStrings; AResponseContent: TStream; AByteEncoding: IIdTextEncoding = nil
  {$IFDEF STRING_IS_ANSI}; ASrcEncoding: IIdTextEncoding = nil{$ENDIF}); overload;
procedure Post(AURL: string; ASource, AResponseContent: TStream); overload;
procedure Post(AURL: string; ASource: TIdMultiPartFormDataStream; AResponseContent: TStream); overload;

function Put(AURL: string; ASource: TStream
  {$IFDEF STRING_IS_ANSI}; ADestEncoding: IIdTextEncoding = nil{$ENDIF}
): string; overload;
procedure Put(AURL: string; ASource, AResponseContent: TStream); overload;

In your THTTPRequest, using TBytes for the Body will work, but not very efficiently, especially when you start sending files, as TBytes requires its bytes to reside in memory. You don't want to load large files into memory. Not to mention, the extra overhead of requiring the user to convert their data to TBytes, and then you having to convert that to a format TIdHTTP will accept, such as a TStream (via TBytesStream, for instance), and vice versa for responses.

I would suggest using TStream for your Body data, then you can use TStringStream for text, TFileStream for files, TMemoryStream for blocks of memory, etc. You gain a lot more flexibility that way.
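
As a rough sketch of that suggestion (not a definitive design), the abstraction from the first post could carry a TStream in the request and write the response into a caller-supplied TStream, using only the stream overloads listed above. The unit name, the named THTTPMethod type, the MakeUtf8Body helper and the choice to return the status code as the function result are all assumptions made for illustration:

```pascal
unit HTTPClientSketch;

interface

uses
  System.Classes, System.SysUtils, IdHTTP;

type
  THTTPMethod = (hmGET, hmPOST, hmPUT);

  THTTPRequest = record
    Method: THTTPMethod;
    URL: string;
    ContentType: string;
    Body: TStream; // raw request bytes; owned by the caller, ignored for GET
  end;

// Hypothetical helper: wraps text as UTF-8 bytes for use as a request body.
function MakeUtf8Body(const AText: string): TStream;

// Sends ARequest, writes the raw response bytes to AResponseBody,
// and returns the HTTP status code.
function Perform(const ARequest: THTTPRequest; AResponseBody: TStream): Integer;

implementation

function MakeUtf8Body(const AText: string): TStream;
begin
  // TStringStream encodes the text once, up front, with the given encoding.
  Result := TStringStream.Create(AText, TEncoding.UTF8);
end;

function Perform(const ARequest: THTTPRequest; AResponseBody: TStream): Integer;
var
  LClient: TIdHTTP;
begin
  LClient := TIdHTTP.Create(nil);
  try
    LClient.HandleRedirects := True;
    LClient.HTTPOptions := LClient.HTTPOptions + [hoNoProtocolErrorException];
    LClient.Request.ContentType := ARequest.ContentType;
    // Stream-in, stream-out overloads: Indy passes the bytes through as-is.
    case ARequest.Method of
      hmGET:
        LClient.Get(ARequest.URL, AResponseBody);
      hmPUT:
        LClient.Put(ARequest.URL, ARequest.Body, AResponseBody);
      hmPOST:
        LClient.Post(ARequest.URL, ARequest.Body, AResponseBody);
    end;
    Result := LClient.ResponseCode;
  finally
    LClient.Free;
  end;
end;

end.
```

With this shape the caller can pass a TFileStream for uploads or a TMemoryStream for downloads without the abstraction knowing the difference, at the cost of the caller owning and freeing every stream.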

Also, when it comes to converting between bytes and strings, don't use BytesOf() and StringOf(), as they don't offer you any control over the encoding used. Use TEncoding.GetBytes() and TEncoding.GetString() instead, or Indy's ToBytes() and BytesToString() functions (which offer encoding parameters), or use its IIdTextEncoding interface directly.
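
To make the point about explicit encodings concrete, here is a minimal console sketch using only the RTL's TEncoding class. The program name and the sample text are made up for illustration:

```pascal
program EncodingRoundTrip;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  LText, LDecoded: string;
  LUtf8, LUtf16: TBytes;
begin
  // "café", spelled with a character code so the source file's own
  // encoding does not matter.
  LText := 'caf' + #$00E9;

  // The same text yields different bytes depending on the encoding chosen.
  LUtf8 := TEncoding.UTF8.GetBytes(LText);
  LUtf16 := TEncoding.Unicode.GetBytes(LText);
  Writeln('UTF-8 bytes:  ', Length(LUtf8));
  Writeln('UTF-16 bytes: ', Length(LUtf16));

  // Decoding with the matching encoding restores the original string.
  LDecoded := TEncoding.UTF8.GetString(LUtf8);
  Writeln('Round trip intact: ', LDecoded = LText);
end.
```

The accented character takes two bytes in UTF-8 and every character takes two bytes in UTF-16, so the two byte arrays differ in length (5 versus 8). That is why the encoding has to be named explicitly on both sides of a conversion.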
I do understand a reasonable amount about HTTP, but not how TIdHTTP abstracts over it. In particular, I did (or still do) not understand if/when Indy infers encodings, how it does so (e.g. looking at the headers of a request/response or at the data), if/when it does conversions or merely "advertises" the encoding (e.g. in the headers). I have now dug a bit deeper into the code.

From what you are saying, I get that for methods that use a TStream for the request body, Indy takes those bytes as-is. Then is any charset ever put in the headers automatically? If so, from where?

For responses, Indy will either write them as-is to the given TStream, or (using another overload) use ReadStringAsCharset to decode them to String based on the charset provided in the headers or HTTP's default of ISO-8859-1 in case none is specified. Only if the content type is XML does it ignore all that and look at the actual data. Apart from that it does not look at content types to determine whether the response should be decoded at all (i.e. considered text) or not (i.e. considered binary). Correct?
(12-18-2018, 12:50 AM)thijsvandien Wrote: In particular, I did (or still do) not understand if/when Indy infers encodings, how it does so (e.g. looking at the headers of a request/response or at the data), if/when it does conversions or merely "advertises" the encoding (e.g. in the headers).

When sending a request, TIdHTTP doesn't really infer very much. It requires you to provide info up front, via the TIdHTTP.Request sub-properties, like ContentType and CharSet, etc. If you don't supply that, it may use some defaults, but not very many. For instance, a default Content-Type header is sent only when posting a TStrings or TIdMultiPartFormDataStream object (where the Content-Type needs to be a specific value). Otherwise, no default Content-Type is sent at all, not even something like "application/octet-stream" (which the server should infer on its own when no Content-Type is present).

When reading a response, TIdHTTP looks at the Content-Type header (or, in the specific case of HTML or non-textual XML, at the body data itself) to determine the charset to assign to the TIdHTTP.Response.CharSet property. And then, if (and only if) you use an overload that returns the body data as a String, that charset is used to decode the body data into Unicode (in Delphi 2007 and earlier, that Unicode is then converted to ANSI for output). If you use an overload that returns the response body in a TStream instead, the body data is returned as-is in its raw form, and you would have to then process it yourself as needed.
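
One way to get "decode text, leave binary alone" behind a single code path is to always fetch into a TStream and decode afterwards. The following is only a sketch under stated assumptions: TryDecodeBody and the unit name are hypothetical, treating only "text/..." media types as text is a simplification (JSON, for example, would need its own check), and it relies on IsHeaderMediaType() and ReadStringAsCharset() from Indy's IdGlobalProtocols unit:

```pascal
unit ResponseDecodeSketch;

interface

uses
  System.Classes, IdGlobalProtocols;

// Returns True and fills AText when the media type looks textual;
// returns False (leaving AText empty) when the body should stay binary.
function TryDecodeBody(ABody: TStream; const AContentType, ACharSet: string;
  out AText: string): Boolean;

implementation

function TryDecodeBody(ABody: TStream; const AContentType, ACharSet: string;
  out AText: string): Boolean;
begin
  AText := '';
  // Assumption: only "text/..." media types are treated as text here.
  Result := IsHeaderMediaType(AContentType, 'text');
  if Result then
  begin
    // Decode from the start of the stream using the reported charset.
    ABody.Position := 0;
    AText := ReadStringAsCharset(ABody, ACharSet);
  end;
end;

end.
```

After a call such as LClient.Get(URL, LStream), the values to pass in would be LClient.Response.ContentType and LClient.Response.CharSet. The raw bytes stay available in the stream either way.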

(12-18-2018, 12:50 AM)thijsvandien Wrote: From what you are saying, I get that for methods that use a TStream for the request body, Indy takes those bytes as-is.

Yes.

(12-18-2018, 12:50 AM)thijsvandien Wrote: Then is any charset ever put in the headers automatically? If so, from where?

There is no default TIdHTTP.Request.CharSet assigned for TStream data. You are responsible for specifying the TIdHTTP.Request.ContentType and TIdHTTP.Request.CharSet properties as needed.

If you do assign the TIdHTTP.Request.ContentType property manually, a default CharSet may be assigned, depending on the particular media type being specified. If the input string specifies a "charset" attribute explicitly, it is used as-is (unless it is blank). If no charset is specified, and only if the media type is a "text/..." type, then the default CharSet is "us-ascii" for XML, and "ISO-8859-1" for other types. But, if the media type is not a "text/..." type, no default CharSet is assigned at all.
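
A small console sketch for checking those rules against whatever Indy version is installed. CharSetFor is a hypothetical helper; it uses a fresh TIdHTTP per call so that an earlier CharSet value cannot influence the result:

```pascal
program CharSetDefaults;

{$APPTYPE CONSOLE}

uses
  IdHTTP;

// Hypothetical helper: reports the CharSet that a fresh TIdHTTP request
// ends up with after only ContentType has been assigned.
function CharSetFor(const AContentType: string): string;
var
  LClient: TIdHTTP;
begin
  LClient := TIdHTTP.Create(nil);
  try
    LClient.Request.ContentType := AContentType;
    Result := LClient.Request.CharSet;
  finally
    LClient.Free;
  end;
end;

begin
  // Explicit charset attribute in the assigned value.
  Writeln('[', CharSetFor('application/json; charset=utf-8'), ']');
  // "text/..." type without a charset attribute.
  Writeln('[', CharSetFor('text/plain'), ']');
  // Non-text type without a charset attribute.
  Writeln('[', CharSetFor('application/octet-stream'), ']');
end.
```

Going by the description above, the three lines should show the explicit charset, the ISO-8859-1 default and an empty value respectively, but that is worth confirming rather than assuming for a given Indy release.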

(12-18-2018, 12:50 AM)thijsvandien Wrote: For responses, Indy will either write them as-is to the given TStream, or (using another overload) use ReadStringAsCharset to decode them to String based on the charset provided in the headers or HTTP's default of ISO-8859-1 in case none is specified.

Or, if the response is HTML or XML, the charset is taken from the HTML/XML header instead. And ISO-8859-1 is not always the default charset used (see above).

Also, if no charset is specified by the response, and none can be inferred implicitly, then ReadStringAsCharset() will end up using Indy's built-in 8bit encoding instead, which will decode the raw bytes as-is to Unicode codepoints U+0000..U+00FF.
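
To illustrate what that byte-to-codepoint mapping means in practice, here is a short sketch. It assumes a recent Indy 10 in which IdGlobal exposes IndyTextEncoding_8Bit together with BytesToString() and ToBytes(); the program name and the sample bytes are made up:

```pascal
program EightBitRoundTrip;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, IdGlobal;

var
  LRaw, LBack: TIdBytes;
  LText: string;
  LChar: Char;
begin
  // Three bytes that are not valid UTF-8 on their own.
  SetLength(LRaw, 3);
  LRaw[0] := $80;
  LRaw[1] := $E9;
  LRaw[2] := $FF;

  // Decode with the 8-bit encoding and print each resulting codepoint.
  LText := BytesToString(LRaw, IndyTextEncoding_8Bit);
  for LChar in LText do
    Write(IntToHex(Integer(Ord(LChar)), 4), ' ');
  Writeln;

  // Encode back with the same encoding and compare lengths.
  LBack := ToBytes(LText, IndyTextEncoding_8Bit);
  Writeln('Same length: ', Length(LBack) = Length(LRaw));
end.
```

If the mapping works as described, each byte value reappears unchanged as a codepoint, and encoding the string again with the same 8-bit encoding gives back the original bytes. A charset-specific decoder such as UTF-8 gives no such guarantee for arbitrary binary data.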

(12-18-2018, 12:50 AM)thijsvandien Wrote: Only if the content type is XML does it ignore all that and look at the actual data.

And HTML, too.

(12-18-2018, 12:50 AM)thijsvandien Wrote: Apart from that it does not look at content types to determine whether the response should be decoded at all (i.e. considered text) or not (i.e. considered binary). Correct?

Only when deciding where to get the value for the TIdHTTP.Response.CharSet property from, and if the response body then needs to be decoded to a String for output.
Thanks for your input. I've decided to let the caller deal with encoding and decoding for now.