SMS character set handling and multipart messages

This guide explains how message length and character set handling work for SMS messages. Both directly affect what each message costs. Please read through it to make sure you are able to configure the system according to your preferences.

Introduction

SMS is mostly used to send text messages between mobile phones. Every transmitted message has a size limit. If only English characters are used, the maximum message length is 160 characters. When international characters are sent, the maximum length is 70 characters. This size limit is determined by the character set used to transmit the message.

SMS segmentation and reassembly (SAR)

To lift this size limit, SMS technology was extended to support longer text messages. This extension is called multipart SMS, and it relies on a procedure known as segmentation and reassembly. If an English text message longer than 160 characters is sent, the sending mobile first segments it and transmits it through the GSM network as several SMS messages. After receiving all message parts, the recipient mobile phone reassembles the segments and displays the long text to the user as a single message. If international characters are used, segmentation starts when the message text becomes longer than 70 characters.

When multipart technology is applied, the cost of a message is determined by the number of SMS messages used to transmit the text over the wireless network. For example, a text message of 240 English characters fits into two SMS messages, so it costs twice as much as a single 160-character SMS.

One might expect that if a single SMS can hold 160 characters, a 320-character message would take two physical SMS messages. This is not the case. When multipart SMS technology is used, only 153 characters fit into a single SMS, because some space is needed for the segmentation information that is used to reassemble the message parts in the correct order. So a 320-character message takes 3 SMS messages. The first two hold 153 characters each, and the last one holds 14 characters. For international characters, 67 characters fit into a multipart SMS segment.
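The arithmetic above can be sketched in a few lines of Python. The function name sms_parts and its parameters are illustrative only and are not part of any product API; the limits (160/153 and 70/67) are the ones given in the text.

```python
import math

def sms_parts(length, international=False):
    """Return how many physical SMS messages a text of `length` characters needs.

    Assumes every character counts as one unit (no two-septet characters).
    """
    # Limits from the text: single-message limit and per-segment limit.
    single_limit, segment_limit = (70, 67) if international else (160, 153)
    if length <= single_limit:
        return 1  # fits into one SMS, no segmentation header needed
    return math.ceil(length / segment_limit)

for n in (160, 161, 240, 320):
    print(n, "characters ->", sms_parts(n), "SMS")
```

Because the billed cost follows the number of parts, a 161-character English message already costs as much as a 306-character one.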

Terms and definitions, SAR technology in detail

To give more exact information, the terms and definitions need to be clarified. When I mentioned English characters, I was referring to the 7-bit GSM SMS alphabet, which contains English characters and a few international characters for Western Europe and Greece. These characters are defined in the ETSI GSM 03.38 standard. When I mentioned international characters, I was referring to the Unicode character set. The Unicode character set can be used to send special symbols and characters of all languages, including Chinese, Arabic, Hebrew, Cyrillic, special Eastern European characters, etc.

In the GSM SMS system, an SMS message can contain up to 140 bytes (standard 8-bit bytes) of message data. The 7-bit SMS alphabet makes it possible to send 160 characters in these 140 bytes. This means that when you send a text message, as long as the text only contains characters that are included in the GSM 7-bit character set, 160 7-bit characters are packed into 140 8-bit bytes to produce the 160-character limit that we are so familiar with. (Note: 160 * 7 = 140 * 8).
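As a rough illustration of how 7-bit characters end up in 8-bit bytes, here is a minimal Python sketch of the packing step. The function name pack_septets is hypothetical. It takes GSM character codes as input; for plain letters and digits these happen to equal the ASCII codes, which is why the usage lines can pass an ASCII byte string.

```python
def pack_septets(septets):
    """Pack a sequence of 7-bit values into 8-bit bytes, lowest bits first."""
    buffer = 0      # bits waiting to be written out
    bit_count = 0   # how many bits are currently in the buffer
    packed = bytearray()
    for septet in septets:
        buffer |= (septet & 0x7F) << bit_count
        bit_count += 7
        while bit_count >= 8:
            packed.append(buffer & 0xFF)
            buffer >>= 8
            bit_count -= 8
    if bit_count:
        packed.append(buffer & 0xFF)  # final, partially filled byte
    return bytes(packed)

print(pack_septets(b"hello").hex())   # 5 characters fit into 5 bytes
print(len(pack_septets([0] * 160)))   # 160 characters fit into 140 bytes
```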

It is worth noting that ETSI GSM 03.38 also defines a few characters that are represented by two 7-bit characters when included in a text message. The extension table of that standard shows these characters, but since there are only a few, I will also list them here: "^", "{", "}", "\", "[", "]", "~", "|" and "€".
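Because these characters take two 7-bit units each, the visible character count can understate the real message length. A minimal sketch, assuming the two-unit characters are ^ { } \ [ ] ~ | and €, and that the text otherwise contains only GSM 7-bit characters (the function name septet_count is illustrative):

```python
# Characters assumed to need two 7-bit units (escape code + character code).
TWO_SEPTET_CHARS = set("^{}\\[]~|€")

def septet_count(text):
    """Count the 7-bit units a GSM text occupies; does not validate the alphabet."""
    return sum(2 if ch in TWO_SEPTET_CHARS else 1 for ch in text)

print(septet_count("[ok]"))   # 4 visible characters, 6 units
```

A message of 158 ordinary characters plus "[]" therefore occupies 162 units and no longer fits into a single SMS.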

If you want to send a message that contains characters that are not part of the GSM 7-bit character set, such as Chinese, Arabic, Thai, Cyrillic, etc., then the entire text of the SMS that actually goes out over the air needs to be encoded in the Unicode UCS-2 character set. In the UCS-2 character set, each character is encoded with 16 bits (or two 8-bit bytes). This means that an SMS message is limited to 70 16-bit Unicode characters (70 * 16 = 140 * 8).

If a message is larger than 140 8-bit bytes, then there are segmentation and reassembly standards defined, where a single logical message can be sent over the air using multiple physical SMS messages. The receiving client then has the ability to reassemble the segmented message so that it again appears as a single message on the receiving device.

When a long text message is segmented into multiple physical SMS messages, a special header is added to each physical SMS message so that the receiving client knows that it is a multipart SMS message that must be reassembled by the client. These headers are known as segmentation or concatenation headers or SAR headers. The SAR headers are 6 bytes (8-bits each). They are included in each physical SMS message. These headers are placed in the User Data Header (UDH) field of the message, but they do count against the overall size limit of the message.
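For illustration, a 6-byte concatenation header of the commonly used form (one length byte, an element identifier, an element length, then a reference number, the total number of parts and the part number) could be built as follows. The function name concat_header is hypothetical, and the 8-bit reference variant shown here is an assumption about which header form is in use.

```python
def concat_header(reference, total_parts, part_number):
    """Build a 6-byte concatenation header for one segment (8-bit reference variant)."""
    if not 1 <= part_number <= total_parts <= 255:
        raise ValueError("part_number must be within 1..total_parts (max 255)")
    return bytes([
        0x05,              # length of the header data that follows
        0x00,              # element identifier: concatenated short message
        0x03,              # length of this element's data
        reference & 0xFF,  # same value in every part of one long message
        total_parts,
        part_number,
    ])

for part in (1, 2, 3):
    print(concat_header(0x2A, 3, part).hex())
```

The receiving phone groups segments by the reference number and orders them by the part number.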

If you send a long text message containing only characters that are part of the GSM 03.38 character set, then each SMS segment can contain up to 153 characters. (140 bytes - 6 bytes for the concatenation header leaves 134 available bytes, or 8 * 134 = 1072 bits. The most 7-bit characters that can be packed into 1072 bits is 153.)

If you send a long text message that includes any characters that require Unicode encoding, then each SMS segment can contain up to 67 characters. (67 * 16 = 1072 bits)
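All four limits mentioned in this guide follow from the same calculation, which can be checked with a short sketch (the names are illustrative):

```python
SMS_PAYLOAD_BYTES = 140     # message data per physical SMS
CONCAT_HEADER_BYTES = 6     # SAR header in each segment of a multipart message

def chars_per_sms(bits_per_char, multipart):
    """Whole characters that fit into one physical SMS."""
    header = CONCAT_HEADER_BYTES if multipart else 0
    available_bits = (SMS_PAYLOAD_BYTES - header) * 8
    return available_bits // bits_per_char

for bits, name in ((7, "GSM 7-bit"), (16, "UCS-2")):
    print(name, chars_per_sms(bits, False), chars_per_sms(bits, True))
```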

Character conversions and character sets

When you use Ozeki NG SMS Gateway, you send SMS messages from your PC. The character set on your PC is a Windows or Unix charset, not the GSM 7-bit alphabet or the UCS-2 encoding used for Unicode SMS messages. For example, you might use UTF-8, ISO-8859-1 or ISO-8859-2. In all cases, some kind of character conversion needs to take place to map your PC characters to the appropriate SMS characters. This conversion determines the type of message (SMS with English characters or SMS with Unicode characters) you can send through the GSM network. If this conversion is not handled carefully, you might run into extra costs.

Ozeki NG SMS Gateway

Ozeki NG SMS Gateway will perform the character set conversion for you according to the policy you select, and will do the segmentation and reassembly of long text messages accordingly. To choose a preferred conversion policy, use the following options on the "Charsets" tab of the configuration form of the SMS service provider connection (e.g. the GSM modem configuration form).

Best match: Convert to preferred character set if lossless conversion is possible. (Character substitutions are not allowed.)
Transform: Convert to preferred character set if possible. (Character substitutions are allowed.)
Enforce: Always use the preferred charset. (Character substitutions and character losses are allowed.)

These options, along with the preferred character set setting, allow you to configure the character set conversion. For example, if you select "GSM 7 bit" as your preferred charset and select "Enforce" as the character set encoding policy on the configuration form of your service provider connection (e.g. the GSM modem configuration form), you can be sure that only the English (160-character) message encoding will be used (Figure 1). This means lower message costs, but it also means that some international characters and special symbols will not be displayed correctly on the recipient handset, because the GSM 7-bit alphabet does not have a corresponding character for every symbol.

Figure 1 - Selecting the enforce policy
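To make the difference between the three policies more concrete, here is a rough Python sketch of how they could behave when "GSM 7 bit" is the preferred charset. This is my own interpretation of the option descriptions above, not the actual Ozeki NG implementation: the function names, the policy strings, the fallback to UCS-2, the accent-stripping substitution and the "?" replacement are all assumptions.

```python
import unicodedata

# GSM 03.38 default alphabet (code order) plus the two-unit extension characters.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "Δ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./"
    "0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM_CHARS = (set(GSM_BASIC) - {"\x1b"}) | set("^{}\\[]~|€")

def substitute(ch):
    """Assumed substitution rule: strip accents, e.g. a double-acute o becomes o."""
    base = unicodedata.normalize("NFKD", ch)[0]
    return base if base in GSM_CHARS else None

def convert(text, policy):
    """Return (text_to_send, encoding) for a hypothetical policy name."""
    if all(ch in GSM_CHARS for ch in text):
        return text, "GSM-7"                 # lossless in every policy
    if policy == "best_match":
        return text, "UCS-2"                 # no substitutions allowed
    mapped = [ch if ch in GSM_CHARS else substitute(ch) for ch in text]
    if policy == "transform":
        if None in mapped:
            return text, "UCS-2"             # substitution not possible
        return "".join(mapped), "GSM-7"
    if policy == "enforce":
        # Characters without a substitute are lost (shown here as "?").
        return "".join(m if m is not None else "?" for m in mapped), "GSM-7"
    raise ValueError("unknown policy: " + policy)

for policy in ("best_match", "transform", "enforce"):
    print(policy, convert("Erdő 中", policy))
```

Under these assumptions, only "Enforce" guarantees the cheaper 160-character encoding, at the price of replacing characters the GSM alphabet cannot represent.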

More information