Email address input

When I initially wrote the new PHPMailer Pro, the goal was to make it as user friendly as possible. After looking through support requests, I added many new features, including detection of the mail server, port, and protocol.

Much of the challenge was around user input of email addresses. Most of the issues came down to these questions:

  • input name first email address second, or is it email address first and name second?
  • input email address with or without chevrons ("<" and ">")?
  • input multiple names separated by commas or semi-colons?
  • input multiple names as arrays? which type of arrays?

Once users made errors, they learned the rigid way of input ... my task was to make that even simpler and ease the learning curve. I'm now writing a new class and need to handle email addresses for mailing lists and for email transport.

The first task is to define all the ways users could input ... and this would end up as part of the development plan:

  1. Email addresses could be input as "name address" or "address name" and with or without chevrons ... as well as using separators of their choice (comma or semi-colon)
  2. Strings. User input could be as strings, separated by commas (preferably) or semi-colons. 
  3. Arrays. User input could be as arrays, in any order, and any type of array:
    • Indexed array
    • Associative array
    • Multi-dimension array
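
For string input (item 2 above), the first job is to break a list into single entries. Here is a minimal sketch, assuming separators inside double quotes should be left alone. The function name split_address_list() is hypothetical and is not taken from the class.

```php
<?php
// Hypothetical helper, not part of the real class: split a list of entries on
// commas or semi-colons, ignoring separators that sit inside double quotes.
function split_address_list(string $input): array
{
    // The lookahead only allows a split when an even number of double quotes
    // remains to the right, which means the separator is outside quoted text.
    $parts = preg_split('/[,;](?=(?:[^"]*"[^"]*")*[^"]*$)/', $input);
    $parts = array_map('trim', $parts);

    // Drop empty pieces left by trailing or doubled separators.
    return array_values(array_filter($parts, fn(string $p): bool => $p !== ''));
}

$entries = split_address_list(
    '"Last, First" <first.last@example.com>; Jane <jane@example.com>, bob@example.com'
);
```

The comma inside "Last, First" is kept because it sits between a pair of quotes; the other two separators split the string into three entries.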

To handle all this variety of input, I had to write some new functions (built into the class) to detect the type of array and distinguish between all the various types. The only help from PHP was the is_string() and in_array() functions.

To continue with this article, we need to use some different wording ... wording that will end up making it into the new class. The email address portion of user input will be "address" and the name portion will be "display". You'll get it as we go further into the article.

My first attempt at the new software was to take the input and create the RFC formatted string as well as return the input in the same type as the user input but formatted as RFC. The intent was to help the user support mailing lists and get them in a corrected format. This ended up being a support nightmare with too many intermediate processes required. 

The second, and last attempt, is to create array properties to store the user input with the key being the address and the value being the display. This becomes the really easy way to detect (and prevent) duplicates. It's also easy to limit the input to standards: max 500 email addresses for combined To, Cc, and Bcc; up to 500 email addresses for Reply-To; up to 500 email addresses for From; and max 1 email address for each of Sender and Return-Path.
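
To show why keying on the address makes duplicates and limits easy, here is a minimal sketch. The class name AddressList and its methods are hypothetical. It models one list per header, so the combined To, Cc, and Bcc limit is not covered here.

```php
<?php
// Hypothetical sketch: names are illustrative, not taken from the real class.
final class AddressList
{
    /** @var array<string, string> address => display */
    private array $entries = [];
    private int $limit;

    public function __construct(int $limit)
    {
        $this->limit = $limit;
    }

    // Returns true when stored, false for a duplicate or a full list.
    public function add(string $address, string $display = ''): bool
    {
        $address = trim($address);
        if (isset($this->entries[$address])) {
            return false; // duplicate: the address is already a key
        }
        if (count($this->entries) >= $this->limit) {
            return false; // the list has reached its limit
        }
        $this->entries[$address] = trim($display);
        return true;
    }

    public function all(): array
    {
        return $this->entries;
    }
}

$replyTo = new AddressList(500);
$replyTo->add('first.last@sub.example.com', 'First Last');
$replyTo->add('first.last@sub.example.com', 'Someone Else'); // ignored as a duplicate

$sender = new AddressList(1);
$sender->add('first.last@sub.example.com');
$sender->add('other@example.com'); // ignored, Sender holds one address
```

The duplicate check is a single isset() on the key, which is the main reason for storing the address as the key rather than the value.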

Still, a nice plan, but we need to deal with user input. I started with string user input, the most basic input type, so that I could get my new methods (functions) working right. I first needed to detect the correct order of input: address display, or display address. Sounds simple. However, the display portion could include quotation marks (as in RFC) and other special characters. Addresses could include domain names, sub-domains, multiple dots in the domain portion, and multiple dots and other separators in the local part. Ah, an address has three parts, and one of those is the separator "@". On the left is the local part, on the right is the domain part. The RFC limits each dot-separated label in the domain part, including the top-level domain, to 63 octets, although 24 characters appears to be the longest top-level domain at this point in time. Still need to allow for 63 octets because that's what the RFC calls for.

There are many differing opinions on working with email addresses, with most developers creating one function to detect the display portion and a separate function to detect the address portion, both with exotic regex patterns. I already have a regex pattern to detect the address portion ... all I had to do was modify that to support up to 63 octets. Done.

For the display portion, there really isn't need for a separate function. Creating a separate function would be a lot of unneeded complexity. Think of it, the easiest way to get to the display portion is to take the overall input and remove the email address that we found with the method above.

To illustrate, let's start with an example:
First Last first.last@sub.example.com
We can read it properly: the display portion is "First Last" and the address portion is "first.last@sub.example.com".

My development plan is to create a method (function) to get the email address. Next, remove the address from the overall user input string. What's left (trimmed) is the display portion. Trimmed means remove the spaces at the left and right sides of the remaining string.
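
The plan above can be sketched in a few lines. This is only an illustration: split_address() is a hypothetical name, and the pattern is a deliberately simple stand-in, not the regex used in the class. It does allow labels and a top-level domain of up to 63 characters.

```php
<?php
// Hypothetical helper; the pattern is simplified and is NOT the class's regex.
function split_address(string $input): ?array
{
    // Local part, "@", dot-separated labels (max 63 chars each), then the TLD.
    $pattern = '/[A-Za-z0-9._%+\-]+@'
        . '(?:[A-Za-z0-9](?:[A-Za-z0-9\-]{0,61}[A-Za-z0-9])?\.)+'
        . '[A-Za-z]{2,63}/';

    if (preg_match($pattern, $input, $match) !== 1) {
        return null; // no address found in the input
    }
    $address = $match[0];

    // Remove the address, then trim spaces, chevrons, and quotes from the rest.
    $remainder = str_replace($address, '', $input);
    $display = trim($remainder, " \t\n\r<>\"'");

    return ['address' => $address, 'display' => $display];
}

$a = split_address('First Last first.last@sub.example.com');
$b = split_address('<first.last@sub.example.com> "First Last"');
```

Both calls give the same address and the same display, which shows that the order of input and the chevrons stop mattering once the address is found first.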

While creating the address method, it was easy to also create two additional methods. One to get (and validate) the domain name and one to get the hostname.

Many developers try to create regex to get the domain name. Not me. There are way too many pitfalls trying to create a regex that handles the basic two element names and handle country specific three (or more) element names. My process is simpler. Let me describe it. First step is to validate the domain as found (in our example above, that would be test "sub.example.com"). If it doesn't validate (which usually means the address portion includes a subdomain), remove the first part (at the left). In our example above that would mean remove "sub" and test "example.com". I can't think of one single example where this wouldn't work to get the validated domain name.
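
Here is a minimal sketch of that loop. How the class actually validates a domain is not shown in this article, so the check is passed in as a callable. The name find_valid_domain() and the stand-in validator are assumptions for the example; a real check might use PHP's checkdnsrr() instead.

```php
<?php
// Hypothetical helper: $isValid stands in for whatever validation is really used.
function find_valid_domain(string $host, callable $isValid): ?string
{
    $labels = explode('.', $host);

    // Keep at least two labels so a bare top-level domain is never tested.
    while (count($labels) >= 2) {
        $candidate = implode('.', $labels);
        if ($isValid($candidate)) {
            return $candidate;
        }
        array_shift($labels); // drop the leftmost label, e.g. "sub"
    }
    return null; // nothing validated
}

// Stand-in validator for the example only (a fixed list instead of a DNS lookup).
$known = ['example.com', 'example.co.uk'];
$isValid = fn(string $domain): bool => in_array($domain, $known, true);

$domain = find_valid_domain('sub.example.com', $isValid);
```

With this stand-in, "sub.example.com" fails, "sub" is removed, and "example.com" validates. A three-element name such as "example.co.uk" is handled by the same loop with no special cases.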

Now I have the full address, the full domain, and the validated domain name. Remove the full address, trim the remainder, and I have the display portion. With the address and display portions separated, I can create the basic array in the format: $array[$address] = $display.

Now that I had the core of working with basic input, I could focus on other types of input. That means arrays.

PHP's is_array() function detects whether a variable is an array and returns true if the variable is an array and false if it isn't. There is nothing in PHP's functions to detect the type of array. I ended up writing three new methods. One to detect if an array is "indexed" ... that is where the key is numeric. The second is to detect "associative" arrays ... that is where the key is a string and usually associated with the value. An example is $array['John Doe'] = "Fake Name" ... one problem of detecting associative array is that the key can be anything including numeric. The third is to detect "multi-dimensional" arrays ... that is where the value of a key is an array of its own.

To accomplish this, I first wrote the three methods. I used the same naming convention as PHP:

  • is_indexed_array
  • is_associative_array
  • is_multi_array

All three accept a parameter (array) and return a boolean (true or false). That helps in the decision process in dealing with the user input.

By the way, since an associative array can include a numeric key to mimic an indexed array, there are extra steps in determining the type of array.

Here's how I handled it. A multi-dimensional array has a unique characteristic. The value is an array. An indexed array has a unique characteristic. The key is numeric only -- however, is_string() on the key is not a reliable test by itself, since a key that looks numeric can still be a string (PHP stores the key "8" as the integer 8, but "08" or "1.5" remain strings). Although a unique characteristic, it is hard to code for the logic portion. The last, an associative array, also has a unique characteristic. The key is a string ... but recall above, a string key can look like a number - so the key then can be what appears as a numeric value.

The order, then, is clear. First try to detect a multi-dimensional array. Second try to detect an indexed array (see my note about this in the next paragraph). Third, try to detect an associative array.

Now a note. An indexed array is also known as a sequential array. I always use indexed array because sequential suggests the key is in sequential order -- and that's just not the case. Indexed array keys can be in order or not. 
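
The detection order can be sketched as follows. The three function names come from the list above, but the bodies are assumptions written for this example, not the code from the class. The sketch relies on one documented PHP behavior: a string key holding a plain decimal integer, such as "8", is stored as an integer key.

```php
<?php
// Sketch only: the bodies are assumptions, not the class's real methods.

// True when at least one value is itself an array.
function is_multi_array(array $array): bool
{
    foreach ($array as $value) {
        if (is_array($value)) {
            return true;
        }
    }
    return false;
}

// True when no value is an array and every key is an integer (in any order).
// An empty array counts as indexed here, which is a choice made for the sketch.
function is_indexed_array(array $array): bool
{
    if (is_multi_array($array)) {
        return false;
    }
    foreach (array_keys($array) as $key) {
        if (!is_int($key)) {
            return false;
        }
    }
    return true;
}

// True when no value is an array and at least one key is a string.
function is_associative_array(array $array): bool
{
    return !is_multi_array($array) && !is_indexed_array($array);
}

// Apply the checks in the order described: multi, then indexed, then associative.
function array_type(array $array): string
{
    if (is_multi_array($array)) {
        return 'multi';
    }
    return is_indexed_array($array) ? 'indexed' : 'associative';
}
```

Because the keys do not have to be in order, an array such as [5 => 'a@example.com', 2 => 'b@example.com'] still counts as indexed in this sketch.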

It turned out that I needed a way to handle processing multi-dimensional arrays and a separate way to handle processing both indexed and associative arrays. In terms of the process, it doesn't matter much if the key is a string or numeric. It's the value that matters ... can't have that as an array.
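
One way to keep the two code paths small is to reduce a multi-dimensional array to a flat one first. The sketch below assumes each inner array holds the pieces of a single entry; the shape of the inner arrays and the name flatten_multi_array() are assumptions for the example.

```php
<?php
// Hypothetical sketch: turn each inner array into one string so the result
// has no array values left and can be handled by the flat-array code path.
function flatten_multi_array(array $input): array
{
    $flat = [];
    foreach ($input as $key => $value) {
        if (is_array($value)) {
            // Join the inner strings into one entry; anything nested deeper
            // (or not a string) is dropped in this simplified version.
            $flat[] = implode(' ', array_filter($value, 'is_string'));
        } elseif (is_int($key)) {
            $flat[] = $value;      // integer keys are simply renumbered
        } else {
            $flat[$key] = $value;  // string keys are kept as they are
        }
    }
    return $flat;
}

$flat = flatten_multi_array([
    ['First Last', 'first.last@example.com'],
    ['address' => 'jane@example.com', 'display' => 'Jane Doe'],
    'solo@example.com',
]);
```

After this step every value is a string again, so the order of address and display inside each entry can be sorted out the same way as for plain string input.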

Where the difference between indexed and associative arrays became an issue is after stripping out the email address ... recall that I said the remainder is the display portion. A bare number (such as the key of an indexed array) isn't really the display portion. That's where the new method came into play.
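
Here is a minimal sketch of that step for flat arrays. It assumes an integer key carries no information and a string key is part of the entry. Both function names and the simplified pattern are hypothetical stand-ins, not the class's methods.

```php
<?php
// Hypothetical stand-in for the address method, with a deliberately loose pattern.
function find_address(string $text): ?string
{
    $pattern = '/[^\s<>"\',;]+@[^\s<>"\',;]+\.[^\s<>"\',;]+/';
    return preg_match($pattern, $text, $match) === 1 ? $match[0] : null;
}

// Build address => display pairs from an indexed or associative array.
function normalize_flat_array(array $input): array
{
    $result = [];
    foreach ($input as $key => $value) {
        // An integer key is only a position, so it must not leak into the display.
        $text = is_int($key) ? (string) $value : $key . ' ' . $value;

        $address = find_address($text);
        if ($address === null) {
            continue; // skip entries without an address
        }

        // What remains after removing the address is the display portion.
        $display = trim(str_replace($address, '', $text), " \t\n\r<>\"'");
        $result[$address] = $display;
    }
    return $result;
}

$list = normalize_flat_array([
    'First Last <first.last@example.com>',
    'jane@example.com' => 'Jane Doe',
    'John Doe' => 'john@example.com',
]);
```

For the associative entries it does not matter whether the address is the key or the value, since both are searched together. For the indexed entry the key 0 is left out, so it never ends up as the display.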

This portion of the project is complete and user input creates arrays for each of the headers in an email. 



