
What XML parsers are for and how they can be useful. Parsing XML with SimpleXML in PHP

The other day I began reworking my company's internal reporting system, whose general structure I described not long ago. Frankly, my PHP skills have grown since then, and I realized that the system's algorithm was crooked enough to deserve a rewrite.

Until now, the XML document was parsed with functions inherited from PHP 4. PHP 5, however, gave the world a very handy thing called SimpleXML. How to work with it is today's topic.

To begin with, SimpleXML is a separate extension, so it must be enabled on the server you are using.

Now we can work!

To process a document, we use the simplexml_load_file() function. Its parameter is the address of a file in eXtensible Markup Language (XML) format.

The beauty of this function is that you can just as easily pass it a file from any server. This lets us process external XML exports (for example, Yandex.XML or third-party RSS feeds).

The function returns an object (a SimpleXMLElement) whose elements can be addressed much like those of an array. The pitfall I ran into is that XML can have an awkward structure, so I advise you to start by dumping the result, for example with print_r(), to see how the function handled the document. After that, you can start processing the data.

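For example, I'll take a simple document from here:

<?xml version="1.0" standalone="yes"?>
<movies>
 <movie>
  <title>PHP: Introducing the Parser</title>
  <characters>
   <character>
    <name>Ms. Coder</name>
    <actor>Onlivia Actora</actor>
   </character>
   <character>
    <name>Mr. Coder</name>
    <actor>El ActÓr</actor>
   </character>
   <character><name>Mr. Parser</name><actor>John Doe</actor></character>
  </characters>
  <plot>
   So it is a language. It's still a programming language. Or
   is it a scripting language? Everything is revealed in this documentary
   like a horror movie.
  </plot>
  <great-lines>
   <line>PHP solves all my web tasks</line>
  </great-lines>
  <rating type="thumbs">7</rating>
  <rating type="stars">5</rating>
  <rating type="mpaa">PG</rating>
 </movie>
</movies>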

Let this be the file export.xml, located in the root of my server next to the script that processes it.
The object mirrors the structure of the elements in the XML document. Processing starts from the root: the object returned by the function stands for the root element <movies> itself, so paths begin with its children. To get the name Ms. Coder, we build the following path: $xml->movie->characters->character[0]->name.
Note that we pick a specific element with the index [0]. The index is there because <characters> holds several <character> elements, and SimpleXML lets us address them like array items.

Like an array, this data can be processed with a foreach loop. The code looks like this:

$xml = simplexml_load_file("export.xml"); // loaded the file
$ttl = $xml->movie->title; // got the title; there is only one, so no index is needed

foreach ($xml->movie->characters->character as $crc) // and now let's work dynamically
{
    // display the characters' names
    $name = $crc->name;
    echo "$name<br>\n";
}

This code puts the text "PHP: Introducing the Parser" into the $ttl variable and then displays the characters' names line by line:
Ms. Coder, Mr. Coder, Mr. Parser.
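
The traversal can be tried without a file on disk. The following is a minimal sketch, assuming the XML is passed inline to simplexml_load_string() instead of being read from export.xml; the shortened document and the variable names are illustrative only. It relies on the returned object standing for the root element <movies>, so paths start at its children.

```php
<?php
// Assumption: a shortened inline copy of the movie document, not the author's export.xml.
$source = <<<XML
<?xml version="1.0" standalone="yes"?>
<movies>
 <movie>
  <title>PHP: Introducing the Parser</title>
  <characters>
   <character><name>Ms. Coder</name><actor>Onlivia Actora</actor></character>
   <character><name>Mr. Coder</name><actor>El Actor</actor></character>
   <character><name>Mr. Parser</name><actor>John Doe</actor></character>
  </characters>
 </movie>
</movies>
XML;

// simplexml_load_string() returns a SimpleXMLElement for the root element <movies>.
$xml = simplexml_load_string($source);

// Child elements are reached as properties; cast to string to get the text.
$title = (string) $xml->movie->title;

// Iterating over ->character visits every <character> sibling in turn.
$names = [];
foreach ($xml->movie->characters->character as $character) {
    $names[] = (string) $character->name;
}

echo $title, "\n";
echo implode(", ", $names), "\n";
```

Casting to string matters when the values are stored or compared, because each property is itself a SimpleXMLElement object rather than plain text.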


I have seen many XML parsers, but I had never dealt with them in web programming. Now I want to figure out, together with you, how to make a simple XML parser in PHP.

What for? Because we need it!

Seriously, though: XML files are a very useful thing. And any professional must... well, not must, but really ought to know how to work with them. Do we want to become professionals? If you are reading my blog, you have that desire.

We will assume that we know what XML is and will not describe it here. And if we don't, we can easily find out here: http://ru.wikipedia.org/wiki/XML

While looking for ways to parse XML in PHP, I discovered a simple set of PHP functions for working with XML files called "XML Parser Functions". Parsing begins with initializing the parser by calling the xml_parser_create function:

$xml_parser = xml_parser_create();

Then we need to tell the parser which functions will handle the XML tags and the text it comes across while parsing. That is, we need to install some handlers:

xml_set_element_handler($xml_parser, "startElement", "endElement");

This function sets the handlers for the start and the end of an element. For example, if the XML file contains a pair such as <tag></tag>, the startElement function fires when the parser finds the opening <tag>, and endElement fires when it finds the closing </tag>.

According to the PHP documentation, the startElement and endElement functions themselves take several parameters:

function startElement($parser, $name, $attrs) { ... } // parser, element name, array of attributes
function endElement($parser, $name) { ... } // parser, element name

But how do we read data from a file? None of the functions so far has had a parameter for that. Here is the catch: reading the file is the programmer's responsibility, i.e. we have to use the standard file functions:

if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

The file is open. Now we need to read it line by line and feed each line to the xml_parse function:

while ($data = fgets($fp)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        echo "<br>XML Error: " . xml_error_string(xml_get_error_code($xml_parser));
        echo " at line " . xml_get_current_line_number($xml_parser);
        break;
    }
}

There are two very important things to note here. First, the third parameter of xml_parse must be a flag saying whether this is the last portion of data (true if the line is the last one, false otherwise). Second, as in any business, we must watch for errors. The functions xml_get_error_code and xml_error_string are responsible for this: the first returns the error code, and the second turns that code into a textual description of the error. What an error looks like in practice we will see later. The equally useful function xml_get_current_line_number tells us the number of the line currently being processed.
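
As a rough illustration of these error-reporting functions, the sketch below feeds a deliberately broken document to the parser line by line. The inline $lines array stands in for a file (an assumption made so the example runs on its own), and the exact wording of the error message depends on the XML library PHP was built with.

```php
<?php
// Assumption: three hard-coded lines replace the file that fgets() would read.
$lines = [
    "<root>\n",
    "<phone>71234567890</telephone>\n", // the closing tag does not match
    "</root>\n",
];

$parser = xml_parser_create();
$errorMessage = null;
$errorLine = null;

foreach ($lines as $index => $line) {
    // Third argument of xml_parse(): true only for the last portion of data.
    $isLast = ($index === count($lines) - 1);
    if (!xml_parse($parser, $line, $isLast)) {
        $errorMessage = xml_error_string(xml_get_error_code($parser));
        $errorLine = xml_get_current_line_number($parser);
        break; // parsing cannot continue after an error
    }
}

if ($errorMessage !== null) {
    echo "XML Error: $errorMessage at line $errorLine\n";
}
```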

And, as always, we must release the resources we have taken from the system. For XML parsing, this is done by the xml_parser_free function:

xml_parser_free($xml_parser);

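That covers the main functions. It is time to see them in practice. For this, I came up with an XML file with a very simple structure:

<?xml version="1.0"?>
<root>
<info who="mine">
<address ulica="my street !!" kvartira="12" dom="15">123</address>
<phone>71234567890</phone>
</info>
</root>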

Let's call this file data.xml and try to parse it using the following code:

<?php
function startElement($parser, $name, $attrs)
{
    global $depth;

    echo str_repeat("&nbsp;", $depth * 3); // indentation
    echo "Element: $name<br>"; // element name
    $depth++; // increase the depth so that the browser shows indents
    foreach ($attrs as $attr => $value) {
        echo str_repeat("&nbsp;", $depth * 3); // indentation
        // display the attribute name and its value
        echo "Attribute: " . $attr . " = " . $value . "<br>";
    }
}

function endElement($parser, $name)
{
    global $depth;
    $depth--; // decrease the depth
}

$depth = 0;
$file = "data.xml";
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");

if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

while ($data = fgets($fp)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        echo "<br>XML Error: ";
        echo xml_error_string(xml_get_error_code($xml_parser));
        echo " at line " . xml_get_current_line_number($xml_parser);
        break;
    }
}
fclose($fp);
xml_parser_free($xml_parser);
?>

As a result of running this very simple script, the browser displayed the following:

Element: ROOT
Element: INFO
Attribute: WHO = mine
Element: ADDRESS
Attribute: ULICA = my street !!
Attribute: KVARTIRA = 12
Attribute: DOM = 15
Element: PHONE

Let's try to spoil the XML file by replacing the opening tag <phone> with <telephone> while leaving the closing tag as it was:

Element: ROOT
Element: INFO
Attribute: WHO = mine
Element: ADDRESS
Attribute: ULICA = my street !!
Attribute: KVARTIRA = 12
Attribute: DOM = 15
Element: TELEPHONE

XML Error: Mismatched tag at line 5

Wow! The error messages are working! Moreover, they are quite informative.

Oh, I forgot one more thing... We did not display the text contained inside the address and phone tags. Let's fix this shortcoming by adding a text handler with the xml_set_character_data_handler function:

xml_set_character_data_handler($xml_parser, 'stringElement');

And add the handler function itself to the code.
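
The handler body is not shown in the text. A minimal sketch of what such a text handler could look like follows, assuming it receives the parser and a piece of text, as character data handlers do. Here the handler is a closure stored in $stringElement rather than a function registered by name, and the inline XML and the $texts array are illustrative.

```php
<?php
$texts = []; // collected non-empty text fragments

// Character data handler: receives the parser and a piece of text.
$stringElement = function ($parser, $str) use (&$texts) {
    // Whitespace between tags also arrives here, so skip empty fragments.
    if (trim($str) !== '') {
        $texts[] = trim($str);
    }
};

$parser = xml_parser_create();
xml_set_character_data_handler($parser, $stringElement);

// Assumption: a small inline document instead of data.xml.
$xml = '<root><address dom="15">123</address><phone>71234567890</phone></root>';
xml_parse($parser, $xml, true);

foreach ($texts as $text) {
    echo "Text: $text\n";
}
```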

An XML parser is a program that extracts data from a source file in XML format and saves it or uses it for further processing.

Why are XML parsers needed?

Primarily because the XML format itself is a popular standard. In essence, an XML file consists of tags, and there are rules for how tags must follow one another.

The reason for the popularity of XML files is that they are highly human-readable and relatively easy to process in programs.

Cons of XML files.

The main downside is the large amount of disk space this data takes up. Because the tags are repeated over and over, on large data sets they add up to a relatively large number of megabytes, which first have to be downloaded from the source and then processed. Are there alternatives? There are, of course, but XML and XML parsers remain one of the simplest, most reliable and most widely used options today.

How are XML parsers written?

Parsers are written in programming languages, in practically any of them, though some are better suited than others. Some programming languages already have built-in libraries for parsing XML files. But even if there is no built-in library, you can always find a suitable third-party one and use it to extract data from a file.

Broadly, there are two different approaches to parsing XML files.

The first is to load the XML file completely into memory and then manipulate it to extract the data.

The second is the streaming option. In this case, certain tags are designated to which the parser's functions must react, and the programmer decides what to do when a particular tag is encountered.

The advantage of the first approach is speed: load everything into memory at once, then quickly run through memory and find what is needed. Most importantly, it is easy to program. But there is a very important minus: it requires a large amount of memory. Sometimes, I would even say often, it is simply impossible to process an XML file this way. Why? For example, a 32-bit application under Windows can occupy at most 2 gigabytes of memory, no more.

The streaming option, however, is harder to program. With a sufficiently complex extraction, the complexity grows significantly, which affects both deadlines and budget.

The validity of XML files and parsers.

Everything would be fine with XML files and XML parsers, but there is a problem. Since "any schoolchild" can create an XML file, and in reality that is what happens (a lot of code is written by schoolchildren), invalid, i.e. incorrect, files appear. What does this mean and what is the risk? The biggest one is that it is sometimes simply impossible to parse an invalid file correctly. For example, its tags are not closed as the standard requires, or the encoding is declared incorrectly. Another problem: if, for example, you write a parser in .NET, you can create so-called wrappers, and the most annoying thing happens when you build such a wrapper and then feed it a file that the "schoolchild" created, but the file is invalid and cannot be read. This happens because many people create XML files without using standard libraries and with complete disregard for all XML standards. It is hard to explain this to customers. They are waiting for the result: an XML parser that converts the data from the original file into another format.

How to create XML parsers (first option)

There is a language for querying XML data called XPath. The language exists in several versions; we will not go into the specifics of each. Examples of how it is used to retrieve data give the best feel for it. For example:

//div[@class="supcat guru"]/a[contains(@href, "catalog.xml?hid=")]

What does this query do? It takes all a tags whose href contains the text catalog.xml?hid= and which are children of a div whose class equals supcat guru.

It may not be clear enough at first, but you can figure it out if you want to. My starting point was http://en.wikipedia.org/wiki/XPath and I recommend it to you as well.
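
To make the query more tangible, here is a small sketch that runs it through PHP's DOMXPath class. The markup, the hid value and the contains() predicate on href are assumptions made for illustration, based on the description of the query above; the page the query was written for is not available.

```php
<?php
// Assumption: a tiny well-formed fragment that imitates the page being queried.
$html = <<<HTML
<html><body>
<div class="supcat guru">
  <a href="catalog.xml?hid=91491">Phones</a>
  <a href="about.html">About</a>
</div>
<div class="other"><a href="catalog.xml?hid=1">Ignored</a></div>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadXML($html);

$xpath = new DOMXPath($dom);
// Links whose href contains "catalog.xml?hid=" and whose parent div has class "supcat guru".
$query = '//div[@class="supcat guru"]/a[contains(@href, "catalog.xml?hid=")]';

$links = [];
foreach ($xpath->query($query) as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links);
```

Only the first link matches: the second has a different href, and the third sits in a div with another class.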


Publication of this article is allowed only with a link to the website of the author of the article.

In this article, I will show an example of how to parse a large XML file. If your server (hosting) does not forbid increasing the script's running time, you can parse an XML file weighing gigabytes; personally, I have only parsed files from Ozon weighing 450 megabytes.

There are two problems when parsing large XML files:
1. Not enough memory.
2. Not enough time allotted for the script to run.

The second problem, time, can be solved if the server does not forbid it.
The memory problem is harder to solve: even on your own server it is not easy to juggle files of 500 megabytes, and on shared hosting or a VDS increasing the memory is simply not possible.

PHP has several built-in ways to process XML: SimpleXML, DOM, SAX.
All of them are described in detail in many articles with examples, but all the examples demonstrate working with a complete XML document.

Here is one example: we get an object from an XML file.

Now you can process this object, BUT...
As you can see, the entire XML file is read into memory and then parsed into an object.
That is, all the data goes into memory, and if the allocated memory is too small, the script stops.

This option is not suitable for processing large files; you need to read the file piece by piece and process the data in turn.
In this case, validity is checked as the data is processed, so you need to be able to roll back (for example, delete everything already written to the database if the XML file turns out to be invalid) or make two passes through the file: first read it to check validity, then read it to process the data.

Here is a theoretical example of parsing a large XML file.
This script reads one character at a time from the file, assembles the data into blocks and sends them to the XML parser.
This approach completely solves the memory problem and puts no load on the server, but it makes the time problem worse. How to deal with the time problem is described below.

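<?php
function webi_xml($file)
{

    ####################################################
    ### function for working with data
    function data($parser, $data)
    {
        print $data;
    }
    ############################################

    ####################################################
    ### opening tags function
    function startElement($parser, $name, $attrs)
    {
        print $name;
        print_r($attrs);
    }
    ###############################################

    #################################################
    ## end tag function
    function endElement($parser, $name)
    {
        print $name;
    }
    ############################################

    $xml_parser = xml_parser_create();
    xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);

    // specify which functions will work when opening and closing tags
    xml_set_element_handler($xml_parser, "startElement", "endElement");

    // specify a function for working with data
    xml_set_character_data_handler($xml_parser, "data");

    // open the file
    $fp = fopen($file, "r");

    $perviy_vxod = 1; // flag for the first read from the file
    $data = ""; // data collected from the file piece by piece

    // loop until the end of the file is found
    while ($fp and !feof($fp))
    {
        $simvol = fgetc($fp); // read one character from the file
        $data .= $simvol; // add it to the data to be sent

        // keep collecting characters until a tag is closed
        if ($simvol != ">") { continue; }

        // on the first read, drop everything before the <?xml tag
        if ($perviy_vxod) { $data = strstr($data, "<?xml"); $perviy_vxod = 0; }

        // send the collected data to the xml parser
        if (!xml_parse($xml_parser, $data, feof($fp))) {
            echo "<br>XML Error: " . xml_error_string(xml_get_error_code($xml_parser));
            echo " at line " . xml_get_current_line_number($xml_parser);
            break;
        }

        // discard the collected data before the next step of the loop
        $data = "";
    }
    fclose($fp);
    xml_parser_free($xml_parser);
}

webi_xml("1.xml");

?>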

In this example, I put everything into a single function webi_xml(), and at the very bottom you can see it being called.
The script itself consists of three main functions:
1. startElement(), which catches the opening of a tag
2. endElement(), which catches the closing of a tag
3. data(), which receives the data.

Suppose the file 1.xml contains a recipe:

<?xml version="1.0"?>
<recipe>
<title>Simple bread</title>
<ingredient amount="3" unit="glass">Flour</ingredient>
<ingredient amount="0.25" unit="gram">Yeast</ingredient>
<ingredient amount="1.5" unit="glass">Warm water</ingredient>
<ingredient amount="1" unit="teaspoon">Salt</ingredient>
<instructions>
<step>Mix all ingredients and knead thoroughly.</step>
<step>Cover with a cloth and leave for one hour in a warm room.</step>
<step>Knead again, put on a baking sheet and put in the oven.</step>
<step>Visit the site</step>
</instructions>
</recipe>


Everything starts with a call to the common function webi_xml("1.xml");
Inside this function, the parser is created and case folding is switched on, so that all tag names are converted to upper case and have the same case.

$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);

Now we specify which functions will catch the opening of a tag, its closing, and the data:

xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "data");

Next, the specified file is opened and read one character at a time; each character is appended to a string variable until the character > is found.
If this is the very first read from the file, everything superfluous at the beginning of the file, i.e. everything before <?xml, is removed along the way; this is the tag XML should start with.
The first time, the string variable collects the XML declaration, <?xml ... ?>,

and it is sent to the parser:
xml_parse($xml_parser, $data, feof($fp));
After the data is processed, the string variable is reset and collection starts again. The second time, the string is the opening tag of the root element.

The third time it is
<title>
and the fourth time
Simple bread</title>

Note that the string variable always ends with a completed tag (>), and it is not necessary to send the parser an opening and a closing tag together with their data, for example
<title>Simple bread</title>
What this script takes care of is that each portion ends on a whole, unbroken tag: it may hold a single opening tag with the closing tag arriving at the next step, or 1000 lines of the file at once. In other words, it never sends portions cut like this:
<tit
le>Simple bread
The script deliberately avoids sending data in this form, with the tag broken.
You can come up with your own way of sending data to the parser, for example collecting 1 megabyte of data at a time to increase speed; the rule followed here is that tags are always complete, while the text data may be split:
<title>Simple
bread</title>

Thus you can send a large file to the parser in whatever portions you like.
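
One way to send larger portions is sketched below, under an assumption: the PHP manual's own examples pass fixed-size fread() chunks straight to xml_parse(), which suggests that the parser keeps an unfinished tag until the next call. If that holds for your PHP build, the reader does not have to look for > itself. The function name parse_in_chunks, the chunk size and the in-memory stream are illustrative and not part of the original script.

```php
<?php
// Hypothetical helper: feeds a stream to the parser in fixed-size chunks
// and returns the element names in the order their opening tags were seen.
function parse_in_chunks($fp, int $chunkSize): array
{
    $names = [];
    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) { $names[] = $name; },
        function ($parser, $name) {}
    );

    while (!feof($fp)) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false) {
            break;
        }
        // feof() turns true once a read hits the end of the stream, so the
        // final call (possibly with an empty chunk) is flagged as the last one.
        if (!xml_parse($parser, $chunk, feof($fp))) {
            throw new RuntimeException(
                xml_error_string(xml_get_error_code($parser))
                . ' at line ' . xml_get_current_line_number($parser)
            );
        }
    }
    return $names;
}

// Assumption: php://memory stands in for a large file on disk.
$fp = fopen('php://memory', 'w+');
fwrite($fp, '<recipe><title>Simple bread</title><step>Mix</step></recipe>');
rewind($fp);

// A 7-byte chunk size guarantees that tags are cut in the middle.
$names = parse_in_chunks($fp, 7);
fclose($fp);

echo implode(' ', $names), "\n";
```

In real use the chunk size would be far larger (kilobytes or more); the tiny value here only serves to force splits inside tags.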

Now let's look at how this data is processed and how to get at it.

Let's start with the opening-tag function startElement($parser, $name, $attrs).
Suppose processing has reached the line
<ingredient amount="3" unit="glass">Flour</ingredient>
Then, inside the function, the variable $name will equal INGREDIENT, i.e. the name of the opened tag, in upper case because case folding is on (the closing tag has not been reached yet).
The array of this tag's attributes, $attrs, will also be available; it will contain AMOUNT = "3" and UNIT = "glass".

After that, the data of the opened tag is processed by the function data($parser, $data).
The $data variable will contain everything between the opening and closing tags, in our case the text Flour.

The processing of our line is completed by the function endElement($parser, $name).
Here $name is the name of the closed tag, in our case INGREDIENT.

And after that, everything goes around again.
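
The order of the three calls can be checked with a short sketch. It is not the author's script: the handlers are closures that only record what they receive in an $events array, a name introduced here for illustration.

```php
<?php
$events = []; // log of handler calls, in the order they happen

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, true);

xml_set_element_handler(
    $parser,
    // opening tag: the name and the attribute keys arrive in upper case
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start $name amount={$attrs['AMOUNT']} unit={$attrs['UNIT']}";
    },
    // closing tag
    function ($parser, $name) use (&$events) {
        $events[] = "end $name";
    }
);
xml_set_character_data_handler(
    $parser,
    // text between the tags
    function ($parser, $data) use (&$events) {
        $events[] = "data $data";
    }
);

xml_parse($parser, '<ingredient amount="3" unit="glass">Flour</ingredient>', true);

print_r($events);
```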

The example above only demonstrates the principle of XML processing; for real use it needs to be improved.
Usually, large XML has to be parsed to load data into a database, and to process the data correctly you need to know which open tag the data belongs to, what the nesting level is, and which tags are open higher up in the hierarchy. With this information, you can process the file correctly without any problems.
To do this, you need to introduce several global variables that collect information about open tags, nesting and data.
Here is an example that you can use:

<?php
function webi_xml($file)
{
    global $webi_depth; // counter to track the nesting depth
    $webi_depth = 0;
    global $webi_tag_open; // array of the tags that are open at the moment
    $webi_tag_open = array();
    global $webi_data_temp; // this array will contain the data of one tag
    $webi_data_temp = array();

    ####################################################
    ### function for working with data
    function data($parser, $data)
    {
        global $webi_depth;
        global $webi_tag_open;
        global $webi_data_temp;
        // add data to the array, indexed by the nesting level and the currently open tag
        $tag = $webi_tag_open[$webi_depth];
        if (!isset($webi_data_temp[$webi_depth][$tag]["data"])) {
            $webi_data_temp[$webi_depth][$tag]["data"] = ""; // first piece of text for this tag
        }
        $webi_data_temp[$webi_depth][$tag]["data"] .= $data;
    }
    ############################################

    ####################################################
    ### opening tags function
    function startElement($parser, $name, $attrs)
    {
        global $webi_depth;
        global $webi_tag_open;
        global $webi_data_temp;

        // if the nesting level is no longer zero, then one tag is already open
        // and the data from it is already in the array, you can process it
        if ($webi_depth)
        {
            $tag = $webi_tag_open[$webi_depth];
            $info = isset($webi_data_temp[$webi_depth][$tag]) ? $webi_data_temp[$webi_depth][$tag] : array();

            print "data " . $tag . " - " . (isset($info["data"]) ? $info["data"] : "") . "<br>\n";
            print_r(isset($info["attrs"]) ? $info["attrs"] : array()); // tag attributes
            print "<br>\n";
            print_r($webi_tag_open); // array of open tags
            print "<hr>\n";

            // after processing the data, delete it to free memory
            unset($GLOBALS["webi_data_temp"][$webi_depth]);
        }

        // now the opening of the next tag has started and further processing will take place at the next step
        $webi_depth++; // increase nesting

        $webi_tag_open[$webi_depth] = $name; // add the open tag to the information array
        $webi_data_temp[$webi_depth][$name]["attrs"] = $attrs; // now add the tag attributes
    }
    ###############################################

    #################################################
    ## end tag function
    function endElement($parser, $name)
    {
        global $webi_depth;
        global $webi_tag_open;
        global $webi_data_temp;

        // this is where data processing starts, for example, adding to the database, saving to a file, etc.
        // $webi_tag_open contains a chain of open tags by nesting level
        // for example $webi_tag_open[$webi_depth] contains the name of the open tag whose information is currently being processed
        // $webi_depth is the tag nesting level
        // $webi_data_temp[$webi_depth][$webi_tag_open[$webi_depth]]["attrs"] is the array of tag attributes
        // $webi_data_temp[$webi_depth][$webi_tag_open[$webi_depth]]["data"] is the tag data
        $tag = $webi_tag_open[$webi_depth];
        $info = isset($webi_data_temp[$webi_depth][$tag]) ? $webi_data_temp[$webi_depth][$tag] : array();

        print "data " . $tag . " - " . (isset($info["data"]) ? $info["data"] : "") . "<br>\n";
        print_r(isset($info["attrs"]) ? $info["attrs"] : array());
        print "<br>\n";
        print_r($webi_tag_open);
        print "<hr>\n";

        unset($GLOBALS["webi_data_temp"]); // after processing the data, delete the entire array with data, since the tag was closed
        unset($GLOBALS["webi_tag_open"][$webi_depth]); // remove information about this open tag, since it has closed

        $webi_depth--; // decrease nesting
    }
    ############################################

    $xml_parser = xml_parser_create();
    xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);

    // specify which functions will work when opening and closing tags
    xml_set_element_handler($xml_parser, "startElement", "endElement");

    // specify a function for working with data
    xml_set_character_data_handler($xml_parser, "data");

    // open the file
    $fp = fopen($file, "r");

    $perviy_vxod = 1; // flag to check the first read from the file
    $data = ""; // here we collect data from the file in parts and send it to the xml parser

    // loop until end of file is found
    while ($fp and !feof($fp))
    {
        $simvol = fgetc($fp); // read one character from file
        $data .= $simvol; // add this character to the data to send

        // if the character is not the end of a tag, go back to the beginning of the loop
        // and add another character to the data, and so on until the end of a tag is found
        if ($simvol != ">") { continue; }
        // the end of a tag was found, so now send the collected data for processing

        // on the first read from the file, delete everything that is before the <?xml tag,
        // since sometimes garbage can be found before the beginning of XML
        // (clumsy editors, or the file was received by a script from another server)
        if ($perviy_vxod) { $data = strstr($data, "<?xml"); $perviy_vxod = 0; }

        // now we throw the data into the xml parser
        if (!xml_parse($xml_parser, $data, feof($fp))) {

            // here you can process and get validity errors...
            // as soon as an error is encountered, parsing stops
            echo "<br>XML Error: " . xml_error_string(xml_get_error_code($xml_parser));
            echo " at line " . xml_get_current_line_number($xml_parser);
            break;
        }

        // after parsing, discard the collected data for the next step of the loop
        $data = "";
    }
    fclose($fp);
    xml_parser_free($xml_parser);
    // remove global variables
    unset($GLOBALS["webi_depth"]);
    unset($GLOBALS["webi_tag_open"]);
    unset($GLOBALS["webi_data_temp"]);
}

webi_xml("1.xml");

?>

The whole example is accompanied by comments; now test and experiment.
Note that in the data function the data is not simply assigned to the array but appended with ".=", because the data may arrive in pieces, and with a simple assignment you would occasionally get only a fragment.
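
The same bookkeeping can also be kept in an object instead of global variables. The following is only a sketch of that alternative design, not the author's method: the class name ElementCollector, its properties and the "path => text" result format are all hypothetical.

```php
<?php
// Hypothetical helper class: keeps the stack of open tags and the text
// of each open element in properties instead of global variables.
class ElementCollector
{
    private $stack = [];  // names of the currently open tags
    private $text = [];   // text collected so far, one entry per open tag
    public $closed = [];  // "path => text" pairs, filled as tags close

    public function start($parser, $name, $attrs)
    {
        $this->stack[] = $name;
        $this->text[] = '';
    }

    public function data($parser, $data)
    {
        // Text may arrive in several pieces, so append instead of assigning.
        if ($this->text) {
            $this->text[count($this->text) - 1] .= $data;
        }
    }

    public function end($parser, $name)
    {
        $path = implode('/', $this->stack);
        $text = trim(array_pop($this->text));
        array_pop($this->stack);
        if ($text !== '') {
            $this->closed[$path] = $text;
        }
    }
}

$collector = new ElementCollector();
$parser = xml_parser_create();
xml_set_element_handler($parser, [$collector, 'start'], [$collector, 'end']);
xml_set_character_data_handler($parser, [$collector, 'data']);

// Assumption: a short inline document fed in two pieces, cutting the text in half.
xml_parse($parser, '<recipe><title>Simple ', false);
xml_parse($parser, 'bread</title></recipe>', true);

print_r($collector->closed);
```

Because the text is appended piece by piece, the title comes out whole even though it was split between two xml_parse() calls.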

Well, that's all: there is now enough memory to process a file of any size, and the script's running time can be increased in several ways.
At the beginning of the script, insert the function
set_time_limit(6000);
or
ini_set("max_execution_time", "6000");

Or add this line to your .htaccess file
php_value max_execution_time 6000

These examples increase the script's running time to 6000 seconds.
You can increase the time in this way only when safe mode is disabled.

If you have access to php.ini, you can increase the time with
max_execution_time = 6000

For example, on the Masterhost hosting, at the time of writing, increasing the script time is forbidden even though safe mode is disabled; if you are a pro, you can build your own PHP on Masterhost, but that is beyond the scope of this article.

In the previous article we created an XML document, and I promised that in the next article we would parse it. Today I will show you how to parse an XML document in PHP.

I propose to parse the document we created in the previous article and simply output its data to the browser. Here is the script code:

<?php
$dom = new DOMDocument("1.0", "utf-8"); // Create an XML document, version 1.0, utf-8 encoding
$dom->preserveWhiteSpace = false; // Skip whitespace-only text nodes so that only elements are iterated
$dom->load("users.xml"); // Load the XML document from the file into the DOM object
$root = $dom->documentElement; // Get the root element
$childs = $root->childNodes; // Get the children of the root element
/* Iterate over the received elements */
for ($i = 0; $i < $childs->length; $i++) {
    $user = $childs->item($i); // Get the next item from the NodeList
    $lp = $user->childNodes; // Get the children of the "user" node
    $id = $user->getAttribute("id"); // Get the value of the "id" attribute of the "user" node
    $login = $lp->item(0)->nodeValue; // Get the value of the "login" node
    $password = $lp->item(1)->nodeValue; // Get the value of the "password" node
    /* Print the received data */
    echo "ID: $id<br />";
    echo "Login: $login<br />";
    echo "Password: $password<br />";
    echo "-----------------------<br />";
}
?>

From this code, you should understand not only how to parse an XML document in PHP, but also that the parsing process itself depends on the structure of the document. That is, you must know the structure, otherwise parsing such a document will be problematic. I once wrote that the main feature of XML is the strictness of its syntax. I hope you now understand why this is so important. Without this "rigidity" it would be extremely difficult to parse documents, and parsing is needed very often, most simply when importing data from an XML file and then placing it in a database.
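
To show how much the code leans on the document structure, here is a self-contained sketch of the same idea. The structure of users.xml is an assumption inferred from the script (a root element with user children that carry an id attribute and hold login and password), and the sample values are invented.

```php
<?php
// Assumption: the structure of users.xml is guessed from the script above;
// the logins and passwords are made up for the example.
$source = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<users>
  <user id="1"><login>User_1</login><password>123456</password></user>
  <user id="2"><login>User_2</login><password>qwerty</password></user>
</users>
XML;

$dom = new DOMDocument();
$dom->loadXML($source);

$users = [];
// getElementsByTagName() returns only elements, so the whitespace text
// nodes that childNodes would include are never visited.
foreach ($dom->getElementsByTagName('user') as $user) {
    $users[] = [
        'id'       => $user->getAttribute('id'),
        'login'    => $user->getElementsByTagName('login')->item(0)->nodeValue,
        'password' => $user->getElementsByTagName('password')->item(0)->nodeValue,
    ];
}

foreach ($users as $u) {
    echo "ID: {$u['id']}, Login: {$u['login']}, Password: {$u['password']}\n";
}
```

Looking elements up by name rather than by position makes the code a little less sensitive to formatting, but it still has to know which tags to expect.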