Crafting a basic Ethernet MAC and 10BASE-T PHY

Now that I’ve got my FPGA back from my friend (who went from not knowing Verilog to making small CPUs in a single weekend!) it’s time to do something with it. I got this little bad-boy because I had a couple of fairly specific, yet simple, network processing applications in mind that would benefit from using FPGAs rather than general CPUs. Looking around for FPGA engineers with some network knowledge who are willing to work for free didn’t yield any results. But the world of digital systems had grabbed my attention, so I decided to buy an FPGA and start learning more for the enjoyment. I went shopping, humble beginnings were called for, and so I selected the Basys 2 board from Digilent.

My 7x7mm package

(They say size doesn’t matter, but when trying to process 40 Gbps of data it does. If anyone has a sexy Stratix V or Kintex/Virtex 7 development kit gathering dust: my email address is to the right!)

Enter 10BASE-T

Now that making LEDs blink (the veritable hello world of the FPGA programming world) is out of the way, it’s time for me to do something with an actual packet. Only problem is my little guy doesn’t have any Ethernet ports: no dedicated PHYs or MACs (and the Xilinx MAC requires a license). So what to do? Well obviously I need to make my own MAC and PHY. I can’t just bake up an external PHY in the kitchen, so I’ll need to implement one on the FPGA, and that means working within the FPGA’s I/O capability. That’s where the granddaddy of Ethernet, 10BASE-T, comes in.

The thing about this beautiful, two-decade-old section of the 802.3 standard is it’s really quite simple. Particularly its PHY. It uses 2.5 volts of differential signalling, and Manchester encoding a 10 Mbps stream needs only a 20 MHz clock. All within the capabilities of my FPGA. So without further ado it’s time to cut the end off a CAT5 cable and start jamming wires into I/O ports.

Now enhanced with Ethernet port

If following along at home you may also want to set your I/O ports to IOSTANDARD = LVCMOS25 in your UCF. I’m basically trusting that the NIC on my laptop will behave and won’t try to fry my FPGA (the FPGA does have a little bit of protection on the I/O ports). I’ll also set my laptop to 10/Full to make my life easier, rather than having to implement auto-negotiation, collision detection, jamming, and all that fine stuff.

So how are we going to attack this thing? Well, ye olde 10BASE-x is composed a bit differently than the later versions. In later standards the IEEE adopted a cleaner demarcation between the layer 2 (MAC) and layer 1 (PHY) services. But with 10BASE-x these are more tightly coupled. Here’s the picture:

802_3_diagram

The MAC is pretty standard and doesn’t change between versions. It interfaces with the PLS, which generates the physical layer signal and transmits it over the AUI interface to the Physical Medium Attachment (PMA) unit. The PMA is a dumb piece of electronics that converts the AUI interface to the wiring of your choice. Speaking of arcane wiring choices, anyone remember these:

10base2_t-piece

With 10BASE and its “T” medium, the twisted pair wiring is just connected through to the AUI interface. So, in reality there is no PMA (insert Neo reference). To build this thing I’m going to work backwards from the wires, to the PLS, and the MAC. I’ll start by making my FPGA bring the link up, and then transmit a packet.

What’s the frequency?

We need a 20 MHz transmit clock for the PLS. The Spartan 3E has a few Digital Clock Managers (DCMs) that can be used to produce divided, multiplied, or phase shifted clocks from a single source (more on the use of phase shifting when we deal with receiving data). The clock source on my board is 50 MHz, so below I’ve used a DCM to divide it by 2.5 and produce a single 20 MHz clock, and I’ll use that clock for everything on the FPGA to keep my timing simple.

    /* Generate the 20 MHz transmit clock. We'll drive everything from
     * this transmit clock to keep it simple. */
    
    wire clkdv;
    wire mclk0;
    wire locked;
    wire clk;
    
    DCM_SP #(
        .CLKIN_PERIOD(20.0),
        .CLKDV_DIVIDE(2.5),
        .CLK_FEEDBACK("1X")
    ) DCM_SP_inst (
        .CLKIN(mclk),
        .RST(rst),
        .CLK0(mclk0),
        .CLKFB(mclk0),
        .CLKDV(clkdv),
        .LOCKED(locked)
    );
    
    BUFG BUFG_inst (.I(clkdv), .O(clk));

Using different clock domains requires delving into the art of clock domain crossing: dealing with the fact that one domain is sending data at a faster or slower rate than the other domain is able to clock it in. Generally, using multiple clocks should be avoided if possible and at this point there is no good reason for us to do it. While I’m transmitting and receiving 1 bit every two clock cycles, I will also be serializing and deserializing that data into an 8-bit word width for processing in the MAC and the rest of the FPGA. So the data path through the rest of the FPGA can carry up to 8 × 20 MHz = 160 Mbps. This illustrates why, although a Virtex 7 FPGA running at 400 MHz sounds puny compared to the 3 GHz CPU in your computer, its massive parallelism makes it extremely powerful for non-general tasks.
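As a quick sanity check on the numbers above, here is a small Python sketch that redoes the arithmetic. The constant names are my own and exist only for this illustration; nothing here is part of the FPGA design.

```python
from fractions import Fraction

BOARD_CLOCK_HZ = 50_000_000      # oscillator on the board (from the text)
CLKDV_DIVIDE = Fraction(5, 2)    # the 2.5 passed to the DCM as CLKDV_DIVIDE

# Value for CLKIN_PERIOD, in nanoseconds.
clkin_period_ns = Fraction(1_000_000_000, BOARD_CLOCK_HZ)

# Frequency on the DCM's divided output.
tx_clock_hz = BOARD_CLOCK_HZ / CLKDV_DIVIDE

# Manchester encoding spends two clock cycles on every bit.
line_rate_bps = tx_clock_hz / 2

# An 8-bit wide path clocked on every cycle.
byte_path_bps = 8 * tx_clock_hz

print(clkin_period_ns, tx_clock_hz, line_rate_bps, byte_path_bps)
```

Using `Fraction` keeps the divide-by-2.5 exact, so the comparison with the 20 MHz target is not subject to floating point rounding.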

Getting link

The next job is to build the PLS. This puppy is responsible for converting the bits from the MAC to the physical layer signal. When the MAC is transmitting, the PLS encodes the data onto the wires using Manchester encoding and 2.5 volt differential signalling. The Manchester encoding sends a single bit over two clock cycles, using a mid-bit transition from high-to-low or low-to-high to indicate a 0 or 1 respectively (this is the 802.3 convention, and it is what the encoder below produces: the complement of the bit in the first half, the bit itself in the second). This enables the receiver to detect bits despite not having a synchronized clock. Without some form of encoding like this or synchronized clocks, you couldn’t reliably detect long sequences of zeroes or ones in a transmission because there would be no changes on the wire. Manchester encoding is not very efficient though, so later standards adopted denser codes: 4B/5B for 100BASE-X, 8b/10b (in which 8 bits become 10 bits on the wire) for 1000BASE-X, and then 64b/66b (in which scrambling is also used) for 10GBASE-R.
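To make the encoding concrete, here is a minimal Python model of it. It assumes the 802.3 convention that the pls_tx encoder further down implements: the first half of each bit cell carries the complement of the bit and the second half carries the bit itself. The function names are mine, and this is a software sketch rather than anything that runs on the FPGA.

```python
def manchester_encode(bits):
    """Return two half-bit line levels for every data bit."""
    halves = []
    for bit in bits:
        halves.append(1 - bit)  # first half: complement of the bit
        halves.append(bit)      # second half: the bit itself
    return halves


def manchester_decode(halves):
    """Recover data bits by reading the second half of each bit cell."""
    if len(halves) % 2:
        raise ValueError("need an even number of half-bit samples")
    bits = []
    for first, second in zip(halves[0::2], halves[1::2]):
        if first == second:
            raise ValueError("missing mid-bit transition")
        bits.append(second)
    return bits


data = [1, 0, 1, 1, 0, 0]
line = manchester_encode(data)
print(line)
```

Note how the runs of identical bits at the end of `data` still produce a transition in every bit cell, which is exactly what lets a receiver stay in step without a shared clock.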

When it is not transmitting, the PLS is responsible for sending the IDL signal. IDL starts at the end of a transmission with the end of frame delimiter (EFD), which is two bit-times of the differential signal asserted with no transitions. This means that the receiver will detect no transitions over two bit-times and know that there is no longer a Manchester encoded signal being transmitted. That is followed by a repeating pattern of 16 ms of silence and the link test pulse. Silence means that the PLS turns off the wires, so that the differential voltage drops to 0 volts. (Obviously that allows for other transmitters to sense the medium is free and start transmitting if we were using a shared medium.) The link test pulse turns the wires back on so there is 2.5 volts differential for one bit-time.

The IDL pattern can be interrupted by the MAC at any time.
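The IDL timings translate into clock cycle counts as follows. This Python sketch just does the unit conversion for a 20 MHz transmit clock; the helper name `cycles` is my own. The results line up with the timer parameters and the 4-bit efd register used in the pls_tx module below.

```python
TX_CLOCK_HZ = 20_000_000
CLOCK_PERIOD_NS = 1_000_000_000 // TX_CLOCK_HZ  # 50 ns per clock cycle
BIT_TIME_NS = 2 * CLOCK_PERIOD_NS               # 100 ns per bit at 10 Mbps


def cycles(duration_ns):
    """Convert a duration into a whole number of transmit clock cycles."""
    if duration_ns % CLOCK_PERIOD_NS:
        raise ValueError("duration is not a whole number of clock cycles")
    return duration_ns // CLOCK_PERIOD_NS


silence_cycles = cycles(16_000_000)    # 16 ms between link test pulses
pulse_cycles = cycles(BIT_TIME_NS)     # link test pulse lasts one bit-time
efd_cycles = cycles(2 * BIT_TIME_NS)   # EFD lasts two bit-times

print(silence_cycles, pulse_cycles, efd_cycles)
```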

The PLS as described in 802.3 emits an output_next signal to request the next bit. This is essentially a 10 MHz clock enable that is only generated when data_complete is deasserted. This could be used to drive a rd_en on a FIFO. But in my case I want to use a simple shift register, and it becomes more complicated to synchronize the output of a clock enable driven shift register and the PLS if the output_next signal is used to derive the clock enable. So instead I will create a 10 MHz synchronized clock enable for driving both the shift register and the PLS. The FIFO would be cool though and I may implement that later.

I’ve elected to use the same naming as the 802.3 standard, except that where the standard suggests OUTPUT_UNIT should supply a 0 or 1 or data complete signal, I’ve created a separate signal for data_complete. Think of it as an active-low enable signal.

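    module pls_tx (
        input wire clk,
        input wire rst,
        input wire ce,            // 10 MHz clock enable shared with the shift register
        input wire data_complete, // high when the MAC has nothing to send
        input wire output_unit,   // the bit to transmit
        output wire td_p,
        output wire td_n
        );

        /* Timer for generating IDL pattern: a 2 cycle (one bit-time) link
         * test pulse every 320000 cycles (16 ms at 20 MHz). */

        wire link_test_pulse;

        timer #(
            .INTERVAL(320000),
            .DURATION(2)
        ) idl_timer (
            .clk(clk),
            .rst(rst),
            .ena(data_complete),
            .irq(link_test_pulse)
        );

        /* Manchester encoded differential signalling. With ce toggling
         * every cycle and the shift register advancing on ce, ce low is
         * the first half of a bit cell (complement of the bit) and ce high
         * is the second half (the bit itself). */

        reg td;

        always @(*) begin
            case ({data_complete, ce})
                2'b00: td = ~output_unit;
                2'b01: td = output_unit;
                default: td = 1; // idle: EFD and link test pulse are driven high
            endcase
        end

        reg [3:0] efd = 0; // end of frame delimiter: 4 cycles = 2 bit-times

        /* Drive the pair while transmitting, during the EFD, and during a
         * link test pulse. Otherwise both wires are low (silence). */

        assign td_p = (~data_complete | link_test_pulse | (|efd)) ? td : 0;
        assign td_n = (~data_complete | link_test_pulse | (|efd)) ? ~td : 0;

        always @(posedge clk or posedge rst) begin
            if (rst) begin
                efd <= 0;
            end else begin
                if (data_complete) begin
                    efd <= efd >> 1;   // count down the EFD after a frame
                end else begin
                    efd <= 4'b1111;    // re-arm while transmitting
                end
            end
        end
    endmodule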
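To see the three line states (data, held high, and silence) at a glance, here is a small Python restatement of the combinational output logic in pls_tx. It is only a truth-table model with a function name of my choosing, not a simulation of the timer or the efd register.

```python
def pls_outputs(data_complete, ce, output_unit, link_test_pulse, efd):
    """Return (td_p, td_n) for one clock cycle of the pls_tx output logic."""
    if data_complete:
        td = 1                      # idle: EFD and link test pulse are high
    elif ce:
        td = output_unit            # second half of the bit cell
    else:
        td = 1 - output_unit        # first half: complement of the bit
    driving = (not data_complete) or bool(link_test_pulse) or efd != 0
    if not driving:
        return (0, 0)               # silence: no differential voltage
    return (td, 1 - td)


# Sending a 1: low then high on td_p, with td_n as the mirror image.
print(pls_outputs(0, 0, 1, 0, 0b1111), pls_outputs(0, 1, 1, 0, 0b1111))
# Idle with nothing pending: both wires off.
print(pls_outputs(1, 0, 0, 0, 0))
```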

At this point we should be able to connect our FPGA and get link up by virtue of the IDL pattern. And sure enough, we do. Witness all the crap various services on my laptop are spewing out:

Link Up

Serializing data

To feed data to the PLS one bit at a time I’ll use a shift register. This will serialize the 8-bit wide data in the MAC down to the 1-bit width of the PLS. The shift register needs to shift at half the clock rate (clk/2) because the Manchester encoding in the PLS spends two clock cycles on each bit. So we’ll use the same clock enable that drives the PLS to drive the shift register.

    module shift_register (
        input wire clk,
        input wire rst,
        input wire ce,
        input wire [7:0] in,
        output wire out,
        output wire empty,
        output wire almost_empty
        );

        reg [7:0] sreg = 0;
        reg [2:0] scnt = 0; // shifts remaining before the next load

        assign out = sreg[0]; // least significant bit goes out first
        assign empty = (ce & (scnt == 0));
        assign almost_empty = (ce & (scnt == 1));

        always @(posedge clk or posedge rst) begin
            if (rst) begin
                sreg <= 0;
                scnt <= 0;
            end else begin
                if (ce) begin
                    if (scnt == 0) begin
                        // Last bit is on the output: load the next byte.
                        sreg <= in;
                        scnt <= 7;
                    end else begin
                        sreg <= sreg >> 1;
                        scnt <= scnt - 1;
                    end
                end
            end
        end
    endmodule
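For anyone who wants to check the bit order without firing up a simulator, here is a small Python mirror of the shift register's registered behaviour. The class and function names are mine. It models only clock edges, with `ce` deciding whether an edge does anything.

```python
class ShiftRegisterModel:
    """Software mirror of the sreg/scnt state in the shift_register module."""

    def __init__(self):
        self.sreg = 0
        self.scnt = 0

    @property
    def out(self):
        return self.sreg & 1

    def clock(self, ce, data_in):
        """Apply one rising clock edge."""
        if not ce:
            return                      # nothing changes without the enable
        if self.scnt == 0:
            self.sreg = data_in & 0xFF  # load the next byte
            self.scnt = 7
        else:
            self.sreg >>= 1
            self.scnt -= 1


def serialize(byte_values):
    """Return the bit stream seen on `out` for the given bytes."""
    reg = ShiftRegisterModel()
    bits = []
    for value in byte_values:
        reg.clock(ce=1, data_in=value)  # scnt is 0 here, so this edge loads
        bits.append(reg.out)
        for _ in range(7):
            reg.clock(ce=1, data_in=0)  # the next seven enabled edges shift
            bits.append(reg.out)
    return bits


print(serialize([0xD5]))
```

Each byte comes out least significant bit first, which is the order Ethernet expects on the wire.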

As mentioned, another option here is to use a small 8:1 aspect ratio first-word-fall-through FIFO. We could then use the output_next signal from the PLS to drive rd_en on the FIFO, the empty signal from the FIFO to drive data_complete on the PLS, and the full signal from the FIFO to drive the state machine (discussed below). But for now a shift register is simpler.

Flying Spaghetti Machines

To finish off, we need to add some logic to wrap the preamble, start of frame delimiter, any required padding, and the frame check sequence (CRC32) around the frame generated by the MAC client. A state machine makes sense here.
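As a reference for what should end up on the wire, here is a minimal Python sketch of the same wrapping. It assumes zero padding up to 60 bytes (the 64-byte minimum frame less the 4-byte FCS) and an FCS sent least significant byte first, using `zlib.crc32` as the CRC32. The example frame contents are made up.

```python
import zlib

PREAMBLE = bytes([0x55] * 7)
SFD = bytes([0xD5])
MIN_FRAME_NO_FCS = 60  # 64-byte minimum frame minus the 4-byte FCS


def wrap_frame(frame):
    """Return preamble + SFD + padded frame + FCS, in transmission order."""
    padded = frame + bytes(max(0, MIN_FRAME_NO_FCS - len(frame)))
    fcs = zlib.crc32(padded).to_bytes(4, "little")  # low byte first
    return PREAMBLE + SFD + padded + fcs


# Hypothetical 42-byte frame: broadcast destination, made-up source MAC,
# IPv4 EtherType, and 28 zero bytes standing in for the payload.
example = bytes.fromhex("ffffffffffff" "020000000001" "0800") + bytes(28)
on_wire = wrap_frame(example)
print(len(on_wire), on_wire[-4:].hex())
```

A handy property for debugging: running the same CRC over the frame including its FCS always gives the constant 0x2144DF1C, so a receiver-side check does not need to know where the padding starts.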

Transmit State Machine

That gets quite a bit more elaborate:

    /* State machine to wrap a preamble, any necessary padding, and FCS
     * around a frame. */
    
    localparam idling = 0, sending_preamble = 1, sending_sfd = 2,
        sending_payload = 3, sending_padding = 4, sending_fcs = 5, waiting_ipg = 6;
    
    reg transmitting = 0;
    reg [15:0] count = 0;
    reg [2:0] state = 0;
    reg accept_data = 0;
    reg [7:0] buffered_data = 0;
    
    assign data_complete = ~transmitting;
    assign led = transmitting;
    assign dack = (empty & dvld & accept_data);
    assign crc_dvld = (empty & (state == sending_payload || state == sending_padding));
    
    always @(posedge clk or posedge rst) begin
        if (rst) begin
            transmitting <= 0;
            count <= 0;
            state <= waiting_ipg;
            crc_ena <= 0;
            accept_data <= 0;
        end else begin
            case (state)
                idling: begin
                    transmitting <= 0; // redundant
                    count <= 0;
                    
                    if (dvld) begin
                        state <= sending_preamble;
                    end
                end
                sending_preamble: begin
                    if (empty) begin
                        output_data <= 8'h55;
                        count <= count + 1;
                        
                        if (count == 1) begin // Delay transmit enable because output_data lags a cycle
                            transmitting <= 1;
                        end
                        
                        if (count == 6) begin
                            accept_data <= 1;
                            count <= 0;
                            state <= sending_sfd;
                        end
                    end
                end
                sending_sfd: begin
                    if (empty) begin
                        output_data <= 8'hD5;
                        count <= 0;
                        
                        if (dvld) begin
                            buffered_data <= data;
                            crc_ena <= 1;
                            state <= sending_payload;
                        end else begin
                            state <= waiting_ipg;
                        end
                    end
                end
                sending_payload: begin
                    if (empty) begin
                        output_data <= buffered_data;
                        count <= count + 1;
                        buffered_data <= data;
                        
                        if (~dvld | ~accept_data) begin  
                            if (count < 59) begin
                                state <= sending_padding;
                            end else begin
                                count <= 0;
                                state <= sending_fcs;
                            end
                        end
                        
                        if (count == 2000) begin
                            // Okay buddy, enough's enough.
                            accept_data <= 0;
                            count <= 0;
                            state <= sending_fcs;
                        end
                    end
                    
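                    if (empty & ~dvld) begin
                        // Assumed intent: stop acknowledging data only once the
                        // client has deasserted dvld, not after the first byte.
                        accept_data <= 0;
                    end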
                end
                sending_padding: begin
                    if (empty) begin
                        output_data <= 8'h00;
                        count <= count + 1;
                        
                        if (count == 60) begin
                            count <= 0;
                            state <= sending_fcs;
                        end
                    end
                end
                sending_fcs: begin
                    if (empty) begin
                        case (count)
                            0: output_data <= crc[7:0];
                            1: output_data <= crc[15:8];
                            2: output_data <= crc[23:16];
                            3: output_data <= crc[31:24];
                        endcase
                        count <= count + 1;
                        
                        if (count == 3) begin
                            count <= 0;
                            state <= waiting_ipg;
                        end
                    end
                end
                waiting_ipg: begin
                    if (empty) begin
                        transmitting <= 0;
                        crc_ena <= 0;
                        count <= count + 1;
                        
                        if (count == 191) begin
                            count <= 0;
                            state <= idling;
                        end
                    end
                end
                default: state <= idling;
            endcase
        end
    end

Don’t forget the FCS

As the output_data is updated, the CRC continues the next step in its calculation. See the source code for the CRC32 generator.
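If you want something to compare a hardware CRC against, here is a bit-serial CRC32 in Python. This is the standard reflected algorithm with polynomial 0xEDB88320, written out one bit at a time; it is not a transcription of my generator, just a software model to check results against.

```python
import zlib


def crc32_bitwise(data):
    """Bit-serial CRC-32 using the reflected polynomial 0xEDB88320."""
    crc = 0xFFFFFFFF                 # register starts as all ones
    for byte in data:
        crc ^= byte                  # fold the next byte into the low bits
        for _ in range(8):           # then process it one bit at a time
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF          # final inversion


check = crc32_bitwise(b"123456789")
print(hex(check), check == zlib.crc32(b"123456789"))
```

The inner loop handles one bit per iteration, which is the part a hardware generator unrolls so that it can consume a whole byte per update.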

Sending the packet

To wrap it all up we want to load a packet and transmit it. To do this I’ve created a text file to initialize into a ROM, and a timer to send the contents to the MAC every second. The packet is just a simple ICMP echo request; you can see it here.
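If you would rather generate such a file than type it by hand, here is a Python sketch that builds an ICMP echo request and formats it as one hex byte per line, a layout `$readmemh` accepts. The MAC addresses, IP addresses, and payload are all made up for illustration, and this is not necessarily how my packet.txt was produced.

```python
import struct


def inet_checksum(data):
    """RFC 1071 ones' complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(src_mac, dst_mac, src_ip, dst_ip, payload=b"hello fpga"):
    """Return Ethernet + IPv4 + ICMP echo request bytes, without padding or FCS."""
    icmp = struct.pack("!BBHHH", 8, 0, 0, 1, 1) + payload  # type 8 = echo request
    icmp = icmp[:2] + struct.pack("!H", inet_checksum(icmp)) + icmp[4:]
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(icmp), 0, 0,
                     64, 1, 0, src_ip, dst_ip)              # TTL 64, protocol 1
    ip = ip[:10] + struct.pack("!H", inet_checksum(ip)) + ip[12:]
    return dst_mac + src_mac + b"\x08\x00" + ip + icmp


frame = build_echo_request(
    src_mac=bytes.fromhex("020000000001"),  # made-up locally administered MAC
    dst_mac=bytes.fromhex("ffffffffffff"),  # broadcast
    src_ip=bytes([192, 168, 0, 2]),         # made-up addresses
    dst_ip=bytes([192, 168, 0, 1]),
)
rom_text = "\n".join(f"{b:02x}" for b in frame) + "\n"
print(len(frame))
# To produce the ROM image: open("packet.txt", "w").write(rom_text)
```

The frame length depends on the payload, so the transmit logic's idea of the last address has to agree with whatever ends up in the file.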

    /* Load our packet. */
    
    localparam num_lines = 128;
    reg [7:0] rom [num_lines-1:0]; // Blows up if 58 bytes, wtf?
    initial $readmemh("packet.txt", rom, 0, num_lines-1);
    
    /* Use a timer to transmit a packet every second. */
    
    wire start_tx;
    
    timer #(
        .INTERVAL(20000000-1),
        .DURATION(1)
    ) tx_timer (
        .clk(clk),
        .rst(rst),
        .ena(locked),
        .irq(start_tx)
    );
    
    /* Transmit packet. */
    
    localparam last_addr = 58;
    reg [7:0] addr = 0;
    
    always @(posedge clk) begin
        if (dvld) begin
            if (dack) begin
                if (addr == last_addr) begin
                    dvld <= 0;
                end else begin
                    data <= rom[addr];
                    addr <= addr + 1;
                end
            end
        end else begin
            if (start_tx) begin
                data <= rom[0];
                addr <= 1;
                dvld <= 1;
            end
        end
    end

Aaaannnnnddd… wow, it actually works:

We're receiving our ICMP echo request packets, all valid, with correct FCS!

Also, it seems there is a bug in the way Wireshark or my NIC parses frames. One or the other seems to infer the frame length from the length indicated in the IP header, and cuts it short if it’s longer. This caused much confusion while trying to work out where my FCS was going wrong and not realizing I was adding too many padding bytes.
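If you hit the same thing, one way to see how much padding you are really adding is to split the frame at the length the IP header claims. This Python sketch assumes an untagged Ethernet frame carrying IPv4, with the FCS already stripped; the function name and the example frame are mine.

```python
ETH_HEADER_LEN = 14
MIN_FRAME_NO_FCS = 60


def split_at_ip_length(frame_without_fcs):
    """Split an IPv4-in-Ethernet frame into (packet, trailing padding)."""
    total_length = int.from_bytes(frame_without_fcs[16:18], "big")
    end = ETH_HEADER_LEN + total_length
    return frame_without_fcs[:end], frame_without_fcs[end:]


# Hypothetical frame: 14-byte header, 28-byte IPv4 packet, then 20 pad bytes,
# which is two more than needed to reach the 60-byte minimum.
header = bytes(12) + b"\x08\x00"
packet = bytes([0x45, 0x00, 0x00, 28]) + bytes(24)
frame = header + packet + bytes(20)

kept, padding = split_at_ip_length(frame)
needed = max(0, MIN_FRAME_NO_FCS - len(kept))
print(len(kept), len(padding), needed)
```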

What next?

The next part of my project is to make it receive packets. This will be far more interesting because we have to recover data from the wire despite unsynchronized clocks on the sender and receiver.

And about 10,000 hours after that, I expect I’ll have one of these:

The new model KJP 9000 promises to run your network without killing your crew

The code

I’ll start using Github at some point, but in the meantime here’s the code for the playful:

6 thoughts on “Crafting a basic Ethernet MAC and 10BASE-T PHY”

  1. Monster

    Excellent work. Very didactic also.

    I am trying to do the same with my Spartan 6 board.

    Did you manage to finish the TX part?

    Regards!

  2. Kris Price Post author

    Oh those links to the src are indeed broken, they should work now. :) I didn’t finish the RX part, I got a job and gave away my FPGA. The interesting part was the method to sample data on four phases of the same clock. There’s a Xilinx document about it, if you search for something like “data phase recovery xilinx” you’ll probably find it. Let me know how you get on I’d be interested to see the results. :)

  3. Monster

    Hehe, congrats then!

    Thank you for updating the links; they work perfectly now.

    I do not have much time to do everything now from scratch (I am more a VHDL guy), but I will asap.

    I will also check the document you recommend.

    Thank you and keep up the good work!

  4. webphyfpga

    Nice work! If you’re interested in a PHY-less web-server IP core, download the WebPHY DATABUS core from http://www.webphyfpga.com. This core sends and receives data between a FPGA and web client over Ethernet using “rd” and “wr” commands over HTTP. The core features a user-customizable web page allowing browser-based control of the FPGA. The core connects to Ethernet via standard LVDS-configured IOBs on the FPGA. No external PHY or DDR/Flash memory chips, software TCP stack or embedded CPU are required – everything is contained within the core.
