[Back to blog]
[Link] tags: code NASM assembly breakout
05 June 2018
Progress is a bit slow because I'm coding in assembly, which I'm not experienced with, but doubly so because it is exam season.
However, I have now managed to draw image files, flat color rectangles and the game supports input now, so I can move the player from one side to the other, which is pretty much the entirety of user input in Breakout.
For user input, I'm reading the correct /dev/input device, using the poll syscall to check if data is available, and then just the read syscall to read if it is. The Linux input API is decently simple, as it pretty much is only one data structure, for each event.
In the image, the entire screen is first filled with an image I drew, and then that red box is the player pad. I do not know exactly what the artifacting along the lines, in the image are. My guess is it has to do with the blending. I intend to figure that out at a later point.
The movement is currently too sharp, and I think that's because I do not do any blending at all, so the player pad jumps from line to line. I'm quite excited to test out different blending techniques, and seeing the result. I do not know anything about blending, but I want to try on my own first, and my initial guess is that it's a pretty simple case of converting a continuous variable to a discrete one, with steps.
[Link] tags: C code breakout blog
02 June 2018
I have managed to get my breakout clone to draw images into a window, and so I would like to post pictures alongside my blog posts from now on. This meant adding features to my blog software.
There were two major features that I had been excited about implementing. One is images, and the other is tags. I implemented both.
When I was thinking about how to implement image support, I had the idea that I wanted my entire blog to be just one file, and therefore one connection, instead of several connections, one for each image and the blog. I'm not an expert in networking, but I do know about the caravan concept, and I've read some about how connections take time to setup. I also decided this approach, because I've unfortunately had some web development experience, and knew that it was possible to embed image data directly into the img tag, using base64 encoding.
Once I finished implementing the base64 embed, I decided to test it against just linking to a PNG file, and it turns out that there was no speed difference. I'm not used to analysing network speeds, and I just used my browsers dev tools, but it looks like the browser just waits a bunch of milliseconds to draw the embedded base64 image. The way I read it from the "waterfall", is that downloading the HTML data is about the same, but then one version takes a decent amount of time to connect, and download the PNG, while the embedded version just waits for the same amount of time as the connection would have taken.
This is disappointing because with the embedded version, the source of my website becomes much worse to read and people can't download my images, unless they copy them, and paste them into some software like The GIMP.
I have to conclude that I had a bad idea, but I'm going to keep the embedded version, because then the code gets some use and I don't want to make this project a chore. So far it has been a wonderful project that brings result, with almost no effort.
It became clear to me that instead of any complexity, I could just create a file for every tag page. So now I have a tags subdirectory, with a file for each tag, and each file contains the entire HTML necessary to work correctly. This means that everything gets copied however many times is necessary, but I consider it worth the price, when considering the savings on complexity.
It also gave me an exercise in operating on data. I decided that when each tag is parsed, a linked list of tags get searched and when a hit is made, a pointer to the blog entry is added to that node, or a new node gets added. Then when I generate the tag files, I go through the list, create a file based on the name, create a new AST for each entry on that tag node and then I run the exact same AST writing code that creates the full blog on that file.
One other major change I made is to make my parser two step. I do not know if I will keep it this way, but I've really wanted to make a two step parser for some time, and so I just decided to make that an excuse for this one.
[Link] tags: code NASM assembly breakout
Date lost forever
I have been writing this NASM Breakout clone to draw to fbdev on Linux. This means that in order to run it, I have to jump over to another TTY than the one I'm running X on, and run my application from there. I'm doing it this way, because at first I wanted to write a project that draws to fbdev in C, but then it made sense to also learn assembly, because that's something I've been wanting to do for some time.
I want to make a game that draws to fbdev for many reasons that I don't feel like explaining. It just feels depressing to have to explain myself all the time.
However at this point it's getting in the way that I have to switch TTYs. For example, it's getting in the way of debugging. So my next idea is to write a program in C that draws a memory mapped file onto a window, with help from SDL. I could also call the assembly code from C, or I could call C code from assembly. I'm not certain I'm doing the best thing, but drawing memory is easy.
And it turned out to be pretty easy. I had a wrong assumption that mmap would resize the file for me, but after learning that it didn't, it was smooth sailing, and now my SDL program updates at vsync, which means that any draws I make from my real program, I can see in the SDL window.
With this working I wrote my first drawing code. Previously I had just created image files that fit my window size, and done a memcpy from the image data to my frontbuffer. Now I have a render_color_rect function that takes as input xmm0, esi and rdx. xmm0 is rectangle data, with each of the four components being a doubleword in size. esi is color data, with each of the four components being a byte in size, and rdx is a pointer to the memory we write to.
I'm writing this according to the System V Application Binary Interface covention, but I'm not sure how to deal with SSE registers. I just guessed that the second register argument still should be rsi.
My drawing code just uses rep stos, because I'm not yet sure how to deal with the fact that SIMD code would handle 4 colors at a time.
[Link] tags: C code design breakout blog
8 May 2018
When I started writing this blog, I immediately wrote a program in C that I call "blog_composer", which in the beginning just iterated through a directory, non-recursively, and combined each file into one HTML file. I copied a quicksort algorithm from Wikipedia, so that the files get ordered before I read from them, meaning that the file with the name that has the smallest value, is the first blog post. Then I just named my files blog001, blog002, etc.
It turns out this was not sufficient, because HTML does not recognise the newline character (\n) as a new line, and so I had to do some parsing. I decided to do the easiest thing, and just replace the \n character, with the br tag. But the br tag is more characters than \n, and my code just read the file into a buffer. I then decided to just put the work in, and write a tokenizer and AST tree.
Fortunately I did it in a short day, because I've done this kind of work before, and so now my code sets a start position, then reads until EOF or it finds \n. It then sets that as an end position, creates a "token", which is just a range in the buffer, and it adds the token to an array, and one token for \n, if it wasn't EOF. It then sets the start position again, and repeats.
Once the entire file is tokenized, I read the tokens and generate nodes for the AST from that.
This worked, and I felt I could write for some time now at least, without it bothering me. But I had URLs in my first post, and it was disapointing that those weren't clickable, so I couldn't stop thinking about improving it immediately after.
A day or two later I'm again at the code, and now I've decided to properly tokenize it. Every word is a token and they get added to an array. Tokens are seperated by whitespace, and then I have a special test for '\n' that adds an extra token for newline, after adding the word. I was a bit worried that the code would get very complicated, because I don't want every token to be an AST node, I instead want one node for a sequence of text, then one node for a new line, then a text node again, etc. Surprisingly, the code for this turned out much more elegant than I had anticipated. I don't know how to describe it without going into detail about the code, which I do not want to do, but what I realized was that a node only has to be added if the token is a known token, e.g. new line, or it is not and we are not currently reading text.
With this code working, I had an easy time adding URL support. I did not want to have to write an URL tag everytime I wrote an URL, instead I wanted the parser to just match "http" and then recognise that the rest of this token is a URL. This works for http links and https, but it also meant that I had to have specific code for URLs, which wasn't too bad.
At this point I had no styling information on my blog, and I didn't want any due to load times. However I came upon this - http://bettermotherfuckingwebsite.com/ - and despite the fact that I consider that website very ugly, which makes their arrogance look laughable, I realized that I'm an outlier and that some of the things written there were true. I get this because I've seen the reactions, and heard the opinions of people around me. Personally I really like https://danluu.com and http://nothings.org
One thing from the "better website" website, I consider a good argument is that people are less inclined to read something, if each sentence is as big as the width of your screen. It feels like you are taking on an arduous task.
However, as I said, I consider their example ugly, and that's mostly because of how little of the width they use. So I decided instead to use something I learned lightly about quite some time ago, the golden ratio. The golden ratio is supposedly the ratio something has to have, compared to the full size, so that humans enjoy looking at it. More importantly it gave me a goal, and so I set out to calculate what it means for my blog. The definition for the golden ratio is (a + b)/a = a / b, and I know that (a + b) = 1, because that's 100%, which is the entire screen. So I solved 1/a = a/b, by observing that b = a^2 solves it, and then I put a + a^2 = 1 into WolframAlpha, and got the two roots. I'm only interested in the positive root, and it turns out it's 0.618... while the golden ratio is 1.618. This means that if (a + b) = 1, then a = (the golden ratio) - 1. Ok, it looks like I way overcomplicated things. But although it looks easy, I can't figure out the easier solution.
The conclusion was that I just went with width = 62%.
I decided to use all the other style information I got from the "better website" website, mostly because I don't want to do this stuff. But I can also see the argument for lighter font colors, because I have been convinced previously about the idea that you want to keep the baseline in a sort of medium, or moderate position. This lets you to make things louder or more silent.
I was convinced about this idea from people on the internet who mix music.
[Link] tags: code NASM assembly breakout
7 May 2018
The next thing I need to do is convert 32 bit floating point RGBA data to byte sized RGBA data. The process should be as simple as multiplying each 32 bit floating point element with 0xFF, converting to integers, and then shuffling the data such that I get packed 8bit RGBA data. With all I learned from converting Farbfeld 16bit RGBA to 32bit floating point, this should come much easier.
My assumption turned out to be mostly true.
Because this conversion converts 32bit values down to 8bit values, I used 4 SIMD registers in parallel, and then OR'd them together in the end, to get one 128bit register with packed 8bit RGBA data.
I wrote a simpler version of my previous tool, to convert 8bit color data to PNG. This helped confirm the algorithm. I then focused on getting it to display correctly to the screen. I got incorrect colors when packing them as RGBA, so I wrote shuffle masks for RGBA, ARGB, BGRA and ABGR, and tested each one until I got the desired result. Turns out FBdev on Linux requires BGRA.
It was nice and reassuring that writing this code was much faster for me. It implies improvement.
[Link] tags: code NASM assembly breakout
3 May 2018
So let's try this...
I've just read https://sites.google.com/site/steveyegge2/you-should-write-blogs about why everyone should blog, and it has given my curiousity a tool to beat my shyness. I'm also working on something interesting, and I found myself quite excited to write about it. I found the link from https://danluu.com
I've been trying to convert a Farbfeld image file, which is 16bit color information packed as RGBA, to 32bit floating point data in memory. The catch is that I'm doing it in NASM, to learn how to write and read assembly. It turns out that this was also the last straw to a very exciting improvement to my skills.
So my first process was to write the entire conversion from 32bit floating point color to 8bit color in NASM, which I do not know how to code in. This resulted in me having two pieces of code I struggled to understand. The way I've been writing this code in NASM is by asking in ##asm on freenode and the intel documentation on instructions and architecture. So I get tidbits of information, and I write the code I think should work, and then try it, until it does. However, this conversion code is more complicated and my methods aren't very good.
I figured out immediately that I need to isolate the first conversion code, and so my first improvement as a programmer presented itself; I wrote a tool in C that converts 32 bit floating point to bytes, and outputs a PNG. This is an obvious thing, but it is one of the first times I drop my project, to write a tool to help me out with the project.
So now I'm reading in the farbfeld data, running it through my guess code, which is the code I guessed would work, and I write it to a file. I then put that file through my new tool, and see how nothing works. This was to be expected, but the way I try to fix it, is by running GDB on my tool written in C, and printing the values I'm processing there while changing the NASM code. But I'm stuck.
Keep in mind that this takes days. I'm only working a few hours some days, because I am also a student.
Here I want to explain what I'm trying to do. Because Farbfeld is 16 bit RGBA data, and I need 32 bit RGBA floating point data, I'm trying to read the Farbfeld data into 128bit SIMD registers, and then expand the 16 bit data to 32 bit, using PSHUFB, convert to floating point using CVTDQ2PS, and then converting the values into the [0;1] interval by dividing each 32 bit floating point by 0xFFFF, the max value for 16 bit. The follow up is then that I can modify the image data, sRGB correct it and finally multiply the values with 0xFF, which gives me 8bit RGBA data.
I do not know how to write NASM, and I do not know how to write SIMD code. So I have these 128bit registers I'm filling with data, using MOVAPS, and I'm trying to write a divisor that will divide each 32 bit element correctly. I'm also trying to shuffle the 16 bit elements correctly, so that they become 32 bit integers with the same value, and I'm trying to convert this to floating point. At this point I don't know if I'm doing any of these steps correctly.
So to simplify my issue, I first simplify my input. Instead of the actual Farbfeld image, I'm now just writing a 128bit constant, and I learned from ##asm that 0x0F0E0D0C0B0A090807060503020100 is the identity for the PSHUFB instruction, which makes sense. However I'm still getting rubbish output, and my process requires that I convert to 32 bit floating point, because that's what my other tool converts to 8bit.
I've only ever looked at data as a hex dump a few times, and each of those times it's just been something I opened up in Vim, put me in Hex mode with :%!xdd and instantely I see if the data makes sense or not. This time is different. This time it doesn't open in Vim, because the file is too large.
To solve this I found wxHexEditor via googling. I was skeptical about it at first, but it read my file instantily and even rereads the file whenever my program writes new data to it. I can say I was pleasantly surprised, and it can be found here - https://www.wxhexeditor.org/
So now that I'm able to read the data directly, I can eliminate most parts of the code and only have the read and write left. This means that I should have a repeating pattern of the constant I'm reading from, and that is what I get. So MOVAPS and MOVDQA are working correctly.
Next I need to learn about the shuffle. Turning the shuffle code on and off gives me the same result, which means that the identity mask is correct. However, at this point I'm the bearer of two incorrect assumptions. So, in my .data section I have these constants,
g_move_16_to_32_shift_mask_1: dq 0x0F0E0D0C0B0A0908,0x0706050403020100
g_move_16_to_32_shift_mask_2: dq 0x0F0E0D0C0B0A0908,0x0706050403020100
I need two masks, because when converting from 16bit to 32bit, I'm doubling the amount of data necessary. So I read the same 16 bit data into two XMM registers, and I shuffle one of the registers with the first mmask, and the other with the second. This way, one of registers will end with the first half of the data, and the other register with the second half.
The PSHUFB instruction takes two operands, where the second operand is the mask. It then shuffles each byte in the first operand, according to the mask. The mask works as follows. Each byte of the mask determines what data will be put into the byte, of the same position in the first operand. Also if the most significant bit, of any byte is 1, then that byte in the first operand will be zero. The least significant 4 bits, of each byte, determine which other byte will be copied into the location of the current byte. Notice that 4 bits max out at 15, so there are 16 options, and in 128 bit registers, there are 16 bytes.
So my first incorrect assumption was that the least significant 4 bits of a byte, determined where the current byte's value from the first operand, would be copied to. Secondly, and anyone who's written in NASM before might have caught this, my constant is incorrect. I wanted 0x0F0E0D0C0B0A090807060503020100, but what I've put into my constant is 0x070605030201000F0E0D0C0B0A0908. Which means that when I thought I had proven that I had the correct identity map previously, I really messed up.
I spent some time in my new hex editor and tried different things with the mask constant, until I understood how it worked, and then I came up with this result,
g_move_16_to_32_shift_mask_1: dq 0xFFFF0302FFFF0100,0xFFFF0706FFFF0504
g_move_16_to_32_shift_mask_2: dq 0xFFFF0B0AFFFF0908,0xFFFF0F0EFFFF0D0C
As it turns out, once I got the shuffle correct, everything (almost) worked, because the division and conversion code was already correct. So, why almost. Well about the last 1/3 of the data is rubbish data.
So first I verify that writing to PNG is correct, and it is. Then I verify that my size math is correct, and it all looks correct. If so, then the input data must be incorrect. To test this, I read again from a constant, and it turns out that the entire image is written correctly if I just write a constant.
This leads me to my next debug attempt. I make the input image be only 0xFF0000FF, and then I run a INT3 when my conversion code gets anything other than 0xFF0000FF. (Krita doesn't let me set color values, so I switch to The Gimp. I switch between these two all the time, and I don't know why. I really need better tools)
So I have a linear memory allocator, and I allocate memory for the 32 bit floating point data, and then for the Farbfeld file. I hadn't allocated enough for the first part, so I got a race condition. Also I was aligning the loop counter to 16 bit, which was an error.
The code for the test was,
movaps xmm6,xmm4 ; for testing
pcmpeqq xmm6, [g_test_value]
sub rsp, 16
movdqu [rsp], xmm6
mov rax, [rsp]
cmp rax, 0xFFFFFFFFFFFFFFFF
mov rax, [rsp + 8]
cmp rax, 0xFFFFFFFFFFFFFFFF
So now I have working code for converting Farbfeld to 32 bit floating point. Next up is conversion from floating point to bytes, and then in the future I want to make optimization tests. But right now I don't have the code for timing it.